Differences

This shows you the differences between two versions of the page.

--- fasta [2007/12/17 14:47] – heidi
+++ fasta [2008/07/22 13:31] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
 ====== FASTA ======
+**[[http://en.wikipedia.org/wiki/FASTA_format|wikipedia: FASTA format]]**\\
+[[http://www.ncbi.nlm.nih.gov/blast/fasta.shtml|NCBI's FASTA format description]]\\
+\\
 FASTA format is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows sequence names and comments to precede the sequences.
 ===== format information =====
   * text based
+  * there is no standard file extension for text files containing FASTA formatted sequences; common extensions include .fa, .mpfa, .fna, .fsa, .fas and .fasta
 ===== Data type handled =====
   * nucleic acid sequences
   * peptide sequences
 ===== file format =====
-===== How to cite =====
+  * each sequence begins with a single-line description, followed by lines of sequence data
+  * it is recommended that all lines of text be shorter than 80 characters
+  * a sequence ends when another line starting with ">" appears (this indicates the start of another sequence)
+**simple example:**
+<code>
+>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
+LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
+EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
+LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
+GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
+IENY
+</code>
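+The line-length recommendation above can be followed when writing records. A minimal sketch in Python, assuming a hypothetical helper name ''format_fasta'' and a default width of 70 characters (the width used in the example above, not a requirement of the format):

```python
def format_fasta(header, sequence, width=70):
    """Return one FASTA record with the sequence wrapped to `width` characters.

    `format_fasta` and its parameters are illustrative names, not part of
    any library.
    """
    if not 0 < width < 80:
        raise ValueError("width should be between 1 and 79 characters")
    # Slice the sequence into fixed-size chunks, one chunk per output line.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    # The header line starts with ">" with no space before the identifier.
    return "\n".join([">" + header] + lines) + "\n"


record = format_fasta("seq1 demo sequence", "A" * 150)
print(record, end="")
```

+The header itself is not wrapped here, so a very long description can still exceed the recommended line length.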
+example of a multiple sequence FASTA file:
+<code>
+>SEQUENCE_1
+MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
+LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
+IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
+MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
+>SEQUENCE_2
+SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
+ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
+</code>
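+A multiple sequence file like the one above can be split into records by starting a new record at every line that begins with ">". A minimal sketch in Python; ''read_fasta'' is a hypothetical helper name, and the sample text is a shortened copy of the example above:

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of text lines.

    `read_fasta` is an illustrative name, not a library function.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data: lines are joined without separators.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """\
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
"""

records = list(read_fasta(example.splitlines()))
for name, seq in records:
    print(name, len(seq))
```

+Because the function accepts any iterable of lines, an open file object can be passed in directly instead of a list of strings.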
+=== Header line ===
+  * begins with ">"
+  * word following is the identifier and/or name of the sequence (optional)
+  * rest of the line is the description (optional)
+  * no space between the ">" and the first letter of the identifier
+  * header line may contain more than one header separated by a ^A (Control-A) character
+\\
+  * Sequence identifiers:
+    * Many different sequence databases use standardized headers, which helps when automatically extracting information from the header
+    * NCBI defined a standard for the unique identifier
+    * NCBI does not give a definitive description of the FASTA defline format; the following table is an attempt to summarize the identifier formats in use:
+| GenBank                          | ''gi|gi-number|gb|accession|locus'' |
+| EMBL Data Library                | ''gi|gi-number|emb|accession|locus'' |
+| DDBJ, DNA Database of Japan      | ''gi|gi-number|dbj|accession|locus'' |
+| NBRF PIR                         | ''pir||entry'' |
+| Protein Research Foundation      | ''prf||name'' |
+| SWISS-PROT                       | ''sp|accession|name'' |
+| Brookhaven Protein Data Bank (1) | ''pdb|entry|chain'' |
+| Brookhaven Protein Data Bank (2) | ''entry:chain|PDBID|CHAIN|SEQUENCE'' |
+| Patents                          | ''pat|country|number'' |
+| GenInfo Backbone Id              | ''bbs|number'' |
+| General database identifier      | ''gnl|database|identifier'' |
+| NCBI Reference Sequence          | ''ref|accession|locus'' |
+| Local Sequence identifier        | ''lcl|identifier'' |
+//Note: the gi number is a series of digits that identifies a database entry at NCBI.//
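+The header rules and the identifier table can be combined to take a header line apart. A minimal sketch in Python, assuming that the identifier is everything up to the first space and that its fields are separated by "|"; ''split_defline'' and ''parse_header'' are hypothetical helper names:

```python
def split_defline(defline):
    """Split a header line (without the leading '>') at Control-A characters."""
    return defline.split("\x01")


def parse_header(header):
    """Return (identifier fields, description) for a single header.

    Assumption: the identifier is the first whitespace-delimited word and
    its fields are separated by '|', as in the table above.
    """
    identifier, _, description = header.partition(" ")
    return identifier.split("|"), description


fields, description = parse_header(
    "gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
)
print(fields)
print(description)
```

+For the GenBank style header used here, the trailing "|" leaves the locus field as an empty string, so the field list still lines up with ''gi|gi-number|gb|accession|locus''.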
+\\
+=== Sequence representation ===
+  * follows the header line and any comments
+  * each line of a sequence should have fewer than 80 characters
+  * Sequences may be protein sequences or nucleic acid sequences
+  * can contain gaps or alignment characters
+  * Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
+    * lower-case letters are accepted and are mapped into upper-case
+    * a single hyphen or dash can be used to represent a gap character
+    * in amino acid sequences: U and * are acceptable letters
+  * Numerical digits are not allowed but are used in some databases to indicate the position in the sequence
+\\
+The nucleic acid codes supported are:
+^ Nucleic Acid Code ^  Meaning  ^
+|       A           | Adenine |
+|       C           | Cytosine |
+|       G           | Guanine |
+|       T           | Thymine |
+|       U           | Uracil |
+|       R           | G A (puRine) |
+|       Y           | T C (pYrimidine) |
+|       K           | G T (Ketone) |
+|       M           | A C (aMino group) |
+|       S           | G C (Strong interaction) |
+|       W           | A T (Weak interaction) |
+|       B           | G T C (not A) (B comes after A) |
+|       D           | G A T (not C) (D comes after C) |
+|       H           | A C T (not G) (H comes after G) |
+|       V           | G C A (not T, not U) (V comes after U) |
+|       N           | A G C T (aNy) |
+|       X           | masked |
+|       -           | gap of indeterminate length |
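+The ambiguity codes in the table can be stored as a lookup from each code to the bases it stands for. A minimal sketch in Python; ''IUPAC_NUCLEOTIDES'' and ''matches'' are hypothetical names, and the masking code X and the gap character are left out on purpose because they do not stand for a set of bases:

```python
# Each code maps to the concrete bases it may represent (from the table above).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def matches(code, base):
    """Return True if `base` is one of the bases covered by `code`.

    Lower-case input is mapped to upper case first, as the format allows.
    """
    return base.upper() in IUPAC_NUCLEOTIDES[code.upper()]


print(matches("R", "a"))
print(matches("Y", "G"))
```

+An unknown code raises a ''KeyError'' in this sketch; a caller that has to tolerate X or "-" would need to handle those characters separately.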
+\\
+The amino acid codes supported are:
+^ Amino Acid Code ^ Meaning ^
+|       A         | Alanine |
+|       B         | Aspartic acid or Asparagine |
+|       C         | Cysteine |
+|       D         | Aspartic acid |
+|       E         | Glutamic acid |
+|       F         | Phenylalanine |
+|       G         | Glycine |
+|       H         | Histidine |
+|       I         | Isoleucine |
+|       K         | Lysine |
+|       L         | Leucine |
+|       M         | Methionine |
+|       N         | Asparagine |
+|       P         | Proline |
+|       Q         | Glutamine |
+|       R         | Arginine |
+|       S         | Serine |
+|       T         | Threonine |
+|       U         | Selenocysteine |
+|       V         | Valine |
+|       W         | Tryptophan |
+|       Y         | Tyrosine |
+|       Z         | Glutamic acid or Glutamine |
+|       X         | any |
+|       *         | translation stop |
+|       -         | gap of indeterminate length|
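+The rules for sequence representation and the amino acid table can be turned into a simple clean-up step. A minimal sketch in Python, assuming that position digits and whitespace should be dropped rather than rejected (the text above only says digits are not allowed, so this is a design choice); ''normalize_protein'' is a hypothetical helper name:

```python
# All codes from the amino acid table above, including X, * and the gap "-".
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def normalize_protein(sequence):
    """Upper-case a protein sequence, drop digits/whitespace, and validate it."""
    cleaned = "".join(
        ch for ch in sequence.upper()
        if not ch.isdigit() and not ch.isspace()
    )
    # Anything left that is not in the table is reported as an error.
    unknown = sorted(set(cleaned) - AMINO_ACID_CODES)
    if unknown:
        raise ValueError("unsupported characters: " + "".join(unknown))
    return cleaned


print(normalize_protein("mte-ita 10 AAM*"))
```

+Letters that are missing from the table, such as J and O, are rejected by this sketch.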
+===== converter =====
+[[http://iubio.bio.indiana.edu/soft/molbio/readseq/|Readseq]] for converting sequence formats to FASTA \\
+[[http://www.bugaco.com/bioinf/|Nexus to Fasta converter]]\\
+[[http://gp2fasta.ovh.org/|GenBank to Fasta converter]]