====== FASTA ======
**[[http://en.wikipedia.org/wiki/FASTA_format|wikipedia: FASTA format]]**\\
[[http://www.ncbi.nlm.nih.gov/blast/fasta.shtml|NCBI's FASTA format description]]\\
\\
FASTA is a text-based format for representing nucleic acid or peptide sequences, in which nucleotides or amino acids are written as single-letter codes. Sequence names and comments may precede each sequence.
===== format information =====
* text based
* there is no standard file extension for a text file containing FASTA formatted sequences; common choices include .fa, .mpfa, .fna, .fsa, .fas and .fasta
===== Data type handled =====
* nucleic acid sequences
* peptide sequences
===== file format =====
* each record begins with a single-line description (the header line), followed by one or more lines of sequence data
* it is recommended that every line of text be shorter than 80 characters
* a sequence ends when another line starting with ">" appears (this marks the start of the next sequence) or when the file ends
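The line-length recommendation above can be sketched as a small writer. This is only a minimal illustration: the function name ''format_fasta'' and the default width of 70 columns are assumptions made for this example, not part of any FASTA specification.

```python
def format_fasta(header, sequence, width=70):
    """Return one FASTA record as text, wrapping the sequence at `width` columns.

    `format_fasta` and the default width are illustrative assumptions.
    """
    if not 0 < width < 80:
        raise ValueError("width should be between 1 and 79")
    if "\n" in header or "\r" in header:
        raise ValueError("the header must fit on a single line")
    # Drop any whitespace already present in the sequence before re-wrapping.
    sequence = "".join(sequence.split())
    lines = [">" + header]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

Calling ''format_fasta("SEQUENCE_1", long_sequence)'' would return a header line followed by sequence lines of at most 70 characters each.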
**example of a single sequence FASTA file:**
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
example of a multiple sequence FASTA file:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
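A multiple sequence file like the one above can be read with a few lines of code. The following is a minimal sketch, not a full-featured parser: the name ''parse_fasta'' is hypothetical, blank lines are skipped, and any sequence data appearing before the first ">" is silently ignored, which is an assumption of this example.

```python
def parse_fasta(text):
    """Yield (header, sequence) tuples from FASTA-formatted text.

    The header is returned without the leading ">".
    """
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
        # Lines before the first header are ignored in this sketch.
    if header is not None:
        yield header, "".join(chunks)
```

Because the function is a generator, large inputs can be processed record by record, for example with ''for header, sequence in parse_fasta(text): ...''.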
=== Header line ===
* begins with ">"
* word following is the identifier and/or name of the sequence (optional)
* rest of the line is the description (optional)
* no space between the ">" and the first letter of the identifier
* header line may contain more than one header separated by a ^A (Control-A) character
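The header rules above can be illustrated with a short sketch that splits a header line into its identifier and description, and that also handles several headers joined by Control-A. The function name ''split_header'' is an assumption made for this example.

```python
def split_header(line):
    """Split a FASTA header line into a list of (identifier, description) pairs.

    Headers joined by Control-A (character code 1) are returned as separate pairs.
    """
    if not line.startswith(">"):
        raise ValueError("a header line must begin with '>'")
    pairs = []
    for part in line[1:].split("\x01"):
        # The first word is the identifier, the rest of the line the description.
        fields = part.strip().split(None, 1)
        identifier = fields[0] if fields else ""
        description = fields[1] if len(fields) > 1 else ""
        pairs.append((identifier, description))
    return pairs
```

For the elephant cytochrome b header shown earlier, this would give the identifier ''gi|5524211|gb|AAD44166.1|'' and the description ''cytochrome b [Elephas maximus maximus]''.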
\\
* Sequence identifiers:
* Many different sequence databases use standardized headers, which helps when automatically extracting information from the header
* NCBI defined a standard for the unique identifier
* NCBI does not give a definitive description of the FASTA defline format; the following table is an attempt to summarize the identifier formats in use:
| GenBank | ''gi|gi-number|gb|accession|locus'' |
| EMBL Data Library | ''gi|gi-number|emb|accession|locus'' |
| DDBJ, DNA Database of Japan | ''gi|gi-number|dbj|accession|locus'' |
| NBRF PIR | ''pir||entry'' |
| Protein Research Foundation | ''prf||name'' |
| SWISS-PROT | ''sp|accession|name'' |
| Brookhaven Protein Data Bank (1) | ''pdb|entry|chain'' |
| Brookhaven Protein Data Bank (2) | ''entry:chain|PDBID|CHAIN|SEQUENCE'' |
| Patents | ''pat|country|number'' |
| GenInfo Backbone Id | ''bbs|number'' |
| General database identifier | ''gnl|database|identifier'' |
| NCBI Reference Sequence | ''ref|accession|locus'' |
| Local Sequence identifier | ''lcl|identifier'' |
//Note: the gi number is a series of digits that identifies an NCBI database entry.//
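The identifier layouts in the table above lend themselves to automatic extraction. The sketch below is one possible way to do this; the names ''DEFLINE_FIELDS'' and ''parse_identifier'' are hypothetical, the field names are simply copied from the table, and the second Brookhaven layout (''entry:chain|...'') is not handled.

```python
# Field names per database tag, copied from the table above.
# None marks a field that the table leaves empty (as in "pir||entry").
DEFLINE_FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_identifier(identifier):
    """Return a dict of the fields found in a pipe-separated identifier.

    Unrecognized identifiers give an empty dict.
    """
    tokens = identifier.split("|")
    result = {}
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag == "gi" and i + 1 < len(tokens):
            # "gi|gi-number" may precede a database-specific part.
            result["gi"] = tokens[i + 1]
            i += 2
        elif tag in DEFLINE_FIELDS:
            names = DEFLINE_FIELDS[tag]
            values = tokens[i + 1:i + 1 + len(names)]
            result["db"] = tag
            for name, value in zip(names, values):
                if name is not None and value:
                    result[name] = value
            i += 1 + len(names)
        else:
            break  # unknown tag or trailing empty field
    return result
```

Applied to ''gi|5524211|gb|AAD44166.1|'' from the example earlier on this page, the function would report the gi number ''5524211'', the database tag ''gb'' and the accession ''AAD44166.1''; the locus field is empty in that header and is therefore left out.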
\\
=== Sequence representation ===
* sequence data follows the header line and any comments
* each line of a sequence should have fewer than 80 characters
* Sequences may be protein sequences or nucleic acid sequences
* can contain gaps or alignment characters
* Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
* lower-case letters are accepted and are mapped into upper-case
* a single hyphen or dash can be used to represent a gap character
* in amino acid sequences: U and * are acceptable letters
* numerical digits are not allowed in the sequence itself, although some databases use them to indicate the position in the sequence
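The rules above suggest a simple clean-up step before a sequence is used. The following sketch maps lower-case letters to upper case and drops whitespace and position digits; the name ''normalize_sequence'' is an assumption, and silently discarding digits is a design choice of this example rather than something the format prescribes.

```python
def normalize_sequence(raw):
    """Return the sequence in upper case, without whitespace or digits.

    Gap characters ("-") and the stop symbol ("*") are kept unchanged.
    """
    cleaned = []
    for ch in raw:
        if ch.isspace() or ch.isdigit():
            continue  # drop line breaks, spaces and position numbers
        cleaned.append(ch.upper())
    return "".join(cleaned)
```

A line such as ''  1 acgt-nnac 11'' would be reduced to ''ACGT-NNAC''.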
\\
The nucleic acid codes supported are:
^ Nucleic Acid Code ^ Meaning ^
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |
| U | Uracil |
| R | G A (puRine) |
| Y | T C (pYrimidine) |
| K | G T (Ketone) |
| M | A C (aMino group) |
| S | G C (Strong interaction) |
| W | A T (Weak interaction) |
| B | G T C (not A) (B comes after A) |
| D | G A T (not C) (D comes after C) |
| H | A C T (not G) (H comes after G) |
| V | G C A (not T, not U) (V comes after U) |
| N | A G C T (aNy) |
| X | masked |
| - | gap of indeterminate length |
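One way to put the ambiguity codes above to use is to translate a motif written in these codes into a regular expression. This is only a sketch: ''IUPAC_NUCLEOTIDES'' and ''motif_to_regex'' are hypothetical names, and the masking code X and the gap character are rejected here because they do not stand for a fixed set of bases.

```python
import re

# Bases represented by each nucleic acid code, following the table above.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Translate a motif in nucleic acid codes into a regular expression string."""
    parts = []
    for code in motif.upper():
        if code not in IUPAC_NUCLEOTIDES:
            raise ValueError("unsupported code in motif: %r" % code)
        bases = IUPAC_NUCLEOTIDES[code]
        # Unambiguous codes stay literal, ambiguous ones become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


# Usage: find the first place where the motif G-puRine-pYrimidine occurs.
match = re.search(motif_to_regex("GRY"), "TTGACTT")
```

Here ''motif_to_regex("GRY")'' gives ''G[AG][CT]'', which matches ''GAC'' in the sample string.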
\\
The amino acid codes supported are:
^ Amino Acid Code ^ Meaning ^
| A | Alanine |
| B | Aspartic acid or Asparagine |
| C | Cysteine |
| D | Aspartic acid |
| E | Glutamic acid |
| F | Phenylalanine |
| G | Glycine |
| H | Histidine |
| I | Isoleucine |
| K | Lysine |
| L | Leucine |
| M | Methionine |
| N | Asparagine |
| P | Proline |
| Q | Glutamine |
| R | Arginine |
| S | Serine |
| T | Threonine |
| U | Selenocysteine |
| V | Valine |
| W | Tryptophan |
| Y | Tyrosine |
| Z | Glutamic acid or Glutamine |
| X | any |
| * | translation stop |
| - | gap of indeterminate length |
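The two code tables can also be used to check a sequence and to make a rough guess at its type. The sketch below is a heuristic only: every nucleic acid code is also a valid amino acid code, so a short peptide made up of letters such as A, C, G and T would be reported as nucleic acid. The names ''NUCLEIC_CODES'', ''AMINO_CODES'' and ''guess_sequence_type'' are assumptions made for this example.

```python
# Alphabets taken from the two tables above.
NUCLEIC_CODES = set("ACGTURYKMSWBDHVNX-")
AMINO_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def guess_sequence_type(sequence):
    """Return "nucleic", "protein" or "unknown" for a sequence string.

    Lower-case letters are mapped to upper case before checking.
    """
    letters = set(sequence.upper())
    if not letters:
        return "unknown"
    if letters <= NUCLEIC_CODES:
        return "nucleic"  # could still be a peptide; see the note above
    if letters <= AMINO_CODES:
        return "protein"
    return "unknown"  # contains a character found in neither table
```

For example, ''guess_sequence_type("acgtn")'' would give ''nucleic'', while the first line of SEQUENCE_1 from the example earlier on this page would give ''protein''.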
===== converter =====
[[http://iubio.bio.indiana.edu/soft/molbio/readseq/|Readseq]] for converting other sequence formats to FASTA \\
[[http://www.bugaco.com/bioinf/|Nexus to FASTA converter]]\\
[[http://gp2fasta.ovh.org/|GenBank to FASTA converter]]