FASTA
Wikipedia: FASTA format
NCBI's FASTA format description
FASTA is a text-based format for representing nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The format also allows sequence names and comments to precede the sequences.
format information
- text based
- there is no standard file extension for a text file containing FASTA-formatted sequences; common extensions include .fa, .mpfa, .fna, .fsa, .fas and .fasta
Data type handled
- nucleic acid sequences
- peptide sequences
file format
- each sequence begins with a single-line description (the header line), followed by lines of sequence data
- it is recommended that all lines of text be shorter than 80 characters
- a sequence ends when another line starting with “>” appears (this indicates the start of another sequence) or when the file ends
simple examples:
    >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
    LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
    EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
    LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
    GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
    IENY
example of a multiple sequence FASTA file:
    >SEQUENCE_1
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
    LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
    IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
    MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
    >SEQUENCE_2
    SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
    ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
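The record layout described above (a “>” header line, then sequence lines until the next “>” or the end of the file) can be read with a few lines of Python. This is a minimal sketch, not a reference parser: the names `read_fasta` and `_make_record` are made up for illustration, blank lines are skipped by assumption, and the toy input at the end is invented data.

```python
from typing import Iterable, Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # The first word of the header is the identifier, the rest is the
    # description; both are optional, so either may be an empty string.
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # The end of the input closes the last record.
    if header is not None:
        yield _make_record(header, chunks)


# Toy input (invented for this example, not taken from a database).
text = ">seq1 first toy record\nACGT\nACGT\n>seq2\nGGCC\n"
records = list(read_fasta(text.splitlines()))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which avoids loading large files into memory.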
Header line
- begins with “>”
- the word immediately following the “>” is the identifier and/or name of the sequence (optional)
- the rest of the line is the description (optional)
- there must be no space between the “>” and the first letter of the identifier
- header line may contain more than one header separated by a ^A (Control-A) character
- Sequence identifiers:
- Many sequence databases use standardized headers, which makes it easier to extract information from the header automatically
- NCBI defined a standard for the unique identifier
- NCBI does not give a definitive description of the FASTA defline format; the following table is an attempt to summarize the identifier syntax used by each database:
| Database | Identifier syntax |
|---|---|
| GenBank | gi\|gi-number\|gb\|accession\|locus |
| EMBL Data Library | gi\|gi-number\|emb\|accession\|locus |
| DDBJ, DNA Database of Japan | gi\|gi-number\|dbj\|accession\|locus |
| NBRF PIR | pir\|\|entry |
| Protein Research Foundation | prf\|\|name |
| SWISS-PROT | sp\|accession\|name |
| Brookhaven Protein Data Bank (1) | pdb\|entry\|chain |
| Brookhaven Protein Data Bank (2) | entry:chain\|PDBID\|CHAIN\|SEQUENCE |
| Patents | pat\|country\|number |
| GenInfo Backbone Id | bbs\|number |
| General database identifier | gnl\|database\|identifier |
| NCBI Reference Sequence | ref\|accession\|locus |
| Local Sequence identifier | lcl\|identifier |
Note: The gi number is a sequence of digits that identifies a database entry at NCBI.
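The identifier syntax in the table above can be split on the “|” character and mapped to named fields. The sketch below is only an illustration under stated assumptions: `DEFLINE_FIELDS`, `split_headers` and `parse_identifier` are hypothetical names, the field lists are transcribed from the table, the empty slot in `pir||entry` and `prf||name` is simply skipped, and the second PDB form (`entry:chain|...`) is not handled. Apart from the elephant cytochrome b identifier from the example earlier on this page, the identifiers used are invented.

```python
from typing import Dict, List

# Field names that follow each database tag, transcribed from the table.
# An empty string marks a slot the table leaves empty (e.g. "pir||entry").
DEFLINE_FIELDS: Dict[str, List[str]] = {
    "gb": ["accession", "locus"],
    "emb": ["accession", "locus"],
    "dbj": ["accession", "locus"],
    "pir": ["", "entry"],
    "prf": ["", "name"],
    "sp": ["accession", "name"],
    "pdb": ["entry", "chain"],
    "pat": ["country", "number"],
    "bbs": ["number"],
    "gnl": ["database", "identifier"],
    "ref": ["accession", "locus"],
    "lcl": ["identifier"],
}


def split_headers(header_line: str) -> List[str]:
    """Split a header line holding several headers joined by Control-A."""
    return header_line.lstrip(">").split("\x01")


def parse_identifier(identifier: str) -> Dict[str, str]:
    """Map a '|'-separated identifier to named fields (best effort)."""
    tokens = identifier.split("|")
    result: Dict[str, str] = {}
    pos = 0
    while pos < len(tokens):
        tag = tokens[pos]
        if tag == "gi":
            # "gi" is followed by the gi number, then by another tag.
            if pos + 1 < len(tokens):
                result["gi"] = tokens[pos + 1]
            pos += 2
        elif tag in DEFLINE_FIELDS:
            names = DEFLINE_FIELDS[tag]
            values = tokens[pos + 1 : pos + 1 + len(names)]
            result["db"] = tag
            for name, value in zip(names, values):
                if name and value:  # skip empty slots and missing values
                    result[name] = value
            pos += 1 + len(names)
        else:
            pos += 1  # unknown or empty token: ignore it
    return result


ids = parse_identifier("gi|5524211|gb|AAD44166.1|")
```

Real-world deflines vary considerably between databases and over time, so a table-driven parser like this should be treated as a starting point rather than a complete solution.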
Sequence representation
- the sequence follows the header line and any comments
- each line of a sequence should have fewer than 80 characters
- Sequences may be protein sequences or nucleic acid sequences
- can contain gaps or alignment characters
- Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
- lower-case letters are accepted and are mapped into upper-case
- a single hyphen or dash can be used to represent a gap character
- in amino acid sequences: U and * are acceptable letters
- numerical digits are not allowed in the sequence itself, although some databases use them to indicate the position in the sequence
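The rules above (lower case mapped to upper case, digits used only as position markers, lines shorter than 80 characters) suggest a small clean-up step before sequences are written back out. The following is a sketch under assumptions: the function names are hypothetical, digits and whitespace are dropped rather than reported, and the default width of 70 is an arbitrary choice below the 80-character recommendation.

```python
def normalize_sequence(raw: str) -> str:
    """Upper-case a sequence and drop whitespace and position digits."""
    return "".join(
        ch for ch in raw.upper() if not ch.isspace() and not ch.isdigit()
    )


def format_record(identifier: str, description: str, sequence: str,
                  width: int = 70) -> str:
    """Return one FASTA record with the sequence wrapped at `width`."""
    header = ">" + identifier  # no space between ">" and the identifier
    if description:
        header += " " + description
    # Slice the sequence into fixed-width lines.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# Toy input with position numbers, as some databases print them.
cleaned = normalize_sequence("  1 acgt-nn\n 11 ACGT*")
record = format_record("seq1", "toy", "A" * 150)
```

Gap characters and the `*` symbol pass through unchanged, since both are listed as accepted above.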
The nucleic acid codes supported are:
| Nucleic Acid Code | Meaning |
|---|---|
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |
| U | Uracil |
| R | G A (puRine) |
| Y | T C (pYrimidine) |
| K | G T (Ketone) |
| M | A C (aMino group) |
| S | G C (Strong interaction) |
| W | A T (Weak interaction) |
| B | G T C (not A) (B comes after A) |
| D | G A T (not C) (D comes after C) |
| H | A C T (not G) (H comes after G) |
| V | G C A (not T, not U) (V comes after U) |
| N | A G C T (aNy) |
| X | masked |
| - | gap of indeterminate length |
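One practical use of the ambiguity codes in the table above is to search a sequence for a motif that contains them. The sketch below turns a motif into a regular expression; the names `IUPAC_NUCLEOTIDES` and `pattern_to_regex` are invented for this example, the mapping is transcribed from the table, and `X` and `-` are deliberately left out because they do not stand for a fixed set of bases.

```python
import re

# Ambiguity code -> the bases it stands for, transcribed from the table.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def pattern_to_regex(pattern: str):
    """Compile an IUPAC nucleotide motif into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_NUCLEOTIDES[code]  # KeyError for X, "-" or unknown codes
        # A single base stays literal; several bases become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


motif = pattern_to_regex("GAYW")
hit = motif.search("TTGATACC")
```

Note that this treats `T` and `U` as distinct letters, so a DNA motif will not match an RNA sequence unless one of the two is converted first.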
The amino acid codes supported are:
| Amino Acid Code | Meaning |
|---|---|
| A | Alanine |
| B | Aspartic acid or Asparagine |
| C | Cysteine |
| D | Aspartic acid |
| E | Glutamic acid |
| F | Phenylalanine |
| G | Glycine |
| H | Histidine |
| I | Isoleucine |
| K | Lysine |
| L | Leucine |
| M | Methionine |
| N | Asparagine |
| P | Proline |
| Q | Glutamine |
| R | Arginine |
| S | Serine |
| T | Threonine |
| U | Selenocysteine |
| V | Valine |
| W | Tryptophan |
| Y | Tyrosine |
| Z | Glutamic acid or Glutamine |
| X | any |
| * | translation stop |
| - | gap of indeterminate length |
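The amino acid table above can serve as a simple validity check for peptide sequences. This is a minimal sketch: `AMINO_ACID_CODES`, `AMBIGUOUS_RESIDUES`, `invalid_residues` and `possible_residues` are hypothetical names, the accepted alphabet is exactly the set of codes listed in the table, and positions are reported 1-based by assumption.

```python
# All codes listed in the table, including X, "*" and "-".
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")

# Codes that stand for one of two residues, as given in the table.
AMBIGUOUS_RESIDUES = {"B": "DN", "Z": "EQ"}


def invalid_residues(sequence: str):
    """Return (position, character) pairs for codes not in the table."""
    return [
        (position, ch)
        for position, ch in enumerate(sequence.upper(), start=1)
        if ch not in AMINO_ACID_CODES
    ]


def possible_residues(code: str) -> str:
    """Return the residues a code may stand for (itself if unambiguous)."""
    code = code.upper()
    return AMBIGUOUS_RESIDUES.get(code, code)


problems = invalid_residues("mkjo")
```

An empty result means that every character is one of the listed codes; it does not prove that the sequence is a protein, because most nucleotide letters are also valid amino acid codes.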
converter
Readseq for converting sequence formats to FASTA
Nexus to FASTA converter
GenBank to FASTA converter
fasta.txt · Last modified: 2008/07/22 13:31 by 127.0.0.1