FASTA
wikipedia: FASTA format
NCBI's FASTA format description
FASTA is a text-based format for representing nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are written as single-letter codes. The format also allows sequence names and comments to precede the sequences.
format information
- text based
- no standard file extension exists for text files containing FASTA-formatted sequences; common extensions include .fa, .mpfa, .fna, .fsa, .fas and .fasta
Data type handled
- nucleic acid sequences
- peptide sequences
file format
- each sequence begins with a single-line description (the header line), followed by lines of sequence data
- it is recommended that all lines of text be shorter than 80 characters
- the sequence ends when another line starting with “>” appears (this indicates the start of another sequence) or when the file ends
simple example:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
example of a multiple sequence FASTA file:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
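The file layout described above (a “>” header line, then sequence lines until the next “>” or the end of the file) can be read with a few lines of Python. This is only a minimal sketch, not a reference parser: the function name `read_fasta` is made up for this page, blank lines are skipped, and any text before the first header is ignored, both of which are assumptions rather than rules of the format.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header = None
    chunks = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are tolerated and skipped
        if line.startswith(">"):
            if header is not None:
                # a new ">" line ends the previous sequence
                yield header, "".join(chunks)
            header = line[1:]  # header text without the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span many lines
        # assumption: data before the first ">" is silently ignored
    if header is not None:
        yield header, "".join(chunks)


# Shortened version of the multiple sequence example above.
example = io.StringIO(
    ">SEQUENCE_1\n"
    "MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG\n"
    "LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK\n"
    ">SEQUENCE_2\n"
    "SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

Reading from a file works the same way, for example `with open("my.fasta") as handle: ...` (the file name is a placeholder). The generator keeps only one record in memory at a time, which matters for large files.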
Header line
- begins with “>”
- word following is the identifier and/or name of the sequence (optional)
- rest of the line is the description (optional)
- no space between the “>” and the first letter of the identifier
- header line may contain more than one header separated by a ^A (Control-A) character
- Sequence identifiers:
- Many different sequence databases use standardized headers, which helps when automatically extracting information from the header
- NCBI defined a standard for the unique identifier
- NCBI does not give a definitive description of the FASTA defline format; the following table is an attempt to summarize the identifier syntax used for common databases:
Database | Identifier syntax |
---|---|
GenBank | gi|gi-number|gb|accession|locus |
EMBL Data Library | gi|gi-number|emb|accession|locus |
DDBJ, DNA Database of Japan | gi|gi-number|dbj|accession|locus |
NBRF PIR | pir||entry |
Protein Research Foundation | prf||name |
SWISS-PROT | sp|accession|name |
Brookhaven Protein Data Bank (1) | pdb|entry|chain |
Brookhaven Protein Data Bank (2) | entry:chain|PDBID|CHAIN|SEQUENCE |
Patents | pat|country|number |
GenInfo Backbone Id | bbs|number |
General database identifier | gnl|database|identifier |
NCBI Reference Sequence | ref|accession|locus |
Local Sequence identifier | lcl|identifier |
Note: the gi number is a series of digits that identifies a database entry at NCBI.
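The header rules above can be illustrated with a short Python sketch that splits a header line into identifier and description and then breaks a pipe-delimited identifier into named fields. The function names and the `FIELDS` mapping are invented for this example; the mapping copies only some rows of the table and is not an official NCBI definition.

```python
# Field names per database tag, copied from the table above (not exhaustive).
FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "sp": ("accession", "name"),
    "ref": ("accession", "locus"),
    "pdb": ("entry", "chain"),
    "gnl": ("database", "identifier"),
    "lcl": ("identifier",),
}


def split_header(header_line):
    """Return a list of (identifier, description) pairs for one header line."""
    text = header_line[1:] if header_line.startswith(">") else header_line
    pairs = []
    for part in text.split("\x01"):  # Control-A separates concatenated headers
        identifier, _, description = part.strip().partition(" ")
        pairs.append((identifier, description.strip()))
    return pairs


def parse_identifier(identifier):
    """Map a pipe-delimited identifier to a dict of named fields."""
    tokens = identifier.split("|")
    result = {}
    if len(tokens) > 1 and tokens[0] == "gi":
        result["gi"] = tokens[1]  # leading gi|gi-number pair
        tokens = tokens[2:]
    if tokens and tokens[0] in FIELDS:
        result["db"] = tokens[0]
        for name, value in zip(FIELDS[tokens[0]], tokens[1:]):
            result[name] = value
    return result  # empty dict if the identifier follows no known scheme


line = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
identifier, description = split_header(line)[0]
print(parse_identifier(identifier))
print(description)
```

Identifiers that do not follow one of the listed schemes, such as SEQUENCE_1 in the example file, give an empty dictionary, so callers should fall back to the raw identifier string in that case.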
Sequence representation
- the sequence data follows the header line and any comments
- each line of a sequence should have fewer than 80 characters
- Sequences may be protein sequences or nucleic acid sequences
- can contain gaps or alignment characters
- Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
- lower-case letters are accepted and are mapped into upper-case
- a single hyphen or dash can be used to represent a gap character
- in amino acid sequences: U and * are acceptable letters
- Numerical digits are not allowed but are used in some databases to indicate the position in the sequence
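The points above (lower case mapped to upper case, position digits used by some databases, lines shorter than 80 characters) can be sketched in Python as follows. The function names are hypothetical, dropping digits instead of rejecting them is an assumption made for this example, and the default width of 70 is just one choice below the recommended limit.

```python
def normalize(sequence):
    """Upper-case a sequence and drop whitespace and position digits."""
    return "".join(
        ch.upper()
        for ch in sequence
        if not (ch.isspace() or ch.isdigit())  # assumption: digits are discarded
    )


def format_record(header, sequence, width=70):
    """Return one FASTA record as text, wrapping sequence lines at `width`."""
    lines = [">" + header]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


raw = "  1 acgt nnac\n 11 gg-a"  # numbered layout as used by some databases
clean = normalize(raw)
print(format_record("example sequence", clean, width=4), end="")
```

Gap characters and other non-letter symbols pass through `normalize` unchanged, so a check against the code tables is still needed before the data is used.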
The nucleic acid codes supported are:
Nucleic Acid Code | Meaning |
---|---|
A | Adenine |
C | Cytosine |
G | Guanine |
T | Thymine |
U | Uracil |
R | G A (puRine) |
Y | T C (pYrimidine) |
K | G T (Ketone) |
M | A C (aMino group) |
S | G C (Strong interaction) |
W | A T (Weak interaction) |
B | G T C (not A) (B comes after A) |
D | G A T (not C) (D comes after C) |
H | A C T (not G) (H comes after G) |
V | G C A (not T, not U) (V comes after U) |
N | A G C T (aNy) |
X | masked |
- | gap of indeterminate length |
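The table above translates directly into a lookup structure. The following Python sketch stores the bases each code stands for and uses them to validate a sequence and to test whether an ambiguity code covers a concrete base. The names `IUPAC_NUCLEOTIDES`, `is_nucleic_acid` and `matches` are made up for this example; X and the gap character are treated as allowed symbols that match no specific base, which is an assumption.

```python
# Bases represented by each code, taken from the table above.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def is_nucleic_acid(sequence):
    """True if every character is a code from the nucleic acid table."""
    allowed = set(IUPAC_NUCLEOTIDES) | {"X", "-"}  # masked and gap symbols
    return all(ch in allowed for ch in sequence.upper())


def matches(code, base):
    """True if the concrete base is one of the bases `code` stands for."""
    return base.upper() in IUPAC_NUCLEOTIDES.get(code.upper(), "")


print(is_nucleic_acid("acgtRYnn-x"))
print(matches("R", "a"), matches("B", "A"))
```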
The amino acid codes supported are:
Amino Acid Code | Meaning |
---|---|
A | Alanine |
B | Aspartic acid or Asparagine |
C | Cysteine |
D | Aspartic acid |
E | Glutamic acid |
F | Phenylalanine |
G | Glycine |
H | Histidine |
I | Isoleucine |
K | Lysine |
L | Leucine |
M | Methionine |
N | Asparagine |
P | Proline |
Q | Glutamine |
R | Arginine |
S | Serine |
T | Threonine |
U | Selenocysteine |
V | Valine |
W | Tryptophan |
Y | Tyrosine |
Z | Glutamic acid or Glutamine |
X | any |
* | translation stop |
- | gap of indeterminate length |
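The amino acid table can be handled the same way. This Python sketch checks a sequence against the listed codes and expands the ambiguity codes B, Z and X into the residues they may stand for. All names are hypothetical, and treating X as "any of the listed residues" is an interpretation of the table entry.

```python
# Unambiguous residue codes from the table above (including U, selenocysteine).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTUVWY")
AMBIGUOUS = {"B": "DN", "Z": "EQ"}  # B = D or N, Z = E or Q
SPECIAL = {"X", "*", "-"}           # any residue, translation stop, gap


def is_protein(sequence):
    """True if every character is a code from the amino acid table."""
    allowed = AMINO_ACIDS | set(AMBIGUOUS) | SPECIAL
    return all(ch in allowed for ch in sequence.upper())


def possible_residues(code):
    """Return the set of residues a single code may represent."""
    code = code.upper()
    if code == "X":
        return set(AMINO_ACIDS)  # assumption: "any" means any listed residue
    if code in AMBIGUOUS:
        return set(AMBIGUOUS[code])
    return {code} if code in AMINO_ACIDS else set()


print(is_protein("MTEITAAMVK*"))
print(sorted(possible_residues("B")))
```

Because the two alphabets overlap, a string such as ACGT passes both this check and the nucleic acid check, so the sequence type cannot be decided from the characters alone.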
converter
Readseq for converting sequence formats to FASTA
Nexus to Fasta converter
GenBank to Fasta converter
fasta.txt · Last modified: 2008/07/22 13:31 by 127.0.0.1