FASTA

FASTA format is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.

format information

text based

Data type handled

nucleic acid sequences
peptide sequences

file format

Each record begins with a single-line description, followed by lines of sequence data
It is recommended that all lines of text be shorter than 80 characters
A sequence ends when another line starting with “>” appears (this indicates the start of another sequence)
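
The layout rules above can be sketched as a small writer. This is a minimal illustration, not a reference implementation: the function name `format_fasta` and the default width of 60 are assumptions chosen for the example; any width below 80 satisfies the recommendation.

```python
def format_fasta(header, sequence, width=60):
    """Return one FASTA record as text (hypothetical helper).

    The header goes on a single line after '>', and the sequence is
    broken into lines of at most `width` characters.
    """
    # Slice the sequence into fixed-size chunks, one per output line.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + lines) + "\n"


if __name__ == "__main__":
    print(format_fasta("SEQUENCE_1", "MTEITAAMVKELRESTGAGMMDCKNALSETNG" * 4), end="")
```

Concatenating the output of several calls gives a multiple sequence file, because each new “>” line ends the previous sequence.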

simple example:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

example of a multiple sequence FASTA file:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
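
A file like the one above could be read with a few lines of Python. The sketch below assumes the whole file fits in memory and that blank lines can be ignored; `parse_fasta` is a name made up for this example, not part of any library.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text.

    The header is returned without the leading '>'; the sequence lines
    of each record are joined into a single string.
    """
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new '>' line ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the last record, which is not followed by another '>' line.
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = ">SEQUENCE_1\nMTEITAAMVK\nLVSVKVSDDF\n>SEQUENCE_2\nSATVSEINSE\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```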

Header line

begins with “>”
the word directly following the “>” is the identifier and/or name of the sequence (optional)
rest of the line is the description (optional)
no space between the “>” and the first letter of the identifier
header line may contain more than one header separated by a ^A (Control-A) character
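
As an illustration of these rules, a header line could be taken apart as follows. This is a sketch under the assumption that the identifier ends at the first whitespace; `split_header` is a hypothetical helper.

```python
def split_header(line):
    """Split a header line into a list of (identifier, description) pairs.

    A line holding several headers separated by Control-A yields one
    pair per header.
    """
    if not line.startswith(">"):
        raise ValueError("a header line must begin with '>'")
    entries = []
    # "\x01" is the Control-A character that separates multiple headers.
    for part in line[1:].split("\x01"):
        fields = part.strip().split(None, 1)
        identifier = fields[0] if fields else ""
        description = fields[1] if len(fields) > 1 else ""
        entries.append((identifier, description))
    return entries


if __name__ == "__main__":
    line = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
    print(split_header(line))
```

Both parts are optional, so either element of a pair may be an empty string.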

Sequence identifiers:

Many different sequence databases use standardized headers, which helps when automatically extracting information from the header
NCBI defined a standard for the unique identifier

NCBI does not give a definitive description of the FASTA defline format; the following table is an attempt to summarize it:

GenBank                           gi|gi-number|gb|accession|locus
EMBL Data Library                 gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan       gi|gi-number|dbj|accession|locus
NBRF PIR                          pir||entry
Protein Research Foundation       prf||name
SWISS-PROT                        sp|accession|name
Brookhaven Protein Data Bank (1)  pdb|entry|chain
Brookhaven Protein Data Bank (2)  entry:chain|PDBID|CHAIN|SEQUENCE
Patents                           pat|country|number 
GenInfo Backbone Id               bbs|number 
General database identifier       gnl|database|identifier
NCBI Reference Sequence           ref|accession|locus
Local Sequence identifier         lcl|identifier
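
The table can be turned into a small lookup for splitting an identifier into named fields. The sketch below is only an illustration of the table as listed here: the field names are copied from it, `None` marks the empty field in `pir||entry` and `prf||name`, the colon-based second PDB form is not handled, and `parse_identifier` is a hypothetical name.

```python
# Field names per database tag, taken from the table above.
ID_FIELDS = {
    "gi": ("gi-number",),
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_identifier(identifier):
    """Return a list of (tag, fields) pairs for a '|'-separated identifier."""
    tokens = identifier.split("|")
    result = []
    i = 0
    while i < len(tokens):
        names = ID_FIELDS.get(tokens[i])
        if names is None:
            i += 1  # skip tokens that are not a known tag
            continue
        values = tokens[i + 1:i + 1 + len(names)]
        fields = {n: v for n, v in zip(names, values) if n is not None}
        result.append((tokens[i], fields))
        i += 1 + len(names)
    return result


if __name__ == "__main__":
    print(parse_identifier("gi|5524211|gb|AAD44166.1|"))
```

For the identifier of the cytochrome b example, this yields a `gi` entry with the gi-number and a `gb` entry with the accession and an empty locus.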

Sequence representation

After the header line and comments, one or more lines of sequence data follow
Each line of a sequence should have fewer than 80 characters
Sequences may be protein sequences or nucleic acid sequences
Sequences can contain gaps or alignment characters
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
- lower-case letters are accepted and are mapped into upper-case
- a single hyphen or dash can be used to represent a gap character
- in amino acid sequences: U and * are acceptable letters
Numerical digits are not allowed but are used in some databases to indicate the position in the sequence
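
A reader might apply these rules roughly as in the following sketch. It rests on two assumptions that the format itself does not prescribe: digits are treated as position numbers and silently dropped, and any other unexpected character is reported as an error. The names `normalize_sequence`, `NUCLEIC` and `AMINO` are made up for the example; the two alphabets are the codes listed in the tables on this page.

```python
NUCLEIC = set("ACGTURYKMSWBDHVNX-")
AMINO = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def normalize_sequence(raw, allowed):
    """Upper-case a raw sequence, drop whitespace and digits, validate the rest."""
    cleaned = []
    for ch in raw:
        if ch.isspace() or ch.isdigit():
            continue  # assumption: digits are position numbers, not data
        ch = ch.upper()  # lower-case letters are mapped to upper-case
        if ch not in allowed:
            raise ValueError(f"unexpected character {ch!r} in sequence")
        cleaned.append(ch)
    return "".join(cleaned)


if __name__ == "__main__":
    print(normalize_sequence("  1 acgt nnac\n 11 gg-a", NUCLEIC))
```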

The nucleic acid codes supported are:

Nucleic Acid Code	Meaning
A	Adenine
C	Cytosine
G	Guanine
T	Thymine
U	Uracil
R	G A (puRine)
Y	T C (pYrimidine)
K	G T (Keto)
M	A C (aMino group)
S	G C (Strong interaction)
W	A T (Weak interaction)
B	G T C (not A) (B comes after A)
D	G A T (not C) (D comes after C)
H	A C T (not G) (H comes after G)
V	G C A (not T, not U) (V comes after U)
N	A G C T (aNy)
X	masked
-	gap of indeterminate length
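
One common use of the ambiguity codes is pattern matching. The sketch below translates a pattern written with the codes from the table into a regular expression. The names `IUPAC_NUCLEOTIDES` and `iupac_to_regex` are hypothetical, and X and the gap character are deliberately left out because they do not stand for a set of bases, so a pattern containing them raises `KeyError`.

```python
import re

# Bases represented by each code, as listed in the table above.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def iupac_to_regex(pattern):
    """Compile a pattern of nucleic acid codes into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_NUCLEOTIDES[code]
        # A single base matches itself; several bases become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


if __name__ == "__main__":
    motif = iupac_to_regex("GAYW")
    print(motif.pattern)
    print(bool(motif.search("CCGATACC")))
```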

The amino acid codes supported are:

Amino Acid Code	Meaning
A	Alanine
B	Aspartic acid or Asparagine
C	Cysteine
D	Aspartic acid
E	Glutamic acid
F	Phenylalanine
G	Glycine
H	Histidine
I	Isoleucine
K	Lysine
L	Leucine
M	Methionine
N	Asparagine
P	Proline
Q	Glutamine
R	Arginine
S	Serine
T	Threonine
U	Selenocysteine
V	Valine
W	Tryptophan
Y	Tyrosine
Z	Glutamic acid or Glutamine
X	any
*	translation stop
-	gap of indeterminate length
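
The two-way ambiguity codes in this table (B and Z) can be located in a protein sequence as sketched below. The helper name `ambiguous_positions` is an assumption for this example, and X, * and the gap character are not reported.

```python
# Residues covered by each two-way ambiguity code from the table above.
AMBIGUOUS_AMINO = {"B": "DN", "Z": "EQ"}


def ambiguous_positions(sequence):
    """Return (1-based position, code, candidate residues) for every B or Z."""
    hits = []
    for position, code in enumerate(sequence.upper(), start=1):
        if code in AMBIGUOUS_AMINO:
            hits.append((position, code, AMBIGUOUS_AMINO[code]))
    return hits


if __name__ == "__main__":
    print(ambiguous_positions("MBAZ"))
```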

Master's thesis, Heidi Lischer