FASTA

The FASTA format is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The format also allows sequence names and comments to precede the sequences.

Format information

  • text-based

Data types handled

  • nucleic acid sequences
  • peptide sequences

File format

  • Each record begins with a single-line description, followed by one or more lines of sequence data
  • It is recommended that every line of text be shorter than 80 characters
  • A sequence ends when another line starting with “>” appears (this marks the start of the next sequence) or at the end of the file
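
The layout rules above can be illustrated with a small writer. This is only a sketch: the function name `format_fasta` and the default width of 70 columns are assumptions made for the example (any width below 80 satisfies the recommendation).

```python
def format_fasta(header: str, sequence: str, width: int = 70) -> str:
    """Return one FASTA record with the sequence wrapped to `width` columns.

    `format_fasta` and the default width are illustrative choices,
    not part of any FASTA specification.
    """
    if "\n" in header or "\r" in header:
        raise ValueError("the description must fit on a single line")
    if not 0 < width < 80:
        raise ValueError("width should be between 1 and 79 characters")
    # Remove any whitespace already present so wrapping starts from clean data.
    compact = "".join(sequence.split())
    # Cut the sequence into fixed-width chunks, one chunk per output line.
    chunks = [compact[i:i + width] for i in range(0, len(compact), width)]
    return "\n".join([">" + header] + chunks) + "\n"


if __name__ == "__main__":
    print(format_fasta("SEQUENCE_1 demo record", "MTEITAAMVKELRESTGAGM" * 6), end="")
```

Because the header must stay on one line, a long description can exceed the recommended line length even when the sequence lines do not; the sketch leaves the header unchanged.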

Simple example:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Example of a multi-sequence FASTA file:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
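
A file like the one above can be read with a few lines of Python. The following is a minimal sketch, not a full parser: the name `read_fasta` is hypothetical, blank lines are ignored, and any sequence data that appears before the first “>” line is skipped.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
        # Lines before the first header are skipped in this sketch.
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    demo = """>SEQUENCE_1
MTEITAAMVKELRESTGAGM
LVSVKVSDDFTIAAMRPSYL
>SEQUENCE_2
SATVSEINSETDFVAKNDQF
"""
    for name, seq in read_fasta(demo.splitlines()):
        print(name, len(seq))
```

Passing an open file object instead of `demo.splitlines()` works the same way, since both yield one line at a time.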

Header line

  • begins with “>”
  • the word immediately following the “>” is the identifier and/or name of the sequence (optional)
  • the rest of the line is the description (optional)
  • there must be no space between the “>” and the first letter of the identifier
  • a header line may contain more than one header, separated by ^A (Control-A) characters
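
As an illustration of these rules, a header line can be split into identifier and description as sketched below. The function name `split_header` is an assumption for the example, and the sketch treats the first space as the boundary between identifier and description.

```python
def split_header(line: str):
    """Split a header line into a list of (identifier, description) pairs.

    A line holding several headers joined by Control-A yields several pairs.
    """
    if not line.startswith(">"):
        raise ValueError("a header line must begin with '>'")
    entries = []
    # Control-A is the character with code 1, written "\x01" in Python.
    for part in line[1:].rstrip("\r\n").split("\x01"):
        # Everything up to the first space is the identifier; the rest,
        # if present, is the description. Both may be empty.
        identifier, _, description = part.partition(" ")
        entries.append((identifier, description.strip()))
    return entries


if __name__ == "__main__":
    header = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
    print(split_header(header))
```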


  • Sequence identifiers:
    • Many sequence databases use standardized headers, which makes it easier to extract information from the header automatically
    • NCBI defined a standard for the unique identifier
    • NCBI does not give a definitive description of the FASTA defline format; the following list is an attempt to compile one:
      GenBank                           gi|gi-number|gb|accession|locus
      EMBL Data Library                 gi|gi-number|emb|accession|locus
      DDBJ, DNA Database of Japan       gi|gi-number|dbj|accession|locus
      NBRF PIR                          pir||entry
      Protein Research Foundation       prf||name
      SWISS-PROT                        sp|accession|name
      Brookhaven Protein Data Bank (1)  pdb|entry|chain
      Brookhaven Protein Data Bank (2)  entry:chain|PDBID|CHAIN|SEQUENCE
      Patents                           pat|country|number 
      GenInfo Backbone Id               bbs|number 
      General database identifier       gnl|database|identifier
      NCBI Reference Sequence           ref|accession|locus
      Local Sequence identifier         lcl|identifier
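
The pipe-separated layouts listed above lend themselves to a table-driven parser. The sketch below is an assumption-laden illustration: `ID_FIELDS` and `parse_identifier` are hypothetical names, the field names are copied from the list, and the second Brookhaven layout (`entry:chain|PDBID|CHAIN|SEQUENCE`) is not handled.

```python
# Field names that follow each database tag, taken from the list above.
# None marks a position that is left empty in the layout (e.g. "pir||entry").
ID_FIELDS = {
    "gi": ("gi-number",),
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "name"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "ref": ("accession", "locus"),
    "lcl": ("identifier",),
}


def parse_identifier(identifier: str) -> dict:
    """Map each recognised tag in a '|'-separated identifier to its fields."""
    tokens = identifier.split("|")
    result = {}
    i = 0
    while i < len(tokens):
        names = ID_FIELDS.get(tokens[i])
        if names is None:
            i += 1  # unknown or empty token: skip it
            continue
        values = tokens[i + 1:i + 1 + len(names)]
        result[tokens[i]] = {n: v for n, v in zip(names, values) if n is not None}
        i += 1 + len(names)
    return result


if __name__ == "__main__":
    print(parse_identifier("gi|5524211|gb|AAD44166.1|"))
```

For the cytochrome b example earlier on this page the locus field is empty, which the sketch reports as an empty string rather than dropping it.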

Sequence representation

  • The sequence follows the header line and any comments
  • Each line of a sequence should have fewer than 80 characters
  • Sequences may be protein sequences or nucleic acid sequences
  • Sequences can contain gaps or alignment characters
  • Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
    • lower-case letters are accepted and are mapped into upper-case
    • a single hyphen or dash can be used to represent a gap character
    • in amino acid sequences, U and * are acceptable letters
  • Numerical digits are not allowed in the sequence, although some databases use them to indicate the position in the sequence
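
These rules can be turned into a simple normalisation and validation step. The sketch below is one possible reading of them: the names `normalize_sequence`, `invalid_characters`, `NUCLEIC_CODES` and `PROTEIN_CODES` are assumptions, the two alphabets are transcribed from the code tables on this page, and digits are treated as position markers to be dropped rather than as errors.

```python
import re

# Alphabets transcribed from the nucleic acid and amino acid code tables.
NUCLEIC_CODES = frozenset("ACGTURYKMSWBDHVNX-")
PROTEIN_CODES = frozenset("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def normalize_sequence(raw: str) -> str:
    """Map lower-case to upper-case and drop digits and whitespace."""
    return re.sub(r"[\d\s]", "", raw).upper()


def invalid_characters(raw: str, alphabet) -> set:
    """Return the characters of the normalised sequence not in `alphabet`."""
    return set(normalize_sequence(raw)) - set(alphabet)


if __name__ == "__main__":
    print(normalize_sequence("1 acgtacgtnn 10\n11 ac-gt"))
    print(invalid_characters("MKV*J", PROTEIN_CODES))
```

Returning the offending characters instead of a plain yes/no makes it easier to report what is wrong with a record.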


The nucleic acid codes supported are:

Nucleic Acid Code  Meaning
A                  Adenine
C                  Cytosine
G                  Guanine
T                  Thymine
U                  Uracil
R                  G or A (puRine)
Y                  T or C (pYrimidine)
K                  G or T (Keto)
M                  A or C (aMino group)
S                  G or C (Strong interaction)
W                  A or T (Weak interaction)
B                  G, T or C (not A; B comes after A)
D                  G, A or T (not C; D comes after C)
H                  A, C or T (not G; H comes after G)
V                  G, C or A (not T, not U; V comes after U)
N                  A, G, C or T (aNy)
X                  masked
-                  gap of indeterminate length
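
One use of the ambiguity codes is checking whether a concrete sequence is compatible with a degenerate one. The following sketch assumes a DNA-style comparison: `IUPAC_BASES` and `is_compatible` are hypothetical names, and X (masked) and - (gap) are deliberately left out of the mapping, so a pattern containing them never matches.

```python
# Bases allowed by each code, transcribed from the table above.
# X and "-" are omitted on purpose in this sketch.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def is_compatible(pattern: str, concrete: str) -> bool:
    """True if each base of `concrete` is allowed by the code at the same
    position in `pattern`. Both strings must have the same length."""
    if len(pattern) != len(concrete):
        return False
    for code, base in zip(pattern.upper(), concrete.upper()):
        allowed = IUPAC_BASES.get(code)
        if allowed is None or base not in allowed:
            return False
    return True


if __name__ == "__main__":
    print(is_compatible("RYN", "ACG"))
    print(is_compatible("RYN", "CCG"))
```

In the first call every position is allowed (A is a purine, C a pyrimidine, and N accepts anything); in the second, C is not a purine, so the check fails.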


The amino acid codes supported are:

Amino Acid Code  Meaning
A                Alanine
B                Aspartic acid or Asparagine
C                Cysteine
D                Aspartic acid
E                Glutamic acid
F                Phenylalanine
G                Glycine
H                Histidine
I                Isoleucine
K                Lysine
L                Leucine
M                Methionine
N                Asparagine
P                Proline
Q                Glutamine
R                Arginine
S                Serine
T                Threonine
U                Selenocysteine
V                Valine
W                Tryptophan
Y                Tyrosine
Z                Glutamic acid or Glutamine
X                any
*                translation stop
-                gap of indeterminate length
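
For completeness, the table above can be held in a dictionary and used to spell out a peptide sequence. This is a small sketch; `AMINO_ACID_NAMES` and `residue_names` are names chosen for the example, and codes not listed in the table raise a `KeyError`.

```python
# One-letter codes and meanings, transcribed from the table above.
AMINO_ACID_NAMES = {
    "A": "Alanine", "B": "Aspartic acid or Asparagine", "C": "Cysteine",
    "D": "Aspartic acid", "E": "Glutamic acid", "F": "Phenylalanine",
    "G": "Glycine", "H": "Histidine", "I": "Isoleucine", "K": "Lysine",
    "L": "Leucine", "M": "Methionine", "N": "Asparagine", "P": "Proline",
    "Q": "Glutamine", "R": "Arginine", "S": "Serine", "T": "Threonine",
    "U": "Selenocysteine", "V": "Valine", "W": "Tryptophan", "Y": "Tyrosine",
    "Z": "Glutamic acid or Glutamine", "X": "any",
    "*": "translation stop", "-": "gap of indeterminate length",
}


def residue_names(sequence: str) -> list:
    """Translate each one-letter code into its meaning.

    Lower-case input is mapped to upper-case first.
    """
    return [AMINO_ACID_NAMES[code] for code in sequence.upper()]


if __name__ == "__main__":
    print(residue_names("mk*"))
```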


fasta.1197902975.txt.gz · Last modified: 2008/07/22 13:30 (external edit)