FASTA

FASTA is a text-based format for representing nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The format also allows a sequence name and comments to precede each sequence.

Format information

  • text based

Data types handled

  • nucleic acid sequences
  • peptide sequences

File format

  • each record begins with a single-line description, followed by lines of sequence data
  • description line:
    • “>” symbol in the first column
    • the word following the “>” is the identifier of the sequence (optional)
    • the rest of the line is the description (optional)
    • no space between the “>” and the first letter of the identifier
  • it is recommended that all lines of text be shorter than 80 characters
  • a sequence ends when another line starting with “>” appears (this indicates the start of another sequence)

simple example:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
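
The layout rules above can be turned into a small reader. The following is a minimal Python sketch, not a reference implementation: the function name read_fasta and the sample records are hypothetical, blank lines are skipped, and any sequence data appearing before the first “>” line is ignored (an assumption, since the format description does not cover that case).

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned without the leading ">"; the sequence lines
    of one record are joined into a single string.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
        # Assumption: data before the first ">" line is silently ignored.
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input, used only for illustration.
example = """\
>seq1 first hypothetical record
LCLYTHIGRN
IYYGSYLYSE
>seq2
ACGT
"""
records = list(read_fasta(example.splitlines()))
```

Here records holds one (header, sequence) tuple per “>” line, with the wrapped sequence lines of seq1 joined into one string.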

Header line

  • begins with “>”
  • gives a name and/or a unique identifier for the sequence
  • may also contain other information
  • many sequence databases use standardized headers, which makes it easier to extract information from the header automatically
  • a header line may contain more than one header, separated by a ^A (Control-A) character
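
As an illustration of the header rules, the sketch below splits a header line into identifier and description, handling several headers joined by Control-A. The function name split_header is hypothetical. Treating a header that starts with whitespace as "description only" is an assumption drawn from the rule that no space separates “>” from the identifier. Splitting the identifier on “|” mirrors the example header on this page and is not a general rule for all databases.

```python
from typing import List, Tuple


def split_header(header_line: str) -> List[Tuple[str, str]]:
    """Split a header line into (identifier, description) pairs.

    Several headers joined by Control-A (byte 0x01) yield one pair each.
    """
    text = header_line.rstrip("\r\n")
    if text.startswith(">"):
        text = text[1:]
    pairs = []
    for part in text.split("\x01"):
        if part[:1].isspace():
            # Assumption: leading whitespace means the identifier is absent.
            pairs.append(("", part.strip()))
            continue
        # The identifier is the first word; the rest is the description.
        fields = part.split(None, 1)
        identifier = fields[0] if fields else ""
        description = fields[1].strip() if len(fields) > 1 else ""
        pairs.append((identifier, description))
    return pairs


header = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
identifier, description = split_header(header)[0]
# Pipe-delimited fields of the identifier, as in the example header.
id_fields = [field for field in identifier.split("|") if field]
```

For the example header, id_fields is the list of the four non-empty fields between the “|” characters.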

Sequence representation

  • follows the header line and any comments
  • each line of a sequence should have fewer than 80 characters
  • sequences may be protein sequences or nucleic acid sequences
  • sequences can contain gaps or alignment characters
  • sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
    • lower-case letters are accepted and are mapped to upper case
    • a single hyphen or dash can be used to represent a gap character
    • in amino acid sequences, U and * are acceptable letters
  • numerical digits are not allowed in the sequence itself, although some databases use them to indicate the position in the sequence
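
The points above suggest a simple clean-up step before a sequence is used. The sketch below is one possible approach, with hypothetical function names: normalize_sequence upper-cases the text and drops whitespace and position digits, and wrap_sequence re-breaks the result into lines. The default width of 70 is an assumption chosen only because it stays under the recommended 80 characters.

```python
def normalize_sequence(raw: str) -> str:
    """Upper-case a sequence and drop whitespace and position digits."""
    kept = (ch for ch in raw if not ch.isspace() and not ch.isdigit())
    return "".join(kept).upper()


def wrap_sequence(seq: str, width: int = 70) -> str:
    """Break a sequence into lines of at most `width` characters."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


# Hypothetical database-style input with position numbers and a gap.
cleaned = normalize_sequence("1 acgt-acgt 10\n11 ttga")
wrapped = wrap_sequence("A" * 150)
```

Gap characters are kept as they are, since a hyphen is a legal part of the sequence.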


The nucleic acid codes supported are:

Nucleic Acid Code    Meaning
A                    Adenine
C                    Cytosine
G                    Guanine
T                    Thymine
U                    Uracil
R                    G or A (puRine)
Y                    T or C (pYrimidine)
K                    G or T (Keto)
M                    A or C (aMino)
S                    G or C (Strong interaction)
W                    A or T (Weak interaction)
B                    G, T or C (not A; B comes after A)
D                    G, A or T (not C; D comes after C)
H                    A, C or T (not G; H comes after G)
V                    G, C or A (not T, not U; V comes after U)
N                    A, G, C or T (aNy)
X                    masked
-                    gap of indeterminate length
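
The table above maps naturally onto a lookup dictionary. The following sketch is illustrative only; the names IUPAC_NUCLEOTIDES, code_matches and is_nucleic are hypothetical. It assumes that U is treated as equivalent to T when matching, and that X (masked) and - (gap) are legal in a sequence but match no concrete base.

```python
# Each code maps to the concrete bases it can stand for (from the table).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "U": "T",  # assumption: U is treated as T for matching purposes
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Symbols that are legal in a sequence but stand for no concrete base.
NON_BASE_SYMBOLS = "X-"


def code_matches(code: str, base: str) -> bool:
    """Return True if the concrete base is allowed by the given code."""
    base = base.upper()
    if base == "U":
        base = "T"
    allowed = set(IUPAC_NUCLEOTIDES.get(code.upper(), ""))
    return base in allowed


def is_nucleic(seq: str) -> bool:
    """Return True if every character is a code from the table above."""
    legal = set(IUPAC_NUCLEOTIDES) | set(NON_BASE_SYMBOLS)
    return all(ch in legal for ch in seq.upper())
```

For example, code_matches("R", "A") is true because R stands for G or A, while code_matches("B", "A") is false because B means "not A".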

The amino acid codes supported are:

Amino Acid Code    Meaning
A                  Alanine
B                  Aspartic acid or Asparagine
C                  Cysteine
D                  Aspartic acid
E                  Glutamic acid
F                  Phenylalanine
G                  Glycine
H                  Histidine
I                  Isoleucine
K                  Lysine
L                  Leucine
M                  Methionine
N                  Asparagine
P                  Proline
Q                  Glutamine
R                  Arginine
S                  Serine
T                  Threonine
U                  Selenocysteine
V                  Valine
W                  Tryptophan
Y                  Tyrosine
Z                  Glutamic acid or Glutamine
X                  any
*                  translation stop
-                  gap of indeterminate length
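
The amino acid table can likewise be used to check a peptide sequence. The sketch below is a hedged example with hypothetical names (AMINO_ACID_CODES, invalid_residues). It accepts exactly the codes listed in the table, so letters that the table omits, such as J and O, are reported as invalid, as is any whitespace left in the input.

```python
# Codes and meanings exactly as listed in the table above.
AMINO_ACID_CODES = {
    "A": "Alanine", "B": "Aspartic acid or Asparagine", "C": "Cysteine",
    "D": "Aspartic acid", "E": "Glutamic acid", "F": "Phenylalanine",
    "G": "Glycine", "H": "Histidine", "I": "Isoleucine", "K": "Lysine",
    "L": "Leucine", "M": "Methionine", "N": "Asparagine", "P": "Proline",
    "Q": "Glutamine", "R": "Arginine", "S": "Serine", "T": "Threonine",
    "U": "Selenocysteine", "V": "Valine", "W": "Tryptophan",
    "Y": "Tyrosine", "Z": "Glutamic acid or Glutamine", "X": "any",
    "*": "translation stop", "-": "gap of indeterminate length",
}


def invalid_residues(seq: str) -> list:
    """Return (position, character) pairs for characters not in the table.

    Positions are 1-based; lower-case input is upper-cased first.
    """
    return [
        (pos, ch)
        for pos, ch in enumerate(seq.upper(), start=1)
        if ch not in AMINO_ACID_CODES
    ]
```

An empty result means that every character of the sequence is one of the supported codes.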

fasta.1197902103.txt.gz · Last modified: 2008/07/22 13:30 (external edit)