FASTA

wikipedia: FASTA format
NCBI's FASTA format description


FASTA format is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows sequence names and comments to precede the sequences.

format information

  • text based
  • there is no standard file extension for a text file containing FASTA-formatted sequences; common extensions include .fa, .mpfa, .fna, .fsa, .fas and .fasta

Data type handled

  • nucleic acid sequences
  • peptide sequences

file format

  • each sequence begins with a single-line description, followed by lines of sequence data
  • it is recommended that all lines of text be shorter than 80 characters
  • a sequence ends when another line starting with “>” appears (this indicates the start of another sequence) or at the end of the file
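The layout rules in this list can be sketched as a small writer. This is only a minimal sketch: the function name format_fasta and the default width of 70 are assumptions made for illustration (any width below 80 satisfies the recommendation).

```python
def format_fasta(records, width=70):
    """Render (header, sequence) pairs as FASTA text.

    `width` is an illustrative default; the format only recommends
    keeping every line shorter than 80 characters.
    """
    if not 0 < width < 80:
        raise ValueError("width must be between 1 and 79")
    lines = []
    for header, sequence in records:
        # Each record starts with a single '>' description line.
        lines.append(">" + header)
        # The sequence data follows, wrapped to the chosen width.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)
```

Calling format_fasta([("SEQUENCE_1", seq)]) on a long sequence yields one header line followed by as many wrapped sequence lines as needed.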

simple example:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

example of a multiple sequence FASTA file:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
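
Reading a multiple sequence file such as the one above can be sketched as follows. This is a hedged illustration, not a reference parser: the name parse_fasta is an assumption, blank lines are skipped, and any sequence data appearing before the first “>” line is silently ignored.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new '>' line ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated without the line breaks.
            chunks.append(line)
    # The last record ends at the end of the input.
    if header is not None:
        yield header, "".join(chunks)
```

Because the function only iterates over lines, it accepts an open file handle as well as a list of strings.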

Header line

  • begins with “>”
  • word following is the identifier and/or name of the sequence (optional)
  • rest of the line is the description (optional)
  • no space between the “>” and the first letter of the identifier
  • header line may contain more than one header separated by a ^A (Control-A) character


  • Sequence identifiers:
    • Many different sequence databases use standardized headers, which helps when automatically extracting information from the header
    • NCBI defined a standard for the unique identifier
    • NCBI does not give a definitive description of the FASTA defline format; the following is an attempt to summarize the identifier formats in use:
Database                          Identifier format
GenBank                           gi|gi-number|gb|accession|locus
EMBL Data Library                 gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan       gi|gi-number|dbj|accession|locus
NBRF PIR                          pir||entry
Protein Research Foundation       prf||name
SWISS-PROT                        sp|accession|name
Brookhaven Protein Data Bank (1)  pdb|entry|chain
Brookhaven Protein Data Bank (2)  entry:chain|PDBID|CHAIN|SEQUENCE
Patents                           pat|country|number
GenInfo Backbone Id               bbs|number
General database identifier       gnl|database|identifier
NCBI Reference Sequence           ref|accession|locus
Local Sequence identifier         lcl|identifier

Note: the gi number is a series of digits that identifies a database entry at NCBI.
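
Taking a header line apart along the rules above might look like the sketch below. The function names split_defline and split_identifier are assumptions made for illustration. The code only splits on the Control-A character, the first space and the “|” character; it does not interpret the database-specific fields.

```python
def split_defline(header):
    """Split a header line (without the leading '>') into a list of
    (identifier, description) pairs, one per ^A-separated header."""
    result = []
    for part in header.split("\x01"):  # ^A (Control-A) is character 0x01
        # The first word is the identifier, the rest is the description.
        identifier, _, description = part.strip().partition(" ")
        result.append((identifier, description.strip()))
    return result


def split_identifier(identifier):
    """Split a pipe-delimited identifier into its raw fields."""
    return identifier.split("|")
```

For the elephant cytochrome b example, split_identifier returns the fields gi, 5524211, gb and AAD44166.1, followed by an empty string for the missing locus.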


Sequence representation

  • sequence data follows the header line and any comments
  • each line of a sequence should have fewer than 80 characters
  • sequences may be protein sequences or nucleic acid sequences
  • sequences can contain gaps or alignment characters
  • Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
    • lower-case letters are accepted and are mapped into upper-case
    • a single hyphen or dash can be used to represent a gap character
    • in amino acid sequences: U and * are acceptable letters
  • Numerical digits are not allowed but are used in some databases to indicate the position in the sequence
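
The case mapping and the handling of position digits described in this list can be sketched as a small clean-up step. The function name normalize_sequence is an assumption, and dropping digits (rather than rejecting them) is one possible policy, not something the format prescribes.

```python
def normalize_sequence(text):
    """Upper-case a sequence and drop whitespace and position digits."""
    cleaned = []
    for ch in text:
        if ch.isspace() or ch.isdigit():
            continue  # layout whitespace and position counters
        cleaned.append(ch.upper())  # lower-case letters map to upper-case
    return "".join(cleaned)
```

Gap characters and the translation stop “*” pass through unchanged.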


The nucleic acid codes supported are:

Nucleic Acid Code  Meaning
A                  Adenine
C                  Cytosine
G                  Guanine
T                  Thymine
U                  Uracil
R                  G or A (puRine)
Y                  T or C (pYrimidine)
K                  G or T (Keto)
M                  A or C (aMino group)
S                  G or C (Strong interaction)
W                  A or T (Weak interaction)
B                  G, T or C (not A; B comes after A)
D                  G, A or T (not C; D comes after C)
H                  A, C or T (not G; H comes after G)
V                  G, C or A (not T, not U; V comes after U)
N                  A, G, C or T (aNy)
X                  masked
-                  gap of indeterminate length
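
The ambiguity codes in this table can be held in a lookup table, for example to test whether a concrete base is covered by a code. This is a minimal sketch: the names IUPAC_NUCLEOTIDES and matches are assumptions, and X (masked) and - (gap) are left out because they do not stand for a set of bases.

```python
# Each code maps to the bases it can stand for, as listed in the table.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def matches(code, base):
    """Return True if `base` is one of the bases that `code` stands for.

    Codes not in the table (including X and -) match nothing.
    """
    return base.upper() in IUPAC_NUCLEOTIDES.get(code.upper(), "")
```

For example, matches("R", "A") is true because R covers G and A, while matches("B", "A") is false because B means "not A".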


The amino acid codes supported are:

Amino Acid Code  Meaning
A                Alanine
B                Aspartic acid or Asparagine
C                Cysteine
D                Aspartic acid
E                Glutamic acid
F                Phenylalanine
G                Glycine
H                Histidine
I                Isoleucine
K                Lysine
L                Leucine
M                Methionine
N                Asparagine
P                Proline
Q                Glutamine
R                Arginine
S                Serine
T                Threonine
U                Selenocysteine
V                Valine
W                Tryptophan
Y                Tyrosine
Z                Glutamic acid or Glutamine
X                any
*                translation stop
-                gap of indeterminate length
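
A peptide sequence can be checked against the codes in this table with a simple set comparison. The names AMINO_ACID_CODES and invalid_residues are assumptions made for illustration. The set contains exactly the symbols listed above, so letters that are not in the table (such as J and O) are reported as invalid.

```python
# The 26 symbols listed in the amino acid table, including X, * and -.
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def invalid_residues(sequence):
    """Return a sorted list of the characters that are not in the table.

    Lower-case letters are mapped to upper-case before the check.
    """
    return sorted(set(sequence.upper()) - AMINO_ACID_CODES)
```

An empty list means every character of the sequence is an accepted code.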

converters

Readseq for converting sequence formats to FASTA
Nexus to FASTA converter
GenBank to FASTA converter

fasta.txt · Last modified: 2008/07/22 13:31 by 127.0.0.1