fasta [2007/12/17 14:47] heidifasta [2008/07/22 13:31] (current) – external edit 127.0.0.1
====== FASTA ======
**[[http://en.wikipedia.org/wiki/FASTA_format|wikipedia: FASTA format]]**\\
[[http://www.ncbi.nlm.nih.gov/blast/fasta.shtml|NCBI's FASTA format description]]\\

\\
FASTA format is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.

===== format information =====
  * text based
  * there is no standard file extension for a text file containing FASTA formatted sequences; FASTA files often have extensions like .fa, .mpfa, .fna, .fsa, .fas or .fasta

===== Data type handled =====
  * nucleic acid sequences
  * peptide sequences

===== file format =====
  * begins with a single-line description, followed by lines of sequence data
  * It is recommended that all lines of text be shorter than 80 characters
  * The sequence ends when another line starting with a ">" appears (this indicates the start of another sequence)
**simple example:**
<code>
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
</code>

example of a multiple sequence FASTA file:
<code>
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
</code>
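
The record structure described above (a ">" line opens a record, and every following line up to the next ">" belongs to its sequence) can be read with a few lines of Python. This is only a minimal sketch, not a reference parser: the function name ''parse_fasta'' is made up for this page, and it assumes blank lines may be skipped and that any text before the first ">" can be ignored.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    The header is returned without the leading ">"; the sequence lines
    of one record are joined into a single string.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new ">" line ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
        # Lines before the first ">" are ignored (assumption).
    if header is not None:
        yield header, "".join(chunks)


# Shortened version of the multiple sequence example above.
example = """\
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
"""

records = list(parse_fasta(example.splitlines()))
for header, sequence in records:
    print(header, len(sequence))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so large files do not have to be loaded into memory at once.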

=== Header line ===
  * begins with ">"
  * the word following the ">" is the identifier and/or name of the sequence (optional)
  * the rest of the line is the description (optional)
  * there is no space between the ">" and the first letter of the identifier
  * the header line may contain more than one header, separated by a ^A (Control-A) character

\\
  * Sequence identifiers:
    * Many different sequence databases use standardized headers, which helps when automatically extracting information from the header
    * NCBI defined a standard for the unique identifier
    * NCBI does not give a definitive description of the FASTA defline format; the following table is an attempt to summarize the identifier formats in use:

| GenBank                          | ''gi|gi-number|gb|accession|locus'' |
| EMBL Data Library                | ''gi|gi-number|emb|accession|locus'' |
| DDBJ, DNA Database of Japan      | ''gi|gi-number|dbj|accession|locus'' |
| NBRF PIR                         | ''pir||entry'' |
| Protein Research Foundation      | ''prf||name'' |
| SWISS-PROT                       | ''sp|accession|name'' |
| Brookhaven Protein Data Bank (1) | ''pdb|entry|chain'' |
| Brookhaven Protein Data Bank (2) | ''entry:chain|PDBID|CHAIN|SEQUENCE'' |
| Patents                          | ''pat|country|number'' |
| GenInfo Backbone Id              | ''bbs|number'' |
| General database identifier      | ''gnl|database|identifier'' |
| NCBI Reference Sequence          | ''ref|accession|locus'' |
| Local Sequence identifier        | ''lcl|identifier'' |
//Note: the gi number is a sequence of digits that identifies a database entry at NCBI.//
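
The identifier layouts in the table above can be turned into a small lookup that splits a header into named fields. The following is a hedged sketch only: ''DEFLINE_FIELDS'', ''split_headers'', ''parse_identifier'' and ''parse_header'' are names invented for this page, the field names are copied from the table rather than from an official specification, and the second Protein Data Bank layout (''entry:chain|...'') is deliberately not handled.

```python
from typing import Dict, List

# Tag -> names of the "|"-separated fields that follow it, as listed in the
# table above. An empty name marks a field the table leaves blank.
DEFLINE_FIELDS: Dict[str, List[str]] = {
    "gi": ["gi-number"],
    "gb": ["accession", "locus"],
    "emb": ["accession", "locus"],
    "dbj": ["accession", "locus"],
    "pir": ["", "entry"],
    "prf": ["", "name"],
    "sp": ["accession", "name"],
    "pdb": ["entry", "chain"],
    "pat": ["country", "number"],
    "bbs": ["number"],
    "gnl": ["database", "identifier"],
    "ref": ["accession", "locus"],
    "lcl": ["identifier"],
}


def split_headers(header_line: str) -> List[str]:
    """Split a header line into its individual headers at Control-A."""
    return header_line.lstrip(">").split("\x01")


def parse_identifier(identifier: str) -> Dict[str, Dict[str, str]]:
    """Map each recognized tag in an identifier to its named fields."""
    tokens = identifier.split("|")
    parsed: Dict[str, Dict[str, str]] = {}
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        names = DEFLINE_FIELDS.get(tag)
        if names is None:
            break  # unknown tag: stop instead of guessing a layout
        values = tokens[i + 1 : i + 1 + len(names)]
        for name, value in zip(names, values):
            if name and value:
                parsed.setdefault(tag, {})[name] = value
        i += 1 + len(names)
    return parsed


def parse_header(header: str) -> dict:
    """Split one header into identifier, description and parsed fields."""
    identifier, _, description = header.partition(" ")
    return {
        "identifier": identifier,
        "description": description.strip(),
        "fields": parse_identifier(identifier),
    }


info = parse_header("gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]")
print(info["fields"])
print(info["description"])
```

Headers that do not follow any of the listed layouts (for example a plain name such as ''SEQUENCE_1'') simply yield an empty field mapping, so the sketch can be applied to arbitrary FASTA files without raising errors.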

\\
=== Sequence representation ===
  * the sequence follows the header line and comments
  * each line of a sequence should have fewer than 80 characters
  * sequences may be protein sequences or nucleic acid sequences
  * sequences can contain gaps or alignment characters
  * Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
    * lower-case letters are accepted and are mapped into upper-case
    * a single hyphen or dash can be used to represent a gap character
    * in amino acid sequences, U and * are acceptable letters
  * Numerical digits are not allowed in the sequence itself, although some databases use them to indicate the position in the sequence
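
The rules above (upper-case mapping, no digits inside the sequence, lines shorter than 80 characters) can be applied when writing a record. The sketch below is one possible way to do it; ''clean_sequence'' and ''format_record'' are hypothetical helper names, and the default width of 70 characters is an arbitrary choice that merely stays under the recommended limit.

```python
def clean_sequence(raw: str) -> str:
    """Remove whitespace and position digits and map letters to upper case."""
    kept = (ch for ch in raw if not ch.isspace() and not ch.isdigit())
    return "".join(kept).upper()


def format_record(header: str, sequence: str, width: int = 70) -> str:
    """Return one FASTA record with sequence lines of at most `width` characters."""
    if not 0 < width < 80:
        raise ValueError("width should be between 1 and 79 characters")
    # Slice the sequence into consecutive chunks of `width` characters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + lines) + "\n"


# A database-style dump with position numbers and lower-case letters.
raw = """
   1 acgtacgtac gtacgtacgt
  21 nnnnacgt-- acgt
"""
print(format_record("lcl|example cleaned sequence", clean_sequence(raw), width=10))
```

The header itself is not wrapped, so a very long description line can still exceed the recommended length.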

\\
The nucleic acid codes supported are:

^ Nucleic Acid Code ^ Meaning ^
| A | Adenosine |
| C | Cytidine |
| G | Guanine |
| T | Thymidine |
| U | Uracil |
| R | G A (puRine) |
| Y | T C (pYrimidine) |
| K | G T (Ketone) |
| M | A C (aMino group) |
| S | G C (Strong interaction) |
| W | A T (Weak interaction) |
| B | G T C (not A) (B comes after A) |
| D | G A T (not C) (D comes after C) |
| H | A C T (not G) (H comes after G) |
| V | G C A (not T, not U) (V comes after U) |
| N | A G C T (aNy) |
| X | masked |
| - | gap of indeterminate length |
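
The ambiguity codes can be stored as a lookup from each standard IUPAC letter to the bases it stands for (R for G or A, Y for T or C, and so on up to N for any base). As an illustration, the sketch below lists every concrete sequence that a short ambiguous pattern can represent. The names ''IUPAC_NUCLEOTIDES'' and ''expand_ambiguous'' are made up for this example, and the masked and gap symbols are left out on purpose because they do not correspond to a set of bases.

```python
from itertools import product
from typing import Dict, List

# Each code maps to the bases it may stand for, in the order used in the table.
IUPAC_NUCLEOTIDES: Dict[str, str] = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def expand_ambiguous(sequence: str) -> List[str]:
    """Return all unambiguous sequences matched by an IUPAC-coded sequence.

    Raises KeyError for symbols without a base set (e.g. "X" or "-").
    """
    pools = [IUPAC_NUCLEOTIDES[ch] for ch in sequence.upper()]
    return ["".join(combo) for combo in product(*pools)]


print(expand_ambiguous("ARY"))
```

The number of results grows multiplicatively with every ambiguous position, so this approach is only practical for short patterns such as primers or motifs.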

\\
The amino acid codes supported are:

^ Amino Acid Code ^ Meaning ^
| A | Alanine |
| B | Aspartic acid or Asparagine |
| C | Cysteine |
| D | Aspartic acid |
| E | Glutamic acid |
| F | Phenylalanine |
| G | Glycine |
| H | Histidine |
| I | Isoleucine |
| K | Lysine |
| L | Leucine |
| M | Methionine |
| N | Asparagine |
| P | Proline |
| Q | Glutamine |
| R | Arginine |
| S | Serine |
| T | Threonine |
| U | Selenocysteine |
| V | Valine |
| W | Tryptophan |
| Y | Tyrosine |
| Z | Glutamic acid or Glutamine |
| X | any |
| * | translation stop |
| - | gap of indeterminate length |
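
A lookup built from the amino acid codes (using the standard one-letter assignments, A for Alanine through Z for Glutamic acid or Glutamine, plus X, * and -) makes it easy to check a peptide sequence for symbols that are not supported. This is a rough sketch with invented names (''AMINO_ACID_CODES'', ''invalid_residues''); it treats only the codes listed on this page as valid, so letters that other tools may accept are reported as invalid here.

```python
from typing import Dict, List

AMINO_ACID_CODES: Dict[str, str] = {
    "A": "Alanine",
    "B": "Aspartic acid or Asparagine",
    "C": "Cysteine",
    "D": "Aspartic acid",
    "E": "Glutamic acid",
    "F": "Phenylalanine",
    "G": "Glycine",
    "H": "Histidine",
    "I": "Isoleucine",
    "K": "Lysine",
    "L": "Leucine",
    "M": "Methionine",
    "N": "Asparagine",
    "P": "Proline",
    "Q": "Glutamine",
    "R": "Arginine",
    "S": "Serine",
    "T": "Threonine",
    "U": "Selenocysteine",
    "V": "Valine",
    "W": "Tryptophan",
    "Y": "Tyrosine",
    "Z": "Glutamic acid or Glutamine",
    "X": "any",
    "*": "translation stop",
    "-": "gap of indeterminate length",
}


def invalid_residues(sequence: str) -> List[str]:
    """Return the sorted, distinct symbols not listed in AMINO_ACID_CODES.

    Lower-case letters are mapped to upper case before checking.
    """
    return sorted({ch for ch in sequence.upper() if ch not in AMINO_ACID_CODES})


print(invalid_residues("MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG"))
print(invalid_residues("mkj1o*"))
```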

===== converter =====
[[http://iubio.bio.indiana.edu/soft/molbio/readseq/|Readseq]] for converting sequence formats to FASTA \\
[[http://www.bugaco.com/bioinf/|Nexus to Fasta converter]]\\
[[http://gp2fasta.ovh.org/|GenBank to Fasta converter]]
fasta.1197899270.txt.gz · Last modified: 2008/07/22 13:30 (external edit)