fasta [2007/12/14 15:22] – created heidifasta [2008/07/22 13:31] (current) – external edit 127.0.0.1
 ====== FASTA ======
 -===== Program information =====
 +**[[http://en.wikipedia.org/wiki/FASTA_format|Wikipedia: FASTA format]]**\\ 
 +[[http://www.ncbi.nlm.nih.gov/blast/fasta.shtml|NCBI's FASTA format description]]\\ 
 + 
 +\\ 
 +FASTA format is a text-based format for representing nucleic acid or peptide sequences, in which nucleotides or amino acids are written as single-letter codes. The format also allows sequence names and comments to precede the sequences. 
 + 
 + 
 +===== format information ===== 
 +  * text based 
 +  * There is no standard file extension for a text file containing FASTA-formatted sequences; common choices include .fa, .mpfa, .fna, .fsa, .fas and .fasta 
 ===== Data type handled =====
 -===== Input Files =====
 -===== How to cite =====
 +  * nucleic acid sequences 
 +  * peptide sequences 
 + 
 + 
 + 
 + 
 + 
 + 
 + 
 + 
 +===== file format ===== 
 +  * Each sequence begins with a single-line description (the header line), followed by lines of sequence data 
 +  * It is recommended that all lines of text be shorter than 80 characters 
 +  * A sequence ends at the next line starting with ">" (which marks the start of another sequence) or at the end of the file 
 +**Simple example:** 
 +<code> 
 +>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] 
 +LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV 
 +EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG 
 +LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL 
 +GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX 
 +IENY 
 +</code> 
 + 
 +Example of a multi-sequence FASTA file: 
 +<code> 
 +>SEQUENCE_1 
 +MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG 
 +LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK 
 +IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL 
 +MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL 
 +>SEQUENCE_2 
 +SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI 
 +ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH 
 +</code> 
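
The rules above (a ">" line opens a record, the following lines hold the sequence, the next ">" closes it) can be sketched as a small reader. This is a minimal illustration rather than a reference implementation; the function name ''parse_fasta'' is an assumption made for this example, and the sample data is a shortened copy of the multi-sequence file shown above.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text.

    The header is returned without its leading ">"; sequence lines
    belonging to one record are joined into a single string.
    """
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # The end of the input closes the last record.
    if header is not None:
        yield header, "".join(chunks)


example = """>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
"""

records = list(parse_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```

Lines that appear before the first header are silently skipped in this sketch; a stricter reader might reject them instead.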
 + 
 +=== Header line ===
 +  * begins with ">" 
 +  * the first word after the ">" is the identifier and/or name of the sequence (optional) 
 +  * the rest of the line is the description (optional) 
 +  * there must be no space between the ">" and the first letter of the identifier 
 +  * a header line may contain more than one header, separated by a ^A (Control-A) character 
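
As an illustration of the header rules listed above, the following sketch splits a header line into identifier and description, and handles several headers joined by Control-A. The function name ''split_header'' is an assumption made for this example.

```python
from typing import List, Tuple


def split_header(line: str) -> List[Tuple[str, str]]:
    """Split a FASTA header line into (identifier, description) pairs.

    A line normally yields one pair; several headers joined by the
    Control-A character (\\x01) yield one pair each.
    """
    if not line.startswith(">"):
        raise ValueError("a FASTA header line must begin with '>'")
    pairs = []
    for part in line[1:].rstrip("\r\n").split("\x01"):
        # The identifier is the first word; the rest is the description.
        identifier, _, description = part.partition(" ")
        pairs.append((identifier, description.strip()))
    return pairs


parsed = split_header(
    ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
)
print(parsed)
```

Both parts are optional, so a header such as ">SEQUENCE_1" yields an identifier with an empty description.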
 + 
 +\\ 
 +  * Sequence identifiers: 
 +    * Many sequence databases use standardized headers, which makes it easier to extract information from the header automatically 
 +    * NCBI defined a standard for the unique identifier 
 +    * NCBI does not give a definitive description of the FASTA defline format; the following table is an attempt to summarize it: 
 + 
 +| GenBank                          | ''%%gi|gi-number|gb|accession|locus%%'' |
 +| EMBL Data Library                | ''%%gi|gi-number|emb|accession|locus%%'' |
 +| DDBJ, DNA Database of Japan      | ''%%gi|gi-number|dbj|accession|locus%%'' |
 +| NBRF PIR                         | ''%%pir||entry%%'' |
 +| Protein Research Foundation      | ''%%prf||name%%'' |
 +| SWISS-PROT                       | ''%%sp|accession|name%%'' |
 +| Brookhaven Protein Data Bank (1) | ''%%pdb|entry|chain%%'' |
 +| Brookhaven Protein Data Bank (2) | ''%%entry:chain|PDBID|CHAIN|SEQUENCE%%'' |
 +| Patents                          | ''%%pat|country|number%%'' |
 +| GenInfo Backbone Id              | ''%%bbs|number%%'' |
 +| General database identifier      | ''%%gnl|database|identifier%%'' |
 +| NCBI Reference Sequence          | ''%%ref|accession|locus%%'' |
 +| Local Sequence identifier        | ''%%lcl|identifier%%'' |
 +//Note: The gi number is a series of digits that identifies a database entry at NCBI.// 
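
Because the identifier fields are separated by "|", an identifier can be taken apart by splitting on that character. The sketch below does this for most of the patterns in the table above. The names ''FIELD_NAMES'' and ''parse_identifier'' and the result key ''tag'' are assumptions made for this example; the ''pir'', ''prf'' and second PDB patterns are left out for brevity.

```python
from typing import Dict

# Field names per database tag, following the table of identifier patterns.
FIELD_NAMES = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "sp": ("accession", "name"),
    "ref": ("accession", "locus"),
    "pdb": ("entry", "chain"),
    "pat": ("country", "number"),
    "bbs": ("number",),
    "gnl": ("database", "identifier"),
    "lcl": ("identifier",),
}


def parse_identifier(identifier: str) -> Dict[str, str]:
    """Split an NCBI-style identifier into named fields."""
    tokens = identifier.split("|")
    result: Dict[str, str] = {}
    # An optional leading "gi|gi-number" pair comes before the database tag.
    if tokens[0] == "gi":
        if len(tokens) < 2:
            raise ValueError("missing gi number")
        result["gi"] = tokens[1]
        tokens = tokens[2:]
    if not tokens:
        return result
    tag = tokens[0]
    if tag not in FIELD_NAMES:
        raise ValueError("unknown database tag: %r" % tag)
    result["tag"] = tag
    # Pair the remaining tokens with the field names for this tag.
    for name, value in zip(FIELD_NAMES[tag], tokens[1:]):
        result[name] = value
    return result


fields = parse_identifier("gi|5524211|gb|AAD44166.1|")
print(fields)
```

In the example identifier the locus field is present but empty, so it is returned as an empty string.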
 + 
 +\\ 
 +=== Sequence representation ===
 +  * Sequence data follows the header line and any comments 
 +  * Each line of a sequence should have fewer than 80 characters 
 +  * Sequences may be protein sequences or nucleic acid sequences 
 +  * Sequences can contain gaps or alignment characters 
 +  * Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:  
 +    * lower-case letters are accepted and are mapped into upper-case 
 +    * a single hyphen or dash can be used to represent a gap character 
 +    * in amino acid sequences, U and * are acceptable letters 
 +  * Numerical digits are not allowed in the sequence itself, although some databases use them to indicate the position in the sequence 
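
A reader that wants to tolerate position numbers and lower-case input can normalize the sequence text before using it. The following is a minimal sketch of that idea; the function name ''clean_sequence'' is an assumption made for this example, and gap ("-") and stop ("*") characters are kept unchanged.

```python
def clean_sequence(raw: str) -> str:
    """Drop whitespace and position digits, then map letters to upper case."""
    kept = [ch for ch in raw if not ch.isspace() and not ch.isdigit()]
    return "".join(kept).upper()


cleaned = clean_sequence("  1 acgtacgtac gtac-gtnn\n 21 ttga*")
print(cleaned)
```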
 + 
 +\\ 
 +The nucleic acid codes supported are: 
 + 
 +^ Nucleic Acid Code ^  Meaning  ^   
 +| A               | Adenosine | 
 +| C               | Cytidine | 
 +| G               | Guanine | 
 +| T               | Thymidine | 
 +| U               | Uracil | 
 +| R               | G A (puRine) | 
 +| Y               | T C (pYrimidine) | 
 +| K               | G T (Ketone) | 
 +| M               | A C (aMino group) | 
 +| S               | G C (Strong interaction) | 
 +| W               | A T (Weak interaction) | 
 +| B               | G T C (not A) (B comes after A) | 
 +| D               | G A T (not C) (D comes after C) | 
 +| H               | A C T (not G) (H comes after G) | 
 +| V               | G C A (not T, not U) (V comes after U) | 
 +| N               | A G C T (aNy) | 
 +| X               | masked | 
 +| -               | gap of indeterminate length | 
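
The ambiguity codes can be stored as a lookup from each code to the bases it stands for, which makes it easy to test whether a concrete base is compatible with a code. This is a minimal sketch using the standard IUPAC letters; the names ''IUPAC_BASES'', ''matches'' and ''pattern_matches'' are assumptions made for this example. The masked and gap symbols are deliberately left out, so they match nothing here.

```python
# Each nucleic acid code mapped to the bases it can stand for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def matches(code: str, base: str) -> bool:
    """Return True if the concrete base is one of those the code allows."""
    return base.upper() in IUPAC_BASES.get(code.upper(), "")


def pattern_matches(pattern: str, sequence: str) -> bool:
    """Compare an ambiguous pattern with a sequence of the same length."""
    if len(pattern) != len(sequence):
        return False
    return all(matches(c, b) for c, b in zip(pattern, sequence))


print(matches("R", "A"))
print(pattern_matches("ACNRY", "ACTGC"))
```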
 + 
 +\\ 
 +The amino acid codes supported are: 
 + 
 +^ Amino Acid Code ^ Meaning ^    
 +| A             | Alanine | 
 +| B             | Aspartic acid or Asparagine | 
 +| C             | Cysteine | 
 +| D             | Aspartic acid | 
 +| E             | Glutamic acid | 
 +| F             | Phenylalanine | 
 +| G             | Glycine | 
 +| H             | Histidine | 
 +| I             | Isoleucine | 
 +| K             | Lysine | 
 +| L             | Leucine | 
 +| M             | Methionine | 
 +| N             | Asparagine | 
 +| P             | Proline | 
 +| Q             | Glutamine | 
 +| R             | Arginine | 
 +| S             | Serine | 
 +| T             | Threonine | 
 +| U             | Selenocysteine | 
 +| V             | Valine | 
 +| W             | Tryptophan | 
 +| Y             | Tyrosine | 
 +| Z             | Glutamic acid or Glutamine | 
 +| X             | any | 
 +| *             | translation stop | 
 +| -             | gap of indeterminate length | 
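
A simple check for a protein sequence is to compare its characters with the one-letter codes of the amino acids listed above, plus the stop and gap symbols. The sketch below reports every character outside that set; the names ''AMINO_ACID_CODES'' and ''invalid_residues'' are assumptions made for this example.

```python
# One-letter amino acid codes, plus "*" (translation stop) and "-" (gap).
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def invalid_residues(sequence: str) -> list:
    """Return the sorted characters that are not accepted amino acid codes."""
    return sorted(set(sequence.upper()) - AMINO_ACID_CODES)


print(invalid_residues("MTEITAAMVKELRESTGAG"))
print(invalid_residues("MKJ1O"))
```

An empty list means every character is an accepted code; lower-case input is mapped to upper case before the comparison.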
 + 
 + 
 + 
 + 
 + 
 + 
 +===== converter ===== 
 +[[http://iubio.bio.indiana.edu/soft/molbio/readseq/|Readseq]] for converting sequence formats to FASTA \\ 
 +[[http://www.bugaco.com/bioinf/|Nexus to Fasta converter]]\\ 
 +[[http://gp2fasta.ovh.org/|GenBank to Fasta converter]]
fasta.1197642149.txt.gz · Last modified: 2008/07/22 13:30 (external edit)