Differences

This shows you the differences between two versions of the page.

--- fasta [2007/12/17 15:35] – heidi
+++ fasta [2008/07/22 13:31] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
 ====== FASTA ======
+**[[http://en.wikipedia.org/wiki/FASTA_format|wikipedia: FASTA format]]**\\
+[[http://www.ncbi.nlm.nih.gov/blast/fasta.shtml|NCBI's FASTA format description]]\\
+\\
 FASTA format is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows sequence names and comments to precede the sequences.
 ===== format information =====
   * text based
+  * there is no standard file extension for a text file containing FASTA formatted sequences; common extensions include .fa, .mpfa, .fna, .fsa, .fas and .fasta
 ===== Data type handled =====
   * nucleic acid sequences
   * peptide sequences
 ===== file format =====
   * begins with a single-line description, followed by lines of sequence data
-  * description line:
-    * ">" symbol in the first column
-    * word following is the identifier of the sequence (optional)
-    * rest of the line is the description (optional)
-    * no space between the ">" and the first letter of the identifier
   * It is recommended that all lines of text be shorter than 80 characters
   * The sequence ends if another line starting with a ">" appears (this indicates the start of another sequence)
-**simple example:**
+**simple examples:**
 <code>
 >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
@@ Line 27: / Line 34: @@
 GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
 IENY
+</code>
+example of a multi-sequence FASTA file:
+<code>
+>SEQUENCE_1
+MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
+LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
+IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
+MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
+>SEQUENCE_2
+SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
+ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
 </code>
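The record structure described above (a ">" line starts a record, and the following lines up to the next ">" are its sequence) can be sketched as a small reader. This is a minimal illustration, not a reference implementation: the function name ''parse_fasta'' is hypothetical, and skipping blank lines and any sequence text that appears before the first header is an assumption made here, not something the format description specifies.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file.

    The header is returned without the leading ">"; the sequence lines
    of one record are joined into a single string.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new ">" line ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
        # assumption: sequence text before the first header is ignored
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = """>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
"""
    for name, sequence in parse_fasta(example.splitlines()):
        print(name, len(sequence))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so large files do not need to be read into memory at once.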
 === Header line ===
   * begins with ">"
-  * gives a name and/or a unique identifier for the sequence
+  * the word following the ">" is the identifier and/or name of the sequence (optional)
-  * and other informations
+  * rest of the line is the description (optional)
-  * Many different sequence databases use standardized headers, which helps when automatically extracting information from the header.
+  * no space between the ">" and the first letter of the identifier
-  * header line may contain more than one header separated by a ^A (Control-A) character.
+  * header line may contain more than one header separated by a ^A (Control-A) character
+\\
+  * Sequence identifiers:
+    * Many different sequence databases use standardized headers, which helps when automatically extracting information from the header
+    * NCBI defined a standard for the unique identifier
+    * NCBI does not give a definitive description of the FASTA defline format; the following table is an attempt to summarize it:
+| GenBank                          | ''gi|gi-number|gb|accession|locus'' |
+| EMBL Data Library                | ''gi|gi-number|emb|accession|locus'' |
+| DDBJ, DNA Database of Japan      | ''gi|gi-number|dbj|accession|locus'' |
+| NBRF PIR                         | ''pir||entry'' |
+| Protein Research Foundation      | ''prf||name'' |
+| SWISS-PROT                       | ''sp|accession|name'' |
+| Brookhaven Protein Data Bank (1) | ''pdb|entry|chain'' |
+| Brookhaven Protein Data Bank (2) | ''entry:chain|PDBID|CHAIN|SEQUENCE'' |
+| Patents                          | ''pat|country|number'' |
+| GenInfo Backbone Id              | ''bbs|number'' |
+| General database identifier      | ''gnl|database|identifier'' |
+| NCBI Reference Sequence          | ''ref|accession|locus'' |
+| Local Sequence identifier        | ''lcl|identifier'' |
+//Note: The gi number is a sequence of digits that identifies a database entry at NCBI.//
+\\
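To illustrate how the header conventions above might be used, the following sketch splits a header line into identifier and description and then reads the pipe-separated identifier fields. All names (''DEFLINE_FIELDS'', ''split_headers'', ''split_header'', ''parse_identifier'') are hypothetical. The field lists are transcribed from the table above, treating ''gi'' as a tag of its own that may precede another tag; that reading is an assumption, and the second Brookhaven form (''entry:chain|...'') is not handled.

```python
from typing import Dict, List, Tuple

# Field names that follow each database tag, transcribed from the table.
# An empty string marks a position that the table leaves empty (e.g. "pir||entry").
DEFLINE_FIELDS: Dict[str, List[str]] = {
    "gi": ["gi-number"],
    "gb": ["accession", "locus"],
    "emb": ["accession", "locus"],
    "dbj": ["accession", "locus"],
    "pir": ["", "entry"],
    "prf": ["", "name"],
    "sp": ["accession", "name"],
    "pdb": ["entry", "chain"],
    "pat": ["country", "number"],
    "bbs": ["number"],
    "gnl": ["database", "identifier"],
    "ref": ["accession", "locus"],
    "lcl": ["identifier"],
}


def split_headers(line: str) -> List[str]:
    """Strip the leading ">" and split on Control-A, which may separate several headers."""
    return line.lstrip(">").split("\x01")


def split_header(header: str) -> Tuple[str, str]:
    """Split one header into (identifier, description) at the first space."""
    identifier, _, description = header.partition(" ")
    return identifier, description.strip()


def parse_identifier(identifier: str) -> Dict[str, Dict[str, str]]:
    """Map each recognized tag in the identifier to its named fields."""
    tokens = identifier.split("|")
    parsed: Dict[str, Dict[str, str]] = {}
    i = 0
    while i < len(tokens):
        names = DEFLINE_FIELDS.get(tokens[i])
        if names is None:
            break  # unknown tag or trailing empty token: stop parsing
        values = tokens[i + 1 : i + 1 + len(names)]
        parsed[tokens[i]] = {n: v for n, v in zip(names, values) if n}
        i += 1 + len(names)
    return parsed


if __name__ == "__main__":
    line = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
    for header in split_headers(line):
        identifier, description = split_header(header)
        print(parse_identifier(identifier), description)
```

Identifiers that do not start with a tag from the table (for example a plain name such as ''SEQUENCE_1'') yield an empty result, so callers would fall back to using the identifier as-is.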
 === Sequence representation ===
   * After the header line and comments
@@ Line 70: / Line 111: @@
 |       -           | gap of indeterminate length |
+\\
 The amino acid codes supported are:
@@ Line 102: / Line 144: @@
-===== How to cite =====
+===== converter =====
+[[http://iubio.bio.indiana.edu/soft/molbio/readseq/|Readseq]] for converting sequence formats to FASTA \\
+[[http://www.bugaco.com/bioinf/|Nexus to Fasta converter]]\\
+[[http://gp2fasta.ovh.org/|GenBank to Fasta converter]]