====== FASTA ======

**[[http://en.wikipedia.org/wiki/FASTA_format|wikipedia: FASTA format]]**\\
[[http://www.ncbi.nlm.nih.gov/blast/fasta.shtml|NCBI's FASTA format description]]\\
\\
FASTA format is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows sequence names and comments to precede the sequences.

===== format information =====
  * text based
  * no standard file extension; FASTA files often have extensions such as .fa, .mpfa, .fna, .fsa, .fas or .fasta

===== Data type handled =====
  * nucleic acid sequences
  * peptide sequences

===== file format =====
  * each sequence begins with a single-line description, followed by lines of sequence data
  * it is recommended that all lines of text be shorter than 80 characters
  * a sequence ends when another line starting with ">" appears (this indicates the start of another sequence)

**simple examples:**

  >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
  LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
  EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
  LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
  GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
  IENY

example of a multiple sequence FASTA file:

  >SEQUENCE_1
  MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
  LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
  IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
  MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
  >SEQUENCE_2
  SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
  ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

=== Header line ===
  * begins with ">"
  * the word following the ">" is the identifier and/or name of the sequence (optional)
  * the rest of the line is the description (optional)
  * no space between the ">" and the first letter of the identifier
  * the header line may contain more than one header, separated by a ^A (Control-A) character

\\
  * Sequence identifiers:
    * Many sequence databases use standardized headers, which helps when automatically extracting information from the header
    * NCBI defined a standard for the unique identifier
    * NCBI does not give a definitive description of the FASTA defline format; the following table is an attempt to summarize the identifier formats in use:

| GenBank | ''gi|gi-number|gb|accession|locus'' |
| EMBL Data Library | ''gi|gi-number|emb|accession|locus'' |
| DDBJ, DNA Database of Japan | ''gi|gi-number|dbj|accession|locus'' |
| NBRF PIR | ''pir||entry'' |
| Protein Research Foundation | ''prf||name'' |
| SWISS-PROT | ''sp|accession|name'' |
| Brookhaven Protein Data Bank (1) | ''pdb|entry|chain'' |
| Brookhaven Protein Data Bank (2) | ''entry:chain|PDBID|CHAIN|SEQUENCE'' |
| Patents | ''pat|country|number'' |
| GenInfo Backbone Id | ''bbs|number'' |
| General database identifier | ''gnl|database|identifier'' |
| NCBI Reference Sequence | ''ref|accession|locus'' |
| Local Sequence identifier | ''lcl|identifier'' |

//Note: the gi number is a string of digits that identifies an NCBI database entry.//

\\
=== Sequence representation ===
  * the sequence follows the header line and comments
  * each line of a sequence should have fewer than 80 characters
  * sequences may be protein sequences or nucleic acid sequences
  * sequences can contain gaps or alignment characters
  * sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions:
    * lower-case letters are accepted and are mapped into upper-case
    * a single hyphen or dash can be used to represent a gap character
    * in amino acid sequences, U and * are acceptable letters
  * numerical digits are not allowed, but some databases use them to indicate the position in the sequence

\\
The nucleic acid codes supported are:

^ Nucleic Acid Code ^ Meaning ^
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |
| U | Uracil |
| R | G or A (puRine) |
| Y | T or C (pYrimidine) |
| K | G or T (Ketone) |
| M | A or C (aMino group) |
| S | G or C (Strong interaction) |
| W | A or T (Weak interaction) |
| B | G, T or C (not A) (B comes after A) |
| D | G, A or T (not C) (D comes after C) |
| H | A, C or T (not G) (H comes after G) |
| V | G, C or A (not T, not U) (V comes after U) |
| N | A, G, C or T (aNy) |
| X | masked |
| - | gap of indeterminate length |

\\
The amino acid codes supported are:

^ Amino Acid Code ^ Meaning ^
| A | Alanine |
| B | Aspartic acid or Asparagine |
| C | Cysteine |
| D | Aspartic acid |
| E | Glutamic acid |
| F | Phenylalanine |
| G | Glycine |
| H | Histidine |
| I | Isoleucine |
| K | Lysine |
| L | Leucine |
| M | Methionine |
| N | Asparagine |
| P | Proline |
| Q | Glutamine |
| R | Arginine |
| S | Serine |
| T | Threonine |
| U | Selenocysteine |
| V | Valine |
| W | Tryptophan |
| Y | Tyrosine |
| Z | Glutamic acid or Glutamine |
| X | any |
| * | translation stop |
| - | gap of indeterminate length |

===== converter =====
[[http://iubio.bio.indiana.edu/soft/molbio/readseq/|Readseq]] for converting sequence formats to FASTA \\
[[http://www.bugaco.com/bioinf/|Nexus to Fasta converter]]\\
[[http://gp2fasta.ovh.org/|GenBank to Fasta converter]]
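
===== parsing sketch =====
The file format rules described on this page (a header line starting with ">", the first word as identifier, the rest as description, sequence lines running until the next ">", lower-case mapped to upper-case) can be illustrated with a small reader. This is only a minimal sketch and not a reference implementation: the names ''FastaRecord'', ''parse_fasta'' and ''_make_record'' are made up for this example, the description "demo protein" in the sample input is invented, and Control-A separated multi-headers and alphabet validation are deliberately left out.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastaRecord(NamedTuple):
    """One sequence entry (hypothetical container for this sketch)."""
    identifier: str   # first word after ">"
    description: str  # rest of the header line, may be empty
    sequence: str     # all sequence lines joined, upper-cased


def _make_record(header: str, chunks: List[str]) -> FastaRecord:
    # Split the header at the first space: identifier, then description.
    identifier, _, description = header.partition(" ")
    return FastaRecord(identifier, description.strip(), "".join(chunks))


def parse_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield one FastaRecord per ">" header found in the given lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous sequence, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data: lower-case letters are mapped to upper-case.
            chunks.append(line.upper())
    if header is not None:
        yield _make_record(header, chunks)


# Small usage example; the sequences are shortened from the examples above.
example = """>SEQUENCE_1 demo protein
mteitaamvk
ELRESTGAGM
>SEQUENCE_2
SATVSEINSE
"""

records = list(parse_fasta(example.splitlines()))
for record in records:
    print(record.identifier, len(record.sequence))
```

Because the reader works line by line and yields records lazily, it can also be fed an open file object instead of a list of strings. Sequence lines that appear before the first header are silently skipped in this sketch; a stricter reader might report them as an error.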