====== MEGA ======

{{mega_logo.jpg}} \\
**[[http://www.megasoftware.net/|MEGA]]** \\
[[http://www.megasoftware.net/mega4.pdf|documentation]] \\
Version 5 (April 24, 2011) \\
MEGA is an integrated tool for conducting automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, and testing evolutionary hypotheses.

===== Program information =====

  * Windows XP, Vista, 7 (with at least 64 MB of RAM and 20 MB of available hard disk space)
  * MEGA can also be run on other operating systems for which Windows emulators are available:
    * Macintosh: Windows using VirtualPC
    * Sun Workstation: SoftWindows95
    * Linux: Windows using VMWare

===== Data type handled =====

  * DNA
  * RNA
  * nucleotide
  * distance
  * (protein sequences)

===== Input Files =====

  * ASCII-text files
  * extension: *.MEG
  * Importing Data from Other Formats:
    * CLUSTAL
    * [[NEXUS]]
    * [[PHYLIP]] (Interleaved/Noninterleaved)
    * GCG
    * [[FASTA]]
    * PIR
    * NBRF
    * MSF
    * IG
    * Internet (NCBI) XML format

==== Common Features ====

  * first line: must contain the keyword ''#MEGA''
  * second line: the data file may contain a succinct description of the data (called the Title). The **Title statement** is written according to a set of rules:
    * always begins with ''!Title'' and ends with a semicolon
    * must not occupy more than one line of text
    * must not contain a semicolon inside the statement
    * example: ''#mega'' \\ ''!Title This is an example title;''
  * third line: **Description statement**: a more descriptive, multi-line account of the data.
    * always begins with ''!Description'' and ends with a semicolon
    * may occupy multiple lines
    * must not contain a semicolon inside the statement
    * example: ''#mega'' \\ ''!Title This is an example title;'' \\ ''!Description This is detailed information the data file;''
  * **Format statement**: includes information on the type of data present in the file and some of its attributes.
    * written after the Title or the Description statement
    * contains one or more command statements
    * a command statement contains a command and a valid setting keyword, in the form ''command=keyword''

\\

  * Comments:
    * may appear anywhere in the data file
    * can span multiple lines
    * are enclosed in square brackets ([ and ])
    * can be nested
  * Keywords:
    * can be written in any combination of lower- and upper-case letters
  * Taxa Names:
    * ''#'' sign: every label must be written on a new line, and a ''#'' sign must precede the label
    * no restrictions on the length of the labels
    * not required to be unique (although identical labels may result in ambiguities and should be avoided)
    * must start with an alphanumeric character (0-9, a-z, and A-Z) or a special character: ''-, + or .''
    * after the first character, taxa labels may contain the following additional special characters: ''_, *, :, ( ), |, \, /''
    * for multiple-word labels, an underscore can be used to represent a blank space

\\

==== Sequence Input Data ====

  * must consist of two or more sequences of equal length
  * sequences must be aligned
  * written in IUPAC single-letter codes
  * sequences can be written in any combination of upper- and lower-case letters
  * spaces and tabs are ignored
  * commonly used special symbols: period (.) for identical sites, dash (-) for alignment gaps, question mark (?) for missing data

\\

  * **Keywords for Format Statement:**

^ Command ^ Setting ^ Remark ^ Example ^
| DataType | DNA, RNA, nucleotide, protein | | DataType=DNA |
| NSeqs | integer | Number of sequences | NSeqs=85 |
| NTaxa | integer | Synonymous with NSeqs | NTaxa=85 |
| NSites | integer | Number of nucleotides | NSites=4592 |
| Property | Exon, Intron, Coding, Noncoding, and End | Specifies whether a domain is protein coding. Exon and Coding are synonymous, as are Intron and Noncoding. End specifies that the domain with the given name ends at this point | Property=Coding |
| Indel | single character | use a dash (-) to identify insertions/deletions | Indel = - |
| Identical | single character | use a period (.) to show identity with the first sequence | Identical = . |
| MatchChar | single character | Synonymous with the Identical keyword | MatchChar = . |
| Missing | single character | use a question mark (?) to indicate missing data | Missing = ? |
| CodeTable | A name | Gives the name of the code table for the protein-coding domains of the data | CodeTable = Standard |

  * **Defining Genes and Domains:**
    * attributes of different sites (and groups of sites, termed domains) are specified within the data "on the spot" rather than in an attributes block before or after the actual data.

^ Command ^ Setting ^ Remark ^ Example ^
| Domain | A name | defines a domain with the given name | Domain=first_exon |
| Gene | A name | defines a gene with the given name | Gene=cytb |
| Property | Exon, Intron, Coding, Noncoding, and End | specifies the protein-coding attribute for a domain | Property=Coding |
| CodonStart | A number | specifies the site where the next 1st-codon position will be found in a protein-coding domain | CodonStart=2 |

  * **Defining Groups:**
    * taxa can be assigned to groups in sequence data files as well as in distance data files.
    * the name of the group is written in a set of curly brackets ({}) following the taxa name. The group name can be attached to the taxa name with an underscore or simply appended.
    * there must be no spaces between the taxa name and the group name
  * **Labelling Individual Sites:**
    * individual sites in nucleotide or amino acid data can be labeled to construct non-contiguous sets of sites.
    * each site can be associated with only one label
    * a label can be a letter or a number.
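The rules above (the ''#MEGA'' keyword line, statements that start with ''!'' and end with a semicolon, nested square-bracket comments, ''#''-prefixed taxa labels with an optional ''{group}'' suffix) can be illustrated with a short Python sketch. This is only a minimal reader written for this page, not MEGA's own parser: the function names ''strip_comments'' and ''read_mega'' are hypothetical, Gene/Domain/Label statements are skipped, and the Identical symbol (.) is not expanded.

```python
import re


def strip_comments(text):
    """Remove [ ... ] comments, which may be nested and span several lines."""
    kept, depth = [], 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


def read_mega(text):
    """Hypothetical helper: read a small subset of the MEGA sequence format."""
    lines = strip_comments(text).strip().splitlines()
    if not lines or not lines[0].strip().lower().startswith("#mega"):
        raise ValueError("first line must contain the #MEGA keyword")
    body = "\n".join(lines[1:])

    result = {"title": None, "description": None, "format": {},
              "groups": {}, "sequences": {}}

    # Statements start with '!' and run up to the next semicolon.
    for stmt in re.findall(r"!([^;]*);", body):
        m = re.match(r"\w+", stmt)
        if not m:
            continue
        name = m.group(0).lower()          # keywords are case-insensitive
        rest = stmt[m.end():].strip().lstrip(":").strip()
        if name == "title":
            result["title"] = rest
        elif name == "description":
            result["description"] = " ".join(rest.split())
        elif name == "format":
            pairs = re.findall(r"(\w+)\s*=\s*(\S+)", rest)
            result["format"].update({k.lower(): v for k, v in pairs})
        # Gene/Domain/Label statements are ignored in this sketch.

    # Drop the statements so that only taxa labels and sequence data remain.
    body = re.sub(r"![^;]*;", "\n", body)

    current = None
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("#"):
            parts = line[1:].split(None, 1)
            if not parts:
                continue
            label = parts[0]
            seq = parts[1] if len(parts) > 1 else ""
            # An optional {group} may be appended directly or with an underscore.
            m = re.fullmatch(r"(.+?)_?\{([^}]*)\}", label)
            if m:
                label, group = m.groups()
                result["groups"][label] = group
            current = label
        elif current and line:
            seq = line                      # continuation of the current taxon
        else:
            continue
        # Spaces and tabs inside sequences are ignored.
        result["sequences"][current] = (
            result["sequences"].get(current, "") + "".join(seq.split()))
    return result


sample = """#mega
!Title This is an example title;
!Format DataType=DNA NSeqs=2 Indel = -;
[ a comment [nested] here ]
#Human_{Mammal} ATG GTT
#Mouse_{Mammal} ATG-TT
!Gene=SecondGene Domain=Intron Property=Noncoding;
#Human ATT
#Mouse GTA
"""
data = read_mega(sample)
print(data["format"])
print(data["sequences"])
```

Blocks that repeat a taxon name (as happens when a file defines several genes or domains) are concatenated per taxon, so the resulting sequences can be checked for equal length as the format requires.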

=== example ===

  !Gene=FirstGene Domain=Exon1 Property=Coding;
  #Human_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCTCAAT
  #Mouse_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCCCAAT
  #Chicken_{Aves} ATGGTTTCTAGTCAGCTCACCATGATAGGTCTCAAT
  !Gene=SecondGene Domain=Intron Property=Noncoding;
  #Human ATTCCCAGGGAATTCCCGGGGGGTTTAAGGCCCCTTTAAAGAAAGAT
  #Mouse GTAGCGCGCGTCGTCAGAGCTCCCAAGGGTAGCAGTCACAGAAAGAT
  #Chicken GTAAAAAAAAAAGTCAGAGCTCCCCCCAATATATATCACAGAAAGAT
  !Gene=ThirdGene Domain=Exon2 Property=Coding;
  #Human ATCTGCTCTCGAGTACTGATACAAATGACTTCTGCGTACAACTGA
  #Mouse ATCTGATCTCGTGTGCTGGTACGAATGATTTCTGCGTTCAACTGA
  #Chicken ATCTGCTCTCGAGTACTGCTACCAATGACTTCTGCGTACAACTGA
  !Label +++__-+++-a-+++-L-+++-k-+++123+++-_-+++---+++;

==== Distance Input Data ====

  * distances are written as a lower-left or an upper-right triangular matrix
  * after the ''#mega'', ''!Title'', ''!Description'', and ''!Format'' statements (some of which are optional), all the taxa names are written (see below)
  * taxa names are followed by the distance matrix

\\

  * **Keywords for Format Statement:**

^ Command ^ Setting ^ Remark ^ Example ^
| DataType | Distance | Specifies that the file contains distance data | DataType=distance |
| NSeqs | integer | Number of sequences | NSeqs=85 |
| NTaxa | integer | Same as NSeqs | NTaxa=85 |
| DataFormat | Lowerleft, upperright | Specifies whether the data is in a lower-left or an upper-right triangular matrix | DataFormat=lowerleft |

  * **Defining Groups:**
    * see above

=== example ===

  #mega
  !Title: Concatenated Files;
  !Format DataType=Distance DataFormat=LowerLeft NTaxa=6;
  #Rodent
  #Primate
  #Lagomorpha
  #Artiodactyla
  #Carnivora
  #Perissodactyla
  0.514
  0.535 0.436
  0.530 0.388 0.418
  0.521 0.353 0.417 0.345
  0.500 0.331 0.402 0.327 0.349

===== How to cite =====

Citation for MEGA 5:
  * Tamura K, Peterson D, Peterson N, Stecher G, Nei M, and Kumar S (2011) MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Molecular Biology and Evolution doi: 10.1093/molbev/msr121.

\\

Citation for MEGA 4:
  * Tamura K, Dudley J, Nei M, and Kumar S (2007) MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596-1599.