MEGA
Version 5 (April 24, 2011)
MEGA is an integrated tool for conducting automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, and testing evolutionary hypotheses.
Program information
- Windows XP, Vista, 7 (with at least 64 MB of RAM, 20 MB of available hard disk space)
- MEGA can also be run on other operating systems for which Windows emulators are available:
- Macintosh: Windows using VirtualPC
- Sun Workstation: SoftWindows95
- Linux: Windows using VMWare
Data type handled
- DNA
- RNA
- nucleotide
- distance
- (protein sequences)
Input Files
- ASCII-text files
- extension: *.MEG
- Importing Data from Other Formats:
- CLUSTAL
- PHYLIP (Interleaved/Noninterleaved)
- GCG
- PIR
- NBRF
- MSF
- IG
- Internet (NCBI) XML format
Common Features
- first line: must contain the keyword #MEGA
- second line: data file may contain a succinct description of the data (called Title). The Title statement is written according to a set of rules:
- always begins with !Title and ends with a semicolon
- must not occupy more than one line of text
- must not contain a semicolon inside the statement
- example:
#mega
!Title This is an example title;
- third line: Description statement: more descriptive multi-line account of the data.
- always begins with !Description and ends with a semicolon
- may occupy multiple lines
- must not contain a semicolon inside the statement
- example:
#mega
!Title This is an example title;
!Description This is detailed information about the data file;
- Format statement: includes information on the type of data present in the file and some of its attributes.
- written after the Title or the Description statement
- contains one or more command statements
- A command statement contains a command and a valid setting keyword (command=keyword format)
- Comments:
- anywhere in the data file
- can span multiple lines
- enclosed in square brackets ([ and ])
- can be nested
- keywords:
- written in any combination of lower- and upper-case letters
- Taxa Names:
- ‘#’ Sign: Every label must be written on a new line, and a '#' sign must precede the label
- no restrictions on the length of the labels
- not required to be unique (although identical labels may result in ambiguities and should be avoided)
- must start with an alphanumeric character (0-9, a-z, or A-Z) or one of the special characters -, + or .
- After the first character, taxa labels may contain the following additional special characters: _, *, :, ( ), |, \, /
- For multiple word labels, an underscore can be used to represent a blank space
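The comment and taxa-label rules above can be sketched in Python. This is only an illustration of the rules as listed on this page, not MEGA's own parser; the names strip_comments, is_valid_label and has_mega_keyword are hypothetical, and the label check applies to the taxa name without any group suffix.

```python
import re


def strip_comments(text: str) -> str:
    """Remove [ ... ] comments, which may be nested and span several lines."""
    kept = []
    depth = 0  # current nesting level of square brackets
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            if depth == 0:
                raise ValueError("unmatched ']' in data file")
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    if depth != 0:
        raise ValueError("unclosed '[' in data file")
    return "".join(kept)


# First character: alphanumeric or one of - + .
# Later characters: additionally _ * : ( ) | \ /
LABEL_RE = re.compile(r"[0-9A-Za-z+.\-][0-9A-Za-z+.\-_*:()|\\/]*")


def is_valid_label(label: str) -> bool:
    """Check a taxa label (without the leading '#') against the character rules."""
    return LABEL_RE.fullmatch(label) is not None


def has_mega_keyword(text: str) -> bool:
    """True if the first non-empty line, after comment removal, is #MEGA in any case."""
    lines = [ln.strip() for ln in strip_comments(text).splitlines() if ln.strip()]
    return bool(lines) and lines[0].lower() == "#mega"
```

For instance, is_valid_label("Homo_sapiens") holds, while a label starting with an underscore is rejected because the underscore is only allowed after the first character.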
Sequence Input Data
- must consist of two or more sequences of equal length
- sequences must be aligned
- written in IUPAC single-letter codes
- Sequences can be written in any combination of upper- and lower-case letters
- spaces and tabs are ignored
- commonly used special symbols: period (.) → identical sites, dash (-) → alignment gaps, question mark (?) → missing data
- Keywords for Format Statement:
| Command | Setting | Remark | Example |
|---|---|---|---|
| DataType | DNA, RNA, nucleotide, protein | Specifies the type of data in the file | DataType=DNA |
| NSeqs | integer | Number of sequences | NSeqs=85 |
| NTaxa | integer | Synonymous with NSeqs | NTaxa=85 |
| NSites | integer | Number of nucleotides | NSites=4592 |
| Property | Exon, Intron, Coding, Noncoding, and End | Specifies whether a domain is protein coding. Exon and Coding are synonymous, as are Intron and Noncoding. End specifies that the domain with the given name ends at this point | Property=Coding |
| Indel | single character | dash (-) to identify insertion/deletions | Indel = - |
| Identical | single character | use period (.) to show identity with the first sequence | Identical = . |
| MatchChar | single character | Synonymous with the Identical keyword | MatchChar = . |
| Missing | single character | use question mark (?) to indicate missing data | Missing = ? |
| CodeTable | A name | This instruction gives the name of the code table for the protein coding domains of the data | CodeTable = Standard |
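As a rough illustration of the command=keyword format, the following Python sketch splits a Format statement into its commands and settings. It assumes the statement starts with !Format and ends with a semicolon, as in the distance example on this page; parse_format is a hypothetical helper, not part of MEGA.

```python
import re


def parse_format(statement: str) -> dict:
    """Parse '!Format DataType=DNA Indel = - ...;' into {command: setting}.

    Commands are lower-cased because keywords may be written in any case.
    """
    body = statement.strip()
    if not body.lower().startswith("!format"):
        raise ValueError("not a Format statement")
    if not body.endswith(";"):
        raise ValueError("Format statement must end with a semicolon")
    body = body[len("!format"):-1]  # drop the keyword and the closing ';'
    # Spaces around '=' are tolerated, as in 'Indel = -'.
    pairs = re.findall(r"(\w+)\s*=\s*(\S+)", body)
    return {command.lower(): setting for command, setting in pairs}


settings = parse_format("!Format DataType=DNA NSeqs=85 Indel = - Missing = ?;")
print(settings)
```

Settings are kept as strings; converting NSeqs or NSites to integers is left to the caller.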
- Defining Genes and Domains:
- attributes of different sites (and groups of sites, termed domains) are specified within the data “on the spot” rather than in an attributes block before or after the actual data.
| Command | Setting | Remark | Example |
|---|---|---|---|
| Domain | A name | defines a domain with the given name | Domain=first_exon |
| Gene | A name | defines a gene with the given name | Gene=cytb |
| Property | Exon, Intron, Coding, Noncoding, and End | specifies the protein-coding attribute for a domain | Property=Coding |
| CodonStart | A number | specifies the site where the next 1st-codon position will be found in a protein-coding domain | CodonStart=2 |
- Defining Groups:
- Taxa can be assigned to groups in sequence data files as well as in distance data files.
- The name of the group is written in a set of curly brackets ({}) following the taxa name. The group name can be attached to the taxa name with an underscore or simply appended.
- there should be no spaces between the taxa name and group name
- Labelling Individual Sites:
- The individual sites in nucleotide or amino acid data can be labeled to construct non-contiguous sets of sites.
- Each site can be associated with only one label
- A label can be a letter or a number.
example
!Gene=FirstGene Domain=Exon1 Property=Coding;
#Human_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCTCAAT
#Mouse_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCCCAAT
#Chicken_{Aves} ATGGTTTCTAGTCAGCTCACCATGATAGGTCTCAAT
!Gene=SecondGene Domain=Intron Property=Noncoding;
#Human ATTCCCAGGGAATTCCCGGGGGGTTTAAGGCCCCTTTAAAGAAAGAT
#Mouse GTAGCGCGCGTCGTCAGAGCTCCCAAGGGTAGCAGTCACAGAAAGAT
#Chicken GTAAAAAAAAAAGTCAGAGCTCCCCCCAATATATATCACAGAAAGAT
!Gene=ThirdGene Domain=Exon2 Property=Coding;
#Human ATCTGCTCTCGAGTACTGATACAAATGACTTCTGCGTACAACTGA
#Mouse ATCTGATCTCGTGTGCTGGTACGAATGATTTCTGCGTTCAACTGA
#Chicken ATCTGCTCTCGAGTACTGCTACCAATGACTTCTGCGTACAACTGA
!Label +++__-+++-a-+++-L-+++-k-+++123+++-_-+++---+++;
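A block layout like the example above could be read along the following lines. This Python sketch is an assumption-laden illustration: it concatenates the blocks of each taxon, picks up group names from the curly brackets, skips every line starting with '!' (including !Label), and uses shortened sequences. The names split_label and parse_sequences are hypothetical.

```python
import re

# Taxa name, an optional joining underscore, then the group in curly brackets.
GROUP_RE = re.compile(r"(.*?)_?\{([^{}]*)\}")


def split_label(label):
    """Split 'Human_{Mammal}' or 'Human{Mammal}' into ('Human', 'Mammal')."""
    match = GROUP_RE.fullmatch(label)
    if match:
        return match.group(1), match.group(2)
    return label, None


def parse_sequences(lines):
    """Return ({taxon: sequence}, {taxon: group}) from '#label residues' lines."""
    seqs, groups = {}, {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("#") or line.lower() == "#mega":
            continue  # skip '!' statements, blank lines and the #mega keyword
        parts = line[1:].split(None, 1)
        if not parts:
            continue  # a bare '#'
        name, group = split_label(parts[0])
        # Spaces and tabs inside the sequence are ignored.
        residues = "".join(parts[1].split()) if len(parts) > 1 else ""
        seqs[name] = seqs.get(name, "") + residues
        if group is not None:
            groups[name] = group
    if len({len(s) for s in seqs.values()}) > 1:
        raise ValueError("sequences are not of equal length")
    return seqs, groups


example = """\
!Gene=FirstGene Domain=Exon1 Property=Coding;
#Human_{Mammal} ATGGTT
#Chicken_{Aves} ATGCTT
!Gene=SecondGene Domain=Intron Property=Noncoding;
#Human ATTC
#Chicken GTAA
"""
seqs, groups = parse_sequences(example.splitlines())
```

The gene and domain statements are ignored here; a fuller reader would record which sites each !Gene and Domain statement covers.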
Distance Input Data
- Distances are written as either a lower-left or an upper-right triangular matrix
- After writing the #mega, !Title, !Description, and !Format commands (some of which are optional), write all the taxa names (see below)
- Taxa names are followed by the distance matrix
- Keywords for Format Statement:
| Command | Setting | Remark | Example |
|---|---|---|---|
| DataType | Distance | Specifies that the distance data is in the file | DataType=distance |
| NSeqs | integer | Number of sequences | NSeqs=85 |
| NTaxa | integer | Same as NSeqs | NTaxa=85 |
| DataFormat | Lowerleft, upperright | Specifies whether the data is in lower left triangular matrix or the upper right triangular matrix | DataFormat=lowerleft |
- Defining Groups:
- see above
example
#mega
!Title: Concatenated Files;
!Format DataType=Distance DataFormat=LowerLeft NTaxa=6;
#Rodent
#Primate
#Lagomorpha
#Artiodactyla
#Carnivora
#Perissodactyla
0.514
0.535 0.436
0.530 0.388 0.418
0.521 0.353 0.417 0.345
0.500 0.331 0.402 0.327 0.349
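To show how the lower-left layout maps onto pairwise distances, the sketch below expands the rows of the example above into a full symmetric matrix. It assumes, as in that example, that the diagonal is omitted, so row i holds the distances from taxon i+1 to all earlier taxa. The function name lower_left_to_full is hypothetical.

```python
def lower_left_to_full(rows, n_taxa):
    """Expand lower-left triangular rows (no diagonal) into an n x n matrix."""
    if len(rows) != n_taxa - 1:
        raise ValueError("expected %d rows, got %d" % (n_taxa - 1, len(rows)))
    full = [[0.0] * n_taxa for _ in range(n_taxa)]
    for i, row in enumerate(rows, start=1):
        if len(row) != i:
            raise ValueError("row %d should hold %d distances" % (i, i))
        for j, distance in enumerate(row):
            full[i][j] = full[j][i] = distance  # distances are symmetric
    return full


taxa = ["Rodent", "Primate", "Lagomorpha", "Artiodactyla", "Carnivora", "Perissodactyla"]
text = """\
0.514
0.535 0.436
0.530 0.388 0.418
0.521 0.353 0.417 0.345
0.500 0.331 0.402 0.327 0.349
"""
rows = [[float(x) for x in line.split()] for line in text.splitlines() if line.strip()]
matrix = lower_left_to_full(rows, len(taxa))
print(matrix[taxa.index("Artiodactyla")][taxa.index("Primate")])
```

An upper-right matrix could be handled the same way by filling row i with the distances to the taxa that follow it.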
How to cite
Citation for MEGA 5:
- Tamura K, Peterson D, Peterson N, Stecher G, Nei M, and Kumar S (2011) MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Molecular Biology and Evolution doi: 10.1093/molbev/msr121.
Citation for MEGA 4:
- Tamura K, Dudley J, Nei M & Kumar S (2007) MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596-1599.
mega.txt · Last modified: 2011/07/07 11:50 by heidi
