MEGA

MEGA documentation

Version 5 (April 24, 2011)
MEGA is an integrated tool for conducting automatic and manual sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, and testing evolutionary hypotheses.

Program information

Windows XP, Vista, 7 (with at least 64 MB of RAM, 20 MB of available hard disk space)
MEGA can also be run on other operating systems for which Windows emulators are available:
- Macintosh: Windows using VirtualPC
- Sun Workstation: SoftWindows95
- Linux: Windows using VMWare

Data type handled

DNA
RNA
nucleotide
distance
(protein sequences)

Input Files

ASCII-text files
extension: *.MEG
Importing Data from Other Formats:
- CLUSTAL
- NEXUS
- PHYLIP (Interleaved/Noninterleaved)
- GCG
- FASTA
- PIR
- NBRF
- MSF
- IG
- Internet (NCBI) XML format

Common Features

first line: must contain the keyword #MEGA
second line: the data file may contain a succinct description of the data (called the Title). The Title statement follows these rules:
- always begins with !Title and ends with a semicolon
- must not occupy more than one line of text
- must not contain a semicolon inside the statement
- example:
```
#mega
!Title This is an example title;
```
third line: the Description statement gives a more detailed, multi-line account of the data.
- always begins with !Description and ends with a semicolon
- may occupy multiple lines
- must not contain a semicolon inside the statement
- example:
```
#mega
!Title This is an example title;
!Description This is detailed information about the data file;
```
Format statement: includes information on the type of data present in the file and some of its attributes.
- written after the Title or the Description statement
- contains one or more command statements
- A command statement contains a command and a valid setting keyword (command=keyword format)
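
The header rules above can be checked mechanically. The following is a minimal Python sketch, not part of MEGA itself: the function name `parse_header` is hypothetical, and it assumes every statement starts with `!` plus a keyword followed by whitespace and ends at the next semicolon.

```python
import re


def parse_header(text):
    """Return (title, description) from the start of a MEGA-format file.

    Hypothetical helper for illustration; either value is None when the
    corresponding statement is absent.
    """
    lines = text.strip().splitlines()
    # Rule: the first line must contain the keyword #MEGA (any letter case).
    if not lines or lines[0].strip().lower() != "#mega":
        raise ValueError("first line must contain the keyword #MEGA")

    body = "\n".join(lines[1:])
    found = {}
    # A statement runs from '!Keyword' to the next ';'. Because the text may
    # not contain ';' itself, [^;]* captures it fully, even across lines.
    for match in re.finditer(r"!(\w+)\s+([^;]*);", body):
        found[match.group(1).lower()] = match.group(2).strip()

    title = found.get("title")
    # Rule: the Title statement must not occupy more than one line.
    if title is not None and "\n" in title:
        raise ValueError("!Title must not occupy more than one line")

    description = found.get("description")
    if description is not None:
        # The Description may span lines; join them with single spaces.
        description = " ".join(description.split())
    return title, description
```

Called on the two-statement example above, the function returns the title and the description as plain strings with the `!Title` and `!Description` keywords and the closing semicolons removed.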

Comments:
- anywhere in the data file
- can span multiple lines
- enclosed in square brackets ([ and ])
- can be nested
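
Because comments can be nested, a simple search for the next closing bracket is not enough; a depth counter is needed. A minimal Python sketch follows (the function name `strip_comments` is hypothetical, and treating unbalanced brackets as an error is an assumption of this sketch):

```python
def strip_comments(text):
    """Remove [bracketed] comments, including nested and multi-line ones."""
    depth = 0   # how many '[' are currently open
    kept = []
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            if depth == 0:
                raise ValueError("']' without a matching '['")
            depth -= 1
        elif depth == 0:
            # Only characters outside every comment are kept.
            kept.append(ch)
    if depth != 0:
        raise ValueError("'[' without a matching ']'")
    return "".join(kept)
```
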
keywords:
- written in any combination of lower- and upper-case letters
Taxa Names:
- ‘#’ Sign: Every label must be written on a new line, and a '#' sign must precede the label
- no restrictions on the length of the labels
- not required to be unique (although identical labels may result in ambiguities and should be avoided)
- must start with alphanumeric characters (0-9, a-z, and A-Z) or a special character: -, + or .
- After the first character, taxa labels may contain the following additional special characters: _, *, :, ( ), |, \, /
- For multiple word labels, an underscore can be used to represent a blank space
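
The character rules for taxa labels translate directly into a regular expression. The sketch below is illustrative only: `LABEL_RE` and `read_label` are hypothetical names, the input is assumed to be a line holding just `#` and the label, and group names in curly brackets are not handled here.

```python
import re

# First character: alphanumeric or one of - + .
# Later characters: the same set plus _ * : ( ) | \ /
LABEL_RE = re.compile(r"[0-9A-Za-z+.\-][0-9A-Za-z+.\-_*:()|\\/]*")


def read_label(line):
    """Validate a '#label' line and return the label with '_' shown as a space."""
    if not line.startswith("#"):
        raise ValueError("a '#' sign must precede the label")
    label = line[1:].strip()
    if not LABEL_RE.fullmatch(label):
        raise ValueError(f"invalid taxa label: {label!r}")
    # An underscore stands for a blank space in multiple-word labels.
    return label.replace("_", " ")
```
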

Sequence Input Data

must consist of two or more sequences of equal length
sequences must be aligned
written in IUPAC single-letter codes
Sequences can be written in any combination of upper- and lower-case letters
spaces and tabs are ignored
commonly used special symbols: period (.) → identical sites, dash (-) → alignment gaps, question mark (?) → missing data
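
As an illustration of these rules, the sketch below removes spaces and tabs, checks that all sequences have equal length, and replaces each identical-site symbol with the residue of the first sequence. The function name `expand_identical` is hypothetical; gap and missing-data symbols are left untouched.

```python
def expand_identical(sequences, identical="."):
    """Return the sequences with identical-site symbols written out."""
    if len(sequences) < 2:
        raise ValueError("two or more sequences are required")
    # Spaces and tabs are ignored in sequence data.
    cleaned = [s.replace(" ", "").replace("\t", "") for s in sequences]
    reference = cleaned[0]
    expanded = [reference]
    for seq in cleaned[1:]:
        if len(seq) != len(reference):
            raise ValueError("sequences must be of equal length")
        # '.' means "same as the first sequence at this site".
        expanded.append(
            "".join(ref if ch == identical else ch
                    for ref, ch in zip(reference, seq))
        )
    return expanded
```
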

Keywords for Format Statement:

Command	Setting	Remark	Example
DataType	DNA, RNA, nucleotide, protein		DataType=DNA
NSeqs	integer	Number of sequences	NSeqs=85
NTaxa	integer	Synonymous with NSeqs	NTaxa=85
NSites	integer	Number of nucleotides	NSites=4592
Property	Exon, Intron, Coding, Noncoding, and End	Specifies whether a domain is protein coding. Exon and Coding are synonymous, as are Intron and Noncoding. End specifies that the domain with the given name ends at this point	Property=Coding
Indel	single character	use dash (-) to identify insertions/deletions	Indel = -
Identical	single character	use period (.) to show identity with the first sequence	Identical = .
MatchChar	single character	Synonymous with the Identical keyword	MatchChar = .
Missing	single character	use question mark (?) to indicate missing data	Missing = ?
CodeTable	A name	This instruction gives the name of the code table for the protein coding domains of the data	CodeTable = Standard
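
A Format statement built from the commands in this table can be split into command/setting pairs. The following is a minimal sketch: `parse_format` is a hypothetical name, and it assumes the statement starts with `!Format`, ends with a semicolon, and that every setting is a single token without spaces.

```python
import re


def parse_format(statement):
    """Split a '!Format ...;' statement into a {command: setting} dict."""
    text = statement.strip()
    if not text.lower().startswith("!format") or not text.endswith(";"):
        raise ValueError("expected a statement of the form '!Format ...;'")
    body = text[len("!format"):-1]
    settings = {}
    # Spaces around '=' are optional, as in 'Indel = -' and 'DataType=DNA'.
    for command, keyword in re.findall(r"(\w+)\s*=\s*(\S+)", body):
        # Keywords may be written in any letter case, so keys are lowered.
        settings[command.lower()] = keyword
    return settings
```

Settings are returned as strings; converting `NSeqs` or `NSites` to integers is left to the caller.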

Defining Genes and Domains:
- attributes of different sites (and groups of sites, termed domains) are specified within the data “on the spot” rather than in an attributes block before or after the actual data.

Command	Setting	Remark	Example
Domain	A name	defines a domain with the given name	Domain=first_exon
Gene	A name	defines a gene with the given name	Gene=cytb
Property	Exon, Intron, Coding, Noncoding, and End	specifies the protein-coding attribute for a domain	Property=Coding
CodonStart	A number	specifies the site where the next 1st-codon position will be found in a protein-coding domain	CodonStart=2

Defining Groups:
- taxa can be assigned to groups in sequence data files as well as in distance data files.
- the name of the group is written in a set of curly brackets ({}) following the taxa name. The group name can be attached to the taxa name with an underscore or appended directly.
- there should be no spaces between the taxa name and group name
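
Separating the group name from the taxa name can be sketched with one regular expression. `split_group` is a hypothetical helper; it assumes the group, when present, is the last part of the name.

```python
import re


def split_group(name):
    """Return (taxon, group); group is None when no {group} is attached."""
    # The group may be joined with an underscore ('Human_{Mammal}')
    # or appended directly ('Chicken{Aves}').
    match = re.fullmatch(r"(.*?)_?\{([^{}]*)\}", name)
    if match is None:
        return name, None
    return match.group(1), match.group(2)
```
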

Labelling Individual Sites:
- The individual sites in nucleotide or amino acid data can be labeled to construct non-contiguous sets of sites.
- Each site can be associated with only one label
- A label can be a letter or a number.

example

!Gene=FirstGene Domain=Exon1 Property=Coding;
#Human_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCTCAAT
#Mouse_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCCCAAT
#Chicken_{Aves} ATGGTTTCTAGTCAGCTCACCATGATAGGTCTCAAT

!Gene=SecondGene Domain=Intron Property=Noncoding;
#Human ATTCCCAGGGAATTCCCGGGGGGTTTAAGGCCCCTTTAAAGAAAGAT
#Mouse GTAGCGCGCGTCGTCAGAGCTCCCAAGGGTAGCAGTCACAGAAAGAT
#Chicken GTAAAAAAAAAAGTCAGAGCTCCCCCCAATATATATCACAGAAAGAT

!Gene=ThirdGene Domain=Exon2 Property=Coding;
#Human ATCTGCTCTCGAGTACTGATACAAATGACTTCTGCGTACAACTGA
#Mouse ATCTGATCTCGTGTGCTGGTACGAATGATTTCTGCGTTCAACTGA
#Chicken ATCTGCTCTCGAGTACTGCTACCAATGACTTCTGCGTACAACTGA
!Label +++__-+++-a-+++-L-+++-k-+++123+++-_-+++---+++;
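
To show how a !Label line defines non-contiguous sets of sites, the sketch below collects the site positions that share each label character. `sites_by_label` is a hypothetical name, positions are counted from 1 within the block of sequences that the label line follows (an assumption), and every character is grouped as written without assigning any special meaning to `+`, `-`, or `_`.

```python
from collections import defaultdict


def sites_by_label(label_statement):
    """Map each label character to the 1-based site positions carrying it."""
    text = label_statement.strip()
    if not text.lower().startswith("!label") or not text.endswith(";"):
        raise ValueError("expected a statement of the form '!Label ...;'")
    # Drop the keyword, the closing ';', and any whitespace.
    marks = "".join(text[len("!label"):-1].split())
    groups = defaultdict(list)
    for site, mark in enumerate(marks, start=1):
        # Each site carries exactly one label character.
        groups[mark].append(site)
    return dict(groups)
```

For the label line of the third gene above, this yields one entry per distinct character; the sites marked `a`, `L`, and `k` each form a set of their own.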

Distance Input Data

distances are written as a lower-left or an upper-right triangular matrix
After the #mega, !Title, !Description, and !Format statements (some of which are optional), write all the taxa names (see the example below)
The taxa names are followed by the distance matrix

Keywords for Format Statement:

Command	Setting	Remark	Example
DataType	Distance	Specifies that the file contains distance data	DataType=distance
NSeqs	integer	Number of sequences	NSeqs=85
NTaxa	integer	Same as NSeqs	NTaxa=85
DataFormat	LowerLeft, UpperRight	Specifies whether the data is in a lower-left or an upper-right triangular matrix	DataFormat=lowerleft

Defining Groups:
- see above

example

#mega
!Title: Concatenated Files;
!Format DataType=Distance DataFormat=LowerLeft NTaxa=6;

#Rodent
#Primate
#Lagomorpha
#Artiodactyla
#Carnivora
#Perissodactyla
      
0.514       
0.535 0.436       
0.530 0.388 0.418       
0.521 0.353 0.417 0.345       
0.500 0.331 0.402 0.327 0.349
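
A lower-left matrix like the one above can be expanded into a full symmetric matrix. This is a minimal sketch, not MEGA's own reader: `read_lower_left` is a hypothetical name, and it assumes one row per taxon with row i holding i distances, so the first row is empty as in the example.

```python
def read_lower_left(rows, n_taxa):
    """Build a full symmetric distance matrix from lower-left rows."""
    if len(rows) != n_taxa:
        raise ValueError("expected one row per taxon")
    # The diagonal stays 0.0: the distance of a taxon to itself.
    matrix = [[0.0] * n_taxa for _ in range(n_taxa)]
    for i, row in enumerate(rows):
        values = [float(v) for v in row.split()]
        if len(values) != i:
            raise ValueError(f"row {i + 1} should hold {i} distances")
        for j, value in enumerate(values):
            # Mirror each distance across the diagonal.
            matrix[i][j] = matrix[j][i] = value
    return matrix
```

With the six rows of the example, `matrix[1][0]` and `matrix[0][1]` both hold the Rodent to Primate distance, 0.514.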

How to cite

Citation for MEGA 5:

Tamura K, Peterson D, Peterson N, Stecher G, Nei M, and Kumar S (2011) MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Molecular Biology and Evolution doi: 10.1093/molbev/msr121.

Citation for MEGA 4:

Tamura K, Dudley J, Nei M & Kumar S (2007) MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596-1599.

Master's thesis, Heidi Lischer
