PHYLIP

Version 3.69 (September 2009)
PHYLIP, the Phylogeny Inference Package, is a package of programs for inferring phylogenies (evolutionary trees). It can infer phylogenies by parsimony, compatibility, distance matrix methods, and likelihood. It can also compute consensus trees, compute distances between trees, draw trees, resample data sets by bootstrapping or jackknifing, edit trees, and compute distance matrices.

Program information

written in C
Windows
Mac OS X
Mac OS 9
UNIX
Linux

Data types handled

nucleotide sequences
protein sequences
gene frequencies
restriction sites
restriction fragments
distances
discrete characters
continuous characters

Input Files

nucleotide sequences

For most of the PHYLIP programs, information comes from a series of input files, and ends up in a series of output files:

                   -------------------
                  |                   |
infile ---------> |                   |
                  |                   |
intree ---------> |                   | -----------> outfile
                  |                   |
weights --------> |      program      | -----------> outtree
                  |                   |
categories -----> |                   | -----------> plotfile
                  |                   |
fontfile -------> |                   |
                  |                   |
                   -------------------

Input data such as DNA sequences comes from a file whose default name is infile. If the user supplies a tree, it is read from a file whose default name is intree. Values of weights for the characters are in weights, and the tree-plotting programs need some digitized fonts, which are supplied in fontfile (all of these are default names).

first line: the number of species and the number of characters. These are in free format, separated by blanks.
next lines: information for each species, starting with a ten-character species name (which can include blanks and some punctuation marks) and continuing with the characters for that species. The name must be exactly ten characters long, padded with blanks if shorter, and should be on the same line as the first character of the data for that species.
In the discrete-character programs, DNA sequence programs and protein sequence programs the characters are each a single letter or digit, sometimes separated by blanks. In the continuous-characters programs they are real numbers with decimal points, separated by blanks:
```
Latimeria 2.03 3.457 100.2 0.0 -3.7
```
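As a rough illustration of the two rules above (a free-format header line and a fixed ten-character name field), the following Python sketch splits a header line and a continuous-character line. The function names `parse_header` and `parse_continuous_line` are made up for this example and are not part of PHYLIP.

```python
NAME_WIDTH = 10  # PHYLIP species names occupy exactly ten characters


def parse_header(line):
    """Return (number of species, number of characters) from the first line."""
    n_species, n_chars = line.split()[:2]
    return int(n_species), int(n_chars)


def parse_continuous_line(line):
    """Split one species line into its name and its real-valued characters."""
    name = line[:NAME_WIDTH].strip()
    values = [float(token) for token in line[NAME_WIDTH:].split()]
    return name, values


print(parse_header("    6   39"))
print(parse_continuous_line("Latimeria 2.03 3.457 100.2 0.0 -3.7"))
```

The name is cut by position rather than by splitting on blanks, because names may themselves contain blanks.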
The conventions about continuing the data beyond one line per species are different between the molecular sequence programs and the others. The molecular sequence programs can take the data in “aligned” or “interleaved” format, in which we first have some lines giving the first part of each of the sequences, then some lines giving the next part of each, and so on. Thus the sequences might look like this:
```
    6   39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC
```
extension: .txt
Blanks within sequences are allowed to make them easier to read, and digits within the sequence are ignored.
It is important that the number of sites in each group be the same for all species.
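A minimal sketch of how an interleaved molecular-sequence file could be read, assuming that every block contains exactly one line per species and that blank lines only separate blocks. The names `clean` and `parse_interleaved` are hypothetical helpers for this illustration, not PHYLIP routines.

```python
def clean(chunk):
    """Drop blanks and digits, which are ignored inside molecular sequences."""
    return "".join(ch for ch in chunk if not ch.isspace() and not ch.isdigit())


def parse_interleaved(text, name_width=10):
    """Return a list of (name, sequence) pairs from interleaved PHYLIP text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = (int(tok) for tok in lines[0].split()[:2])
    body = lines[1:]
    if len(body) % n_species != 0:
        raise ValueError("every block must contain one line per species")
    names = []
    seqs = [""] * n_species
    for i, line in enumerate(body):
        if i < n_species:  # first block: the line starts with the name field
            names.append(line[:name_width].strip())
            line = line[name_width:]
        seqs[i % n_species] += clean(line)
    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
    return list(zip(names, seqs))
```

Checking each assembled sequence against the site count in the header catches blocks in which one species has a different number of sites from the others.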
In the sequential format, the character data can run on to a new line at any time. Thus it is legal to have:
```
Archaeopt 001100 
1101 
```
or even:
```
Archaeopt 
0011001101 
```
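The run-on rule can be sketched as a reader that keeps consuming lines until each species has the number of characters announced in the header. This is an illustrative Python sketch, not PHYLIP's own logic: `parse_sequential` is a hypothetical name, and digits are kept because they are data in discrete-character files (for molecular sequences they would be dropped instead).

```python
def parse_sequential(text, name_width=10):
    """Return a list of (name, characters) pairs from sequential PHYLIP text."""
    lines = text.splitlines()
    n_species, n_chars = (int(tok) for tok in lines[0].split()[:2])
    records = []
    pos = 1
    for _ in range(n_species):
        while pos < len(lines) and not lines[pos].strip():
            pos += 1  # skip blank lines between species
        if pos >= len(lines):
            raise ValueError("fewer species than the header declares")
        name = lines[pos][:name_width].strip()
        data = "".join(lines[pos][name_width:].split())
        pos += 1
        while len(data) < n_chars:  # data may continue on following lines
            if pos >= len(lines):
                raise ValueError(f"{name}: data ends before {n_chars} characters")
            data += "".join(lines[pos].split())
            pos += 1
        if len(data) != n_chars:
            raise ValueError(f"{name}: more than {n_chars} characters")
        records.append((name, data))
    return records


# The header line "1 10" is an assumption added to make the fragment complete.
print(parse_sequential("1 10\nArchaeopt 001100 \n1101 \n"))
```

Both layouts shown above yield the same record, because the reader counts characters rather than lines.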

example:

the simplest version of the input data file looks something like this:

   6   13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC

example of interleaved format:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

in sequential format the same sequences would be:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA
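Going the other way, a set of aligned sequences could be written out in interleaved format along the following lines. This is a sketch under assumptions: `write_interleaved` and its parameters are invented for illustration, and it breaks every block at 20 sites, so its line breaks need not match the example above (any split is acceptable as long as each block holds the same number of sites for all species).

```python
def write_interleaved(records, sites_per_line=20, group=10, name_width=10):
    """Format (name, sequence) pairs as interleaved PHYLIP text."""
    if not records:
        raise ValueError("no sequences to write")
    n_sites = len(records[0][1])
    if any(len(seq) != n_sites for _, seq in records):
        raise ValueError("all sequences must have the same length")
    out = [f"{len(records):5d} {n_sites:5d}"]
    for start in range(0, n_sites, sites_per_line):
        if start:
            out.append("")  # blank line between blocks
        for name, seq in records:
            chunk = seq[start:start + sites_per_line]
            grouped = " ".join(chunk[i:i + group] for i in range(0, len(chunk), group))
            # Only the first block carries the name, cut or padded to ten characters.
            prefix = name[:name_width].ljust(name_width) if start == 0 else ""
            out.append(prefix + grouped)
    return "\n".join(out) + "\n"


print(write_interleaved([("Turkey", "AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT")]))
```

Names longer than ten characters are truncated here, so the caller has to make sure the truncated names remain distinct.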

Distance Matrix

The first line of the input file contains the number of species.
There follows species data, starting with a species name.
- The species name is ten characters long, and must be padded out with blanks if shorter.
- For each species there then follows a set of distances to all the other species (program options allow the distance matrix to be upper-triangular, lower-triangular, or square). The distances can continue onto a new line after any of them. If the matrix is lower-triangular, the programs do not expect the diagonal entries (the distances from a species to itself). If they are included anyway, they are ignored, except when one of them starts a new line, in which case the program will mistake it for a species name and misread the rest of the file.

examples:

sample input matrix, with a square matrix:

     5
Alpha      0.000 1.000 2.000 3.000 3.000
Beta       1.000 0.000 2.000 3.000 3.000
Gamma      2.000 2.000 0.000 3.000 3.000
Delta      3.000 3.000 3.000 0.000 1.000
Epsilon    3.000 3.000 3.000 1.000 0.000

sample lower-triangular input matrix with distances continuing to new lines as needed:

   14
Mouse     
Bovine      1.7043
Lemur       2.0235  1.1901
Tarsier     2.1378  1.3287  1.2905
Squir Monk  1.5232  1.2423  1.3199  1.7878
Jpn Macaq   1.8261  1.2508  1.3887  1.3137  1.0642
Rhesus Mac  1.9182  1.2536  1.4658  1.3788  1.1124  0.1022
Crab-E.Mac  2.0039  1.3066  1.4826  1.3826  0.9832  0.2061  0.2681
BarbMacaq   1.9431  1.2827  1.4502  1.4543  1.0629  0.3895  0.3930  0.3665
Gibbon      1.9663  1.3296  1.8708  1.6683  0.9228  0.8035  0.7109  0.8132
  0.7858
Orang       2.0593  1.2005  1.5356  1.6606  1.0681  0.7239  0.7290  0.7894
  0.7140  0.7095
Gorilla     1.6664  1.3460  1.4577  1.5935  0.9127  0.7278  0.7412  0.8763
  0.7966  0.5959  0.4604
Chimp       1.7320  1.3757  1.7803  1.7119  1.0635  0.7899  0.8742  0.8868
  0.8288  0.6213  0.5065  0.3502
Human       1.7101  1.3956  1.6661  1.7599  1.0557  0.6933  0.7118  0.7589
  0.8542  0.5612  0.4700  0.3097  0.2712
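The two layouts shown above could be read with a routine like the one below. It is a simplified sketch, not PHYLIP's reader: `parse_distances` and its `shape` argument are hypothetical, upper-triangular input is not handled, and a lower-triangular file is assumed to omit the diagonal entries.

```python
def parse_distances(text, shape="square", name_width=10):
    """Return (names, full symmetric matrix) from a PHYLIP distance file.

    shape is "square" or "lower" (lower-triangular without the diagonal).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    names, rows = [], []
    pos = 1
    for i in range(n):
        need = n if shape == "square" else i  # values expected for species i
        if pos >= len(lines):
            raise ValueError("fewer species than the header declares")
        names.append(lines[pos][:name_width].strip())
        vals = [float(tok) for tok in lines[pos][name_width:].split()]
        pos += 1
        while len(vals) < need:  # distances may continue on following lines
            if pos >= len(lines):
                raise ValueError(f"{names[-1]}: too few distances")
            vals += [float(tok) for tok in lines[pos].split()]
            pos += 1
        if len(vals) != need:
            raise ValueError(f"{names[-1]}: expected {need} distances")
        rows.append(vals)
    if shape == "lower":
        # Mirror the triangle to obtain the full matrix; the diagonal stays 0.
        full = [[0.0] * n for _ in range(n)]
        for i, row in enumerate(rows):
            for j, dist in enumerate(row):
                full[i][j] = full[j][i] = dist
        rows = full
    return names, rows
```

Because the reader counts values per species, a continuation line is never taken for a new species as long as the number of distances per row is correct.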

How to cite

Felsenstein, J. 2004. PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle.

Or if the editor for whom you are writing insists that the citation must be to a printed publication, you could cite a notice for version 3.2 published in Cladistics:
Felsenstein, J. 1989. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5: 164-166.

Master's thesis, Heidi Lischer
