Differences

This shows you the differences between two versions of the page.

--- phylip [2008/04/23 09:11] – heidi
+++ phylip [2011/07/07 12:49] (current) – heidi
@@ Line 6: / Line 6: @@
 \\
-Version 3.67 (July, 2007)\\
+Version 3.69 (September 2009)\\
 PHYLIP, the Phylogeny Inference Package, is a package of programs for inferring phylogenies (evolutionary trees). It can infer phylogenies by parsimony, compatibility, distance matrix methods, and likelihood. It can also compute consensus trees, compute distances between trees, draw trees, resample data sets by bootstrapping or jackknifing, edit trees, and compute distance matrices.
@@ Line 33: / Line 33: @@
 ===== Input Files =====
 ==== nucleotide sequences ====
@@ Line 58: / Line 61: @@
   * next lines: information for each species, starting with a ten-character species name and continuing with the characters for that species. The name can include blanks and some punctuation marks; it should be ten characters in length, filled out to the full ten characters by blanks if shorter. It should be on the same line as the first character of the data for that species.
   * In the discrete-character programs, DNA sequence programs and protein sequence programs the characters are each a single letter or digit, sometimes separated by blanks. In the continuous-characters programs they are real numbers with decimal points, separated by blanks: <code>Latimeria 2.03 3.457 100.2 0.0 -3.7</code>
-  * The conventions about continuing the data beyond one line per species are different between the molecular sequence programs and the others. The molecular sequence programs can take the data in "aligned" or "interleaved" format, in which we first have some lines giving the first part of each of the sequences, then some lines giving the next part of each, and so on. Thus the sequences might look like this: <code>
+  * The conventions about continuing the data beyond one line per species are different between the molecular sequence programs and the others. The molecular sequence programs can take the data in "aligned" or **"interleaved" format**, in which we first have some lines giving the first part of each of the sequences, then some lines giving the next part of each, and so on. Thus the sequences might look like this: <code>
    39
 Archaeopt CGATGCTTAC CGCCGATGCT
@@ Line 74: / Line 77: @@
 AATCACGGCA GCCAATCAC
 </code>
-  * blanks within sequences are allowed to make them easier to read
+  * extension: .txt
+  * blanks within sequences are allowed to make them easier to read; digits in the sequence are also ignored
   * It is important that the number of sites in each group be the same for all species
-  * In the sequential format, the character data can run on to a new line at any time. Thus it is legal to have: <code>Archaeopt 001100
+  * In the **sequential format**, the character data can run on to a new line at any time. Thus it is legal to have: <code>Archaeopt 001100
 </code> or even: <code>
@@ Line 88: / Line 92: @@
 === example: ===
-For the parsimony, compatibility and maximum likelihood programs, excluding the distance matrix methods, the simplest version of the input data file looks something like this:
+  * the simplest version of the input data file looks something like this: <code>
-<code>
    13
 Archaeopt CGATGCTTAC CGC
@@ Line 99: / Line 102: @@
 </code>
-==== gene frequencies ====
+  * example of interleaved format: <code>
-  * the first line contains the number of species (or populations) and the number of loci and the options information
+    5   42
-  * follows a line which gives the numbers of alleles at each locus, in order. This must be the full number of alleles, not the number of alleles which will be input: i. e. for a two-allele locus the number should be 2, not 1
+Turkey    AAGCTNGGGC ATTTCAGGGT
-  * There then follow the species (population) data, each species beginning on a new line:
+Salmo gairAAGCCTTGGC AGTGCAGGGT
-    * first 10 characters are taken as the name
+H. SapiensACCGGTTGGC CGTTCAGGGT
-    * thereafter the values of the individual characters are read free-format, preceded and separated by blanks
+Chimp     AAACCCTTGC CGTTACGCTT
-  * Missing data is not allowed!
+Gorilla   AAACCCTTGC CGGTACGCTT
+GAGCCCGGGC AATACAGGGT AT
+GAGCCGTGGC CGGGCACGGT AT
+ACAGGTTGGC CGTTCAGGGT AA
+AAACCGAGGC CGGGACACTC AT
+AAACCATTGC CGGTACGCTT AA
+</code>
+  * in sequential format the same sequences would be: <code>
+    5   42
+Turkey    AAGCTNGGGC ATTTCAGGGT
+GAGCCCGGGC AATACAGGGT AT
+Salmo gairAAGCCTTGGC AGTGCAGGGT
+GAGCCGTGGC CGGGCACGGT AT
+H. SapiensACCGGTTGGC CGTTCAGGGT
+ACAGGTTGGC CGTTCAGGGT AA
+Chimp     AAACCCTTGC CGTTACGCTT
+AAACCGAGGC CGGGACACTC AT
+Gorilla   AAACCCTTGC CGGTACGCTT
+AAACCATTGC CGGTACGCTT AA
+</code>
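The two layouts above can be read with a few lines of Python. This is a minimal sketch, not PHYLIP's own reader: it assumes the header line carries the number of species followed by the number of sites, that names occupy exactly the first ten columns, and that blanks and digits inside the sequence data are skipped. The function names `parse_interleaved` and `parse_sequential` are hypothetical helpers introduced for illustration.

```python
INTERLEAVED = """\
  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT
GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
"""

SEQUENTIAL = """\
  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA
"""


def _clean(chunk):
    """Drop blanks and digits, which are ignored inside sequence data."""
    return "".join(ch for ch in chunk if not ch.isspace() and not ch.isdigit())


def _header_and_body(text):
    """Split the text into (n_species, n_sites, non-empty data lines)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = (int(tok) for tok in lines[0].split()[:2])
    return n_species, n_sites, lines[1:]


def parse_interleaved(text):
    """Return {name: sequence} for interleaved data."""
    n_species, n_sites, body = _header_and_body(text)
    names = [ln[:10].strip() for ln in body[:n_species]]
    parts = [[ln[10:]] for ln in body[:n_species]]
    # Later blocks carry data only, in the same species order.
    for i, ln in enumerate(body[n_species:]):
        parts[i % n_species].append(ln)
    seqs = {name: _clean("".join(p)) for name, p in zip(names, parts)}
    for name, seq in seqs.items():
        if len(seq) != n_sites:
            raise ValueError(f"{name!r}: expected {n_sites} sites, got {len(seq)}")
    return seqs


def parse_sequential(text):
    """Return {name: sequence} for sequential data."""
    n_species, n_sites, body = _header_and_body(text)
    lines = iter(body)
    seqs = {}
    for _ in range(n_species):
        first = next(lines)
        name, seq = first[:10].strip(), _clean(first[10:])
        # Keep reading continuation lines until all sites are collected.
        while len(seq) < n_sites:
            seq += _clean(next(lines))
        seqs[name] = seq
    return seqs
```

Both functions return the same dictionary for the two sample strings, which is a quick way to check that a file was converted between the layouts without losing sites.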
+==== Distance Matrix ====
+  * first line of the input file contains the number of species
+  * There follows species data, starting with a species name.
+    * species name is ten characters long, and must be padded out with blanks if shorter
+    * For each species there then follows a set of distances to all the other species (the distance matrix may be upper-triangular, lower-triangular, or square). The distances can continue to a new line after any of them. If the matrix is lower-triangular, the diagonal entries (the distances from a species to itself) will not be read by the programs. If they are included anyway, they will be ignored by the programs, except for the case where one of them starts a new line, in which case the program will mistake it for a species name and get very confused.
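The layout rules above can be illustrated by writing a matrix out. The following is a minimal sketch under stated assumptions: `format_lower_triangular` is a hypothetical helper, the input is a symmetric list-of-lists, the diagonal is omitted, and four decimal places are used purely as a formatting choice.

```python
def format_lower_triangular(names, dist):
    """Return lower-triangular distance-matrix text for a symmetric matrix.

    names: list of species names (at most ten characters each)
    dist:  square list-of-lists of floats, dist[i][j] == dist[j][i]
    """
    lines = [f"{len(names):5d}"]  # first line: number of species
    for i, name in enumerate(names):
        if len(name) > 10:
            raise ValueError(f"name longer than ten characters: {name!r}")
        field = name.ljust(10)  # pad the name with blanks to ten characters
        # Only the entries below the diagonal are written for row i.
        values = "".join(f"  {dist[i][j]:.4f}" for j in range(i))
        lines.append(field + values)
    return "\n".join(lines) + "\n"
```

For example, `format_lower_triangular(["Alpha", "Beta", "Gamma"], [[0, 1, 2], [1, 0, 2], [2, 2, 0]])` yields a header line followed by one row per species, the first row holding only the padded name.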
+=== examples: ===
+  * sample input file with a square matrix: <code>
+    5
+Alpha      0.000 1.000 2.000 3.000 3.000
+Beta       1.000 0.000 2.000 3.000 3.000
+Gamma      2.000 2.000 0.000 3.000 3.000
+Delta      3.000 3.000 3.000 0.000 1.000
+Epsilon    3.000 3.000 3.000 1.000 0.000
+</code>
+  * sample lower-triangular input matrix with distances continuing to new lines as needed: <code>
+   14
+Mouse
+Bovine      1.7043
+Lemur       2.0235  1.1901
+Tarsier     2.1378  1.3287  1.2905
+Squir Monk  1.5232  1.2423  1.3199  1.7878
+Jpn Macaq   1.8261  1.2508  1.3887  1.3137  1.0642
+Rhesus Mac  1.9182  1.2536  1.4658  1.3788  1.1124  0.1022
+Crab-E.Mac  2.0039  1.3066  1.4826  1.3826  0.9832  0.2061  0.2681
+BarbMacaq   1.9431  1.2827  1.4502  1.4543  1.0629  0.3895  0.3930  0.3665
+Gibbon      1.9663  1.3296  1.8708  1.6683  0.9228  0.8035  0.7109  0.8132
+  0.7858
+Orang       2.0593  1.2005  1.5356  1.6606  1.0681  0.7239  0.7290  0.7894
+  0.7140  0.7095
+Gorilla     1.6664  1.3460  1.4577  1.5935  0.9127  0.7278  0.7412  0.8763
+  0.7966  0.5959  0.4604
+Chimp       1.7320  1.3757  1.7803  1.7119  1.0635  0.7899  0.8742  0.8868
+  0.8288  0.6213  0.5065  0.3502
+Human       1.7101  1.3956  1.6661  1.7599  1.0557  0.6933  0.7118  0.7589
+  0.8542  0.5612  0.4700  0.3097  0.2712
+</code>
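Reading such a file back is mostly a matter of knowing how many distances to expect per species. The sketch below is an illustration only, not the PHYLIP reader: `read_distance_matrix` is a hypothetical helper, the caller states the shape (`"lower"` without diagonal entries, or `"square"`), and the sample text reuses the first four species of the matrix above with one distance deliberately moved to a continuation line.

```python
LOWER = """\
    4
Mouse
Bovine      1.7043
Lemur       2.0235  1.1901
Tarsier     2.1378  1.3287
  1.2905
"""


def read_distance_matrix(text, shape="lower"):
    """Return (names, dist) where dist is a full square list-of-lists.

    shape: "lower" for lower-triangular input without the diagonal,
           "square" for a full matrix.
    """
    if shape not in ("lower", "square"):
        raise ValueError("shape must be 'lower' or 'square'")
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])  # first line: number of species
    rest = iter(lines[1:])
    names = []
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        line = next(rest)
        names.append(line[:10].strip())  # name field is ten characters wide
        values = [float(tok) for tok in line[10:].split()]
        expected = n if shape == "square" else i
        # Distances may continue on following lines after any value.
        while len(values) < expected:
            values.extend(float(tok) for tok in next(rest).split())
        if len(values) != expected:
            raise ValueError(f"{names[-1]!r}: expected {expected} distances")
        for j, d in enumerate(values):
            dist[i][j] = d
            if shape == "lower":
                dist[j][i] = d  # mirror into the upper triangle
    return names, dist
```

Because the name is taken from fixed columns rather than split on blanks, names that contain blanks (such as `Squir Monk`) are read correctly.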
 ===== How to cite =====