STRUCTURE

Version 2.2 (April 3, 2007)
The program structure implements a model-based clustering method for inferring population structure using genotype data consisting of unlinked markers. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed.

Program information

Written in C with a Java front end
UNIX
Windows
Mac OS X

Data type handled

SNP
Microsatellites
RFLP
AFLP
diploid/haploid

Input Files

The entire data set is arranged as a matrix in a single file, in which the data for individuals are in rows, and the loci are in columns. For a diploid organism, data for each individual can be stored either as 2 consecutive rows, where each locus is in one column, or in one row, where each locus is in two consecutive columns.
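The two diploid layouts described above carry the same information, so one can be converted into the other. The following is a minimal Python sketch of that conversion, not part of the program itself. The function name `two_rows_to_one_row` and the parameter `n_prefix` (the number of leading non-genotype columns, such as a label) are assumptions made for illustration.

```python
def two_rows_to_one_row(rows, n_prefix):
    """Merge each pair of consecutive rows (one diploid individual) into one row.

    rows     -- list of rows, each a list of strings (already split into fields)
    n_prefix -- hypothetical parameter: number of leading non-genotype columns
                (e.g. label, population) that are identical in both rows
    """
    if len(rows) % 2 != 0:
        raise ValueError("two-row format needs an even number of data rows")

    merged = []
    for first, second in zip(rows[0::2], rows[1::2]):
        if len(first) != len(second):
            raise ValueError("both rows of an individual must have equal length")
        if first[:n_prefix] != second[:n_prefix]:
            raise ValueError("consecutive rows belong to different individuals")

        out = list(first[:n_prefix])
        # Interleave alleles so each locus occupies two consecutive columns.
        for allele_1, allele_2 in zip(first[n_prefix:], second[n_prefix:]):
            out.extend([allele_1, allele_2])
        merged.append(out)
    return merged


two_row = [
    ["Paula", "106", "142"],
    ["Paula", "106", "148"],
]
one_row = two_rows_to_one_row(two_row, n_prefix=1)
print(one_row)  # [['Paula', '106', '106', '142', '148']]
```

Checking that both rows share the same leading columns is a design choice of this sketch; it catches files in which a row has been dropped or duplicated.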

rows:

Marker Names (Optional; string): The first row can contain a list of identifiers for each of the markers in the data set. This row contains L strings of integers or characters, where L is the number of loci.
Recessive Alleles (Data with dominant markers only; integer): Data sets of SNPs or microsatellites would generally not include this line. However, if the option RECESSIVEALLELES is set to 1, the program requires this row to indicate which allele (if any) is recessive at each marker.
Inter-Marker Distances (Optional; real numbers): The next row is a set of inter-marker distances, for use with linked loci (contains L real numbers). These should be genetic distances (e.g., centiMorgans), or some proxy for these based, for example, on physical distances. The markers must be in map order within linkage groups. When consecutive markers are from different linkage groups (e.g., different chromosomes), this should be indicated by the value -1. The first marker is also assigned the value -1. All other distances are non-negative.
Phase Information (Optional; diploid data only; real number in the range [0,1]): This is for use with the linkage model only. It is a single row of L probabilities that appears after the genotype data for each individual. There are two alternative representations for the phase information:

example:

Genotype data for seven diploid individuals at five loci, stored as two consecutive rows per individual:

            loc_a  loc_b  loc_c  loc_d  loc_e
George   1   -9     145     66     0     92
George   1   -9     -9      64     0     94
Paula    1   106    142     68     1     92
Paula    1   106    148     64     0     94
Matthew  2   110    145     -9     0     92
Matthew  2   110    148     66     1     -9
Bob      2   108    142     64     1     94
Bob      2   -9     142     -9     0     94
Anja     1   112    142     -9     1     -9
Anja     1   114    142     66     1     94
Peter    1   -9     145     66     0     -9
Peter    1   110    145     -9     1     -9
Carsten  2   108    145     62     0     -9
Carsten  2   110    145     64     1     92
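
A file in this layout can be read with a few lines of code. The Python sketch below parses whitespace-separated text with a marker-name row followed by data rows, and counts missing genotypes per locus. It rests on assumptions not stated above: that -9 marks missing data (as the example suggests), that each data row starts with two non-genotype columns (a label and what appears to be a population identifier), and that no optional rows other than the marker names are present. The names `parse_genotypes` and `missing_counts` are hypothetical.

```python
MISSING = "-9"  # assumption: the value used for missing data in the example

SAMPLE = """\
            loc_a  loc_b  loc_c  loc_d  loc_e
George   1   -9     145     66     0     92
George   1   -9     -9      64     0     94
Paula    1   106    142     68     1     92
Paula    1   106    148     64     0     94
"""


def parse_genotypes(text, n_prefix=2):
    """Return (marker_names, records) from whitespace-separated text.

    Each record is a (prefix_fields, alleles) pair, one per data row.
    n_prefix is the assumed number of leading non-genotype columns.
    """
    lines = [line.split() for line in text.splitlines() if line.strip()]
    markers = lines[0]
    records = []
    for fields in lines[1:]:
        if len(fields) != n_prefix + len(markers):
            raise ValueError(f"unexpected number of columns: {fields}")
        records.append((fields[:n_prefix], fields[n_prefix:]))
    return markers, records


def missing_counts(markers, records):
    """Count, for each locus, how many allele entries are missing."""
    counts = dict.fromkeys(markers, 0)
    for _prefix, alleles in records:
        for marker, allele in zip(markers, alleles):
            if allele == MISSING:
                counts[marker] += 1
    return counts


markers, records = parse_genotypes(SAMPLE)
print(missing_counts(markers, records))
# {'loc_a': 2, 'loc_b': 1, 'loc_c': 0, 'loc_d': 0, 'loc_e': 0}
```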

How to cite

The basic algorithm:
Pritchard, J. K., Stephens, M., and Donnelly, P. (2000a). Inference of population structure using multilocus genotype data. Genetics, 155:945–959.

Extensions to the method:
Falush, D., Stephens, M., and Pritchard, J. K. (2003a). Inference of population structure: Extensions to linked loci and correlated allele frequencies. Genetics, 164:1567–1587.
and
Falush, D., Stephens, M., and Pritchard, J. K. (2007). Inference of population structure using multilocus genotype data: dominant markers and null alleles. Molecular Ecology Notes.

Master's thesis, Heidi Lischer
