STRUCTURE

STRUCTURE documentation

Version 2.3.3 (January 2010)
The program STRUCTURE implements a model-based clustering method for inferring population structure using genotype data consisting of unlinked markers. Its applications include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed.

Program information

written in C with Java front end
UNIX
Windows
Mac OS X

Data type handled

SNP (numeric)
Microsatellites
RFLP
AFLP
diploid/haploid

Input Files

The entire data set is arranged as a matrix in a single file, in which the data for individuals are in rows, and the loci are in columns. For a diploid organism, data for each individual can be stored either as 2 consecutive rows, where each locus is in one column, or in one row, where each locus is in two consecutive columns.
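The two diploid layouts described above carry the same information, so one can be converted into the other. The following is a minimal sketch of that conversion; the function names and the allele values are illustrative assumptions and are not part of STRUCTURE itself.

```python
def two_rows_to_one_row(row_a, row_b):
    """Interleave two allele rows (one column per locus) into a single row
    with two consecutive columns per locus."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows must cover the same number of loci")
    one_row = []
    for allele_a, allele_b in zip(row_a, row_b):
        one_row.extend([allele_a, allele_b])
    return one_row


def one_row_to_two_rows(one_row):
    """Split a single row (two columns per locus) back into two rows."""
    if len(one_row) % 2 != 0:
        raise ValueError("a one-row diploid record needs two columns per locus")
    return one_row[0::2], one_row[1::2]


# Hypothetical diploid individual typed at five loci.
first_copy = [106, 142, 68, 1, 92]
second_copy = [106, 148, 64, 0, 94]

merged = two_rows_to_one_row(first_copy, second_copy)
print(merged)
print(one_row_to_two_rows(merged))
```

Label, PopData and other leading columns are left out here; in a real file they would appear once per row before the allele columns.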

rows:

Marker Names (Optional; string): The first row can contain a list of identifiers for each of the markers in the data set. This row contains L strings of integers or characters, where L is the number of loci.
Recessive Alleles (Data with dominant markers only; integer): Data sets of SNPs or microsatellites would generally not include this line. However, if the option RECESSIVEALLELES is set to 1, the program requires this row to indicate which allele (if any) is recessive at each marker.
Inter-Marker Distances (Optional; real numbers): The next row is a set of inter-marker distances, for use with linked loci (contains L real numbers). These should be genetic distances (e.g., centiMorgans), or some proxy for them based, for example, on physical distances. The markers must be in map order within linkage groups. When consecutive markers are from different linkage groups (e.g., different chromosomes), this should be indicated by the value -1. The first marker is also assigned the value -1. All other distances are non-negative.
Phase Information (Optional; diploid data only; real number in the range [0,1]): This is for use with the linkage model only. It is a single row of L probabilities that appears after the genotype data for each individual. There are two alternative representations for the phase information:
1. The two rows of data for an individual are assumed to correspond to the paternal and maternal contributions, respectively. The phase line indicates the probability that the ordering is correct at the current marker (set MARKOVPHASE=0).
2. The phase line indicates the probability that the phase of one allele relative to the previous allele is correct (set MARKOVPHASE=1).

The first entry should be filled in with 0.5 to fill out the line to L entries. For example, the following input represents a male with 5 unphased autosomal microsatellite loci followed by three X chromosome loci, using the maternal/paternal phase model (the 0.5 entries indicate that the autosomal loci are unphased, and the 1.0 entries indicate that the X chromosome loci have been maternally inherited with probability 1.0, and hence are phased):

102 156 165 101 143 105 104 101
100 148 163 101 143  -9  -9  -9
0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0
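A three-line block like the one above can be checked before it goes into an input file. The sketch below is an illustration only: the function name is hypothetical, and it assumes the block has no label or other leading columns. It verifies that the two genotype rows and the phase row all have L entries and that every phase value lies in [0,1].

```python
def parse_individual_with_phase(lines):
    """Parse two genotype rows plus one phase row for a single individual.

    Assumes whitespace-separated values and no label or population columns.
    """
    if len(lines) != 3:
        raise ValueError("expected two genotype rows followed by one phase row")
    row_a = [int(value) for value in lines[0].split()]
    row_b = [int(value) for value in lines[1].split()]
    phase = [float(value) for value in lines[2].split()]
    if not (len(row_a) == len(row_b) == len(phase)):
        raise ValueError("genotype rows and phase row must all have L entries")
    if any(p < 0.0 or p > 1.0 for p in phase):
        raise ValueError("phase values are probabilities and must lie in [0,1]")
    return row_a, row_b, phase


block = [
    "102 156 165 101 143 105 104 101",
    "100 148 163 101 143  -9  -9  -9",
    "0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0",
]
row_a, row_b, phase = parse_individual_with_phase(block)

# Loci whose phase is known with certainty (here the X chromosome loci).
phased_loci = [index for index, p in enumerate(phase) if p == 1.0]
print(phased_loci)
```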

Individual/Genotype data (Required): Data for each sampled individual are arranged into one or more rows.
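The rule for the inter-marker distance row (-1 for the first marker and at every change of linkage group, non-negative distances otherwise) can be sketched as follows. The function name, the tuple layout and the map positions are assumptions made for illustration; they are not taken from STRUCTURE.

```python
def inter_marker_distances(markers):
    """Build the distance row from (linkage_group, map_position) tuples.

    The markers are assumed to be listed in map order within each linkage
    group, with positions in a genetic-distance unit such as centiMorgans.
    """
    distances = []
    previous = None
    for group, position in markers:
        if previous is None or previous[0] != group:
            # First marker overall, or first marker of a new linkage group.
            distances.append(-1)
        else:
            gap = position - previous[1]
            if gap < 0:
                raise ValueError("markers must be in map order within a linkage group")
            distances.append(gap)
        previous = (group, position)
    return distances


# Hypothetical map: three markers on one chromosome, two on another.
markers = [("chr1", 0.0), ("chr1", 2.5), ("chr1", 7.0), ("chr2", 1.0), ("chr2", 4.0)]
distance_row = inter_marker_distances(markers)
print(" ".join(str(d) for d in distance_row))
```

The resulting row has one entry per marker, so it lines up with the L columns of the marker-name row.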

Individual/genotype data:

Each row of individual data contains the following elements. These form columns in the data file:

Label (Optional; string): A string of integers or characters used to designate each individual in the sample.
PopData (Optional; integer): An integer designating a user-defined population from which the individual was obtained.
PopFlag (Optional; 0 or 1): A Boolean flag indicating whether to use the PopData when using learning samples (1 = use, 0 = don't use).
Phenotype (Optional; integer): An integer designating the value of a phenotype of interest, for each individual.
Extra Columns (Optional; string): It may be convenient for the user to include additional data in the input file which are ignored by the program. These go here, and may be strings of integers or characters.
Genotype Data (Required; integer): Each allele at a given locus should be coded by a unique integer (e.g., microsatellite repeat score).

Missing genotype data:

Missing data should be indicated by a number that doesn't occur elsewhere in the data (often -9 by convention). The missing-data value is set along with the other parameters describing the characteristics of the data set.
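Because the missing-data value must not collide with a real allele code, it is worth checking the chosen value before writing the file. The sketch below is a hypothetical helper, not part of STRUCTURE, and assumes the genotypes are held in memory with None marking a missing allele.

```python
def encode_missing(allele_rows, missing_code=-9):
    """Replace None by missing_code, refusing codes that clash with real alleles."""
    observed = {allele for row in allele_rows for allele in row if allele is not None}
    if missing_code in observed:
        raise ValueError(
            f"missing-data value {missing_code} also occurs as a real allele"
        )
    return [
        [missing_code if allele is None else allele for allele in row]
        for row in allele_rows
    ]


# Hypothetical genotype rows with two missing alleles.
raw_rows = [[None, 145, 66], [110, None, 64]]
encoded_rows = encode_missing(raw_rows)
print(encoded_rows)
```

Calling the function with a value that already appears in the data, for example `encode_missing(raw_rows, missing_code=145)`, raises a ValueError instead of silently producing an ambiguous file.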

Example of genotype data:

            loc_a  loc_b  loc_c  loc_d  loc_e
George   1     -9    145     66      0     92
George   1     -9     -9     64      0     94
Paula    1    106    142     68      1     92
Paula    1    106    148     64      0     94
Matthew  2    110    145     -9      0     92
Matthew  2    110    148     66      1     -9
Bob      2    108    142     64      1     94
Bob      2     -9    142     -9      0     94
Anja     1    112    142     -9      1     -9
Anja     1    114    142     66      1     94
Peter    1     -9    145     66      0     -9
Peter    1    110    145     -9      1     -9
Carsten  2    108    145     62      0     -9
Carsten  2    110    145     64      1     92
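A file laid out like this example can be read with a few lines of code. The following sketch is an illustration only: it assumes a marker-name row followed by rows that hold a Label column, a PopData column and then one allele per locus, with two rows per individual and -9 as the missing-data value. The function name and the dictionary layout are hypothetical.

```python
SAMPLE = """\
            loc_a  loc_b  loc_c  loc_d  loc_e
George   1   -9     145     66     0     92
George   1   -9     -9      64     0     94
Paula    1   106    142     68     1     92
Paula    1   106    148     64     0     94
"""


def parse_two_row_file(text):
    """Return (marker_names, individuals) from a two-rows-per-individual file."""
    lines = [line for line in text.splitlines() if line.strip()]
    marker_names = lines[0].split()
    individuals = {}
    for line in lines[1:]:
        fields = line.split()
        label, population = fields[0], int(fields[1])
        alleles = [int(value) for value in fields[2:]]
        if len(alleles) != len(marker_names):
            raise ValueError(f"row for {label} does not have one allele per locus")
        record = individuals.setdefault(label, {"pop": population, "rows": []})
        record["rows"].append(alleles)
    return marker_names, individuals


marker_names, individuals = parse_two_row_file(SAMPLE)

# Count missing alleles per individual, assuming -9 is the missing-data value.
missing_counts = {
    label: sum(allele == -9 for row in record["rows"] for allele in row)
    for label, record in individuals.items()
}
print(marker_names)
print(missing_counts)
```

Splitting on whitespace means the column alignment in the file does not matter, only the order of the fields.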

How to cite

The basic algorithm:

Pritchard, J. K., Stephens, M., and Donnelly, P. (2000a). Inference of population structure using multilocus genotype data. Genetics, 155:945-959.

Extensions to the method:

Falush, D., Stephens, M., and Pritchard, J. K. (2003a). Inference of population structure: Extensions to linked loci and correlated allele frequencies. Genetics, 164:1567-1587.
Falush, D., Stephens, M., and Pritchard, J. K. (2007). Inference of population structure using multilocus genotype data: dominant markers and null alleles. Molecular Ecology Notes.
Hubisz, M. J., et al. (2009). Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources, 9:1322-1332.

Master's thesis, Heidi Lischer
