BAPS

BAPS manual

Version 5.4 (29.04.2010)
A program for Bayesian inference of the genetic structure in a population. It assigns individuals to genetic clusters by considering them either as immigrants (mixture analysis) or as descendants of immigrants (admixture analysis).

Program information

Windows XP/Vista/7 (32-bit, 64-bit)
Mac OS X Snow Leopard (64-bit)
Linux (32-bit)

Data type handled

haploid/diploid/(tetraploid)
DNA
SNP (sequence/numeric)
AFLP
Microsatellite
Standard (multi-allelic markers)

Input Files

Clustering of individuals:

BAPS format:

data matrix:
- columns: loci at which the individuals were observed
- rows: individuals
- additional column at the right end of the matrix: gives, on each row, the index of the individual whose alleles are presented on that row. There can be more than one row per individual (e.g. two rows for diploids)
alleles: indexed with any non-negative integer
individuals: indices start at 1 for the first individual and end at the total number of individuals
missing allele: negative integer (e.g. -999 or -9)
if the populations of the individuals are known: two additional files, one containing the names of the populations, the other containing the index of the first individual of each sampled population

example (cluster 5 diploid individuals. The first individual has alleles 5 and 7 at the first locus and so on. Individuals 1, 2 and 3 were sampled in America and individuals 4 and 5 in Europe):

data file:

5         2         1
7         2         1
5         8         2
3         9         2
2         5         3
-999      5         3
5         -999      4
2         3         4
3         8         5
2         5         5

name file:
```
American
European
```

Index file:
```
1
4
```
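The layout above can be read with a few lines of Python. This is only a minimal sketch, not part of BAPS: the function names `parse_baps_data` and `population_of` are hypothetical, and the sketch assumes whitespace-separated integers with the individual index in the last column, as described above.

```python
from collections import defaultdict


def parse_baps_data(text):
    """Group allele rows by individual index (the last column of each row)."""
    rows_by_individual = defaultdict(list)
    for line in text.strip().splitlines():
        fields = [int(token) for token in line.split()]
        if not fields:
            continue  # skip blank lines
        *alleles, individual = fields
        # Negative integers mark missing alleles; store them as None.
        rows_by_individual[individual].append(
            [a if a >= 0 else None for a in alleles]
        )
    return dict(rows_by_individual)


def population_of(individual, first_indices, names):
    """Find a population name from the index file and name file contents.

    first_indices holds the index of the first individual of each
    population, in increasing order (hypothetical helper).
    """
    population = None
    for first, name in zip(first_indices, names):
        if individual >= first:
            population = name
    return population


example = """
5 2 1
7 2 1
5 8 2
3 9 2
2 5 3
-999 5 3
5 -999 4
2 3 4
3 8 5
2 5 5
"""

data = parse_baps_data(example)
print(data[1])  # two rows for the diploid individual 1
print(population_of(4, [1, 4], ["American", "European"]))
```

Storing missing alleles as `None` keeps them from being mistaken for real allele indices in later steps.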

GENEPOP format:

GENEPOP
BAPS uses the labels of the first individuals of the populations as names for the populations

Clustering of groups of individuals:

BAPS format:

Like above
Last column contains the index of the group that is the origin of the alleles on the particular row (instead of specifying the individual)
the names of the groups can be given in a separate file

example (data from four distinct groups)

data file:

5     2     1
7     2     1
5     8     1
3     9     2
2     5     2
-999  5     3
5     -999  4
2     3     4
3     8     4
2     5     4

name file:
```
American
European
African
Asian
```

GENEPOP format:

like above

Trained clustering:

Must provide two data files:

reference individuals whose origins are known
sampling units (individuals or groups of individuals) that you wish to cluster

Both must be in GENEPOP format. The individuals in one population (separated by “pop”) in the reference data file correspond to individuals from a single origin.
In both data files all individuals should be given names. The program needs these names when it writes the output.

example (reference data from two populations (s, r). We wish to cluster three sampling units (1unit: ind1,…). If there is no relevant information for such pre-grouping of the data to be clustered, then every individual should be one sampling unit in the input data set):

reference data file:

--individuals with known origins--
loc1, loc2
pop
s1, 0307 0202
s2, 0303 0201
pop
r1, 0502 0401
r2, 0200 0404

file specifying the sampling units:

--sampling units--
loc1, loc2
pop 
ind1, 0404 0304
pop
ind2, 0307 0202
ind3, 0303 0102
pop
ind4, 0505 0404
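To check how a file of this kind splits into populations or sampling units, a small reader can help. The following is a rough sketch under simplifying assumptions (a title line, then locus names separated by commas or given one per line, then “pop” blocks with one `name, genotypes` line per individual). `parse_genepop` is a hypothetical helper, not a BAPS function.

```python
def parse_genepop(text):
    """Split GENEPOP-style text into a title, locus names and populations.

    Each population is a dict mapping an individual's name to its list of
    genotype strings (one string per locus).
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    title = lines[0]
    loci = []
    populations = []
    for line in lines[1:]:
        if line.lower() == "pop":
            # Every "pop" line opens a new population / sampling unit.
            populations.append({})
        elif not populations:
            # Lines before the first "pop" hold the locus names.
            loci.extend(n.strip() for n in line.split(",") if n.strip())
        else:
            name, genotypes = line.split(",", 1)
            populations[-1][name.strip()] = genotypes.split()
    return title, loci, populations


sampling_units = """
--sampling units--
loc1, loc2
pop
ind1, 0404 0304
pop
ind2, 0307 0202
ind3, 0303 0102
pop
ind4, 0505 0404
"""

title, loci, units = parse_genepop(sampling_units)
for number, unit in enumerate(units, start=1):
    print(number, sorted(unit))
```

With the sampling-unit file from the example, the second unit contains ind2 and ind3, which is the pre-grouping the text describes.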

Spatial clustering:

Same as the first two above, except for the coordinate values that need to be given in a separate file:

as many rows as there are individuals (spatial clustering of individuals → sampling coordinates of each individual) or groups (spatial clustering of groups → sampling coordinates of each group) in the molecular data set.
missing coordinate: two consecutive zeros

example:

Data file: see first example
Coordinate file:
```
172  88
155  96
180  78
0    0
-18  81
```
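A coordinate file of this form could be read as sketched below, treating a row of two zeros as a missing coordinate. `parse_coordinates` is a hypothetical name, and the sketch assumes exactly two whitespace-separated numbers per row.

```python
def parse_coordinates(text):
    """Return one (x, y) tuple per row, or None where the row is '0 0'."""
    coordinates = []
    for line in text.strip().splitlines():
        x, y = (float(token) for token in line.split())
        # Two consecutive zeros mark a missing coordinate.
        coordinates.append(None if x == 0 and y == 0 else (x, y))
    return coordinates


coords = parse_coordinates("""
172  88
155  96
180  78
0    0
-18  81
""")
missing = [i for i, c in enumerate(coords, start=1) if c is None]
print(missing)  # indices of individuals without coordinates
```

The number of rows returned should equal the number of individuals (or groups) in the molecular data file.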

Clustering of linked molecular data (sequence data):

MLST data format:

for prokaryotic organisms
first column: identifier, numbered consecutively from 1 to the number of isolates (unique for each isolate)
second column: unique ID label for each isolate (for printing results). The header can be either “Isolate” or “Strain”
third column (optional): provides a species or similar group name for the isolates
remaining columns: genes for which there are aligned sequences available
if header is given: columns can be in different order

example:

ST   Isolate     Species            Adk      GyrB     Hsp60    Mdh      Pgi      RecA
1    1A1         My.Splendidone     1        1        1        1        1        1
2    1B1         A.dent             2        2        2        2        2        2
…

For each chosen gene a corresponding FASTA file containing the aligned sequences for all included isolates is needed:

name for each sequence: >“Gene name”-“ST”
unknown bases: “?” or “-”
sequences within a single gene should be of equal length

example:

>RecA-1
CTAGGGCTTTAACCC--CATTTGCAGTACTGTCATGTCAGTGTACTATTTCAC
>RecA-2
CTAGGGCTTT-ACCCT-CATTTGCAGTACTGCCATGTCACTGTACTAATTCAC
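Before running an analysis it can be useful to confirm that a gene's FASTA file follows these rules. The sketch below is one possible check, not something BAPS provides: `read_fasta` and `check_gene_alignment` are hypothetical helpers, and the short sequences in the usage lines are made up for illustration.

```python
def read_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif not records:
            raise ValueError("sequence data found before the first header")
        else:
            records[-1][1] += line  # sequences may span several lines
    return [(name, sequence) for name, sequence in records]


def check_gene_alignment(records, gene):
    """Check the 'Gene-ST' naming and the equal-length rule for one gene.

    Returns the common alignment length, or raises ValueError.
    """
    lengths = set()
    for name, sequence in records:
        prefix, _, st = name.rpartition("-")
        if prefix != gene or not st.isdigit():
            raise ValueError(f"unexpected sequence name: {name}")
        lengths.add(len(sequence))
    if len(lengths) > 1:
        raise ValueError(f"{gene}: sequences differ in length: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


# Made-up sequences, with "?" and "-" as unknown bases.
records = read_fasta(">RecA-1\nCTAG--CA\n>RecA-2\nCTA?GGCA\n")
print(check_gene_alignment(records, "RecA"))
```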

BAPS data format:

haploid marker data (single data row per individual)
diploid marker data (two rows per individual)
tetraploid marker data (four rows per individual)

numeric data input format or a direct sequence based format:

numeric format:
- replace each of A, C, G, T with a unique integer and each missing value with a negative integer (e.g. -999)
- the individual index follows the sequence, separated by a space
- example: a single data row for individual 110 with sequence AACCG-T could look like this:
```
65 65 67 67 71 -999 84 110
```
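The integers in the example row happen to be the ASCII codes of the letters (A=65, C=67, G=71, T=84). Any other unique integers would satisfy the rule equally well, so the following conversion is just one convenient choice. `encode_sequence` is a hypothetical helper name.

```python
MISSING = -999  # negative integer used for unknown bases


def encode_sequence(sequence, individual):
    """Turn a DNA string into one numeric BAPS data row.

    A, C, G and T become their ASCII codes; every other character is
    treated as missing. The individual index is appended at the end.
    """
    codes = [ord(base) if base in "ACGT" else MISSING
             for base in sequence.upper()]
    return " ".join(str(value) for value in codes + [individual])


print(encode_sequence("AACCG-T", 110))
# 65 65 67 67 71 -999 84 110
```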

sequence format:

Individual Index after the sequence separated by a space

example (diploid):

ATTTGCCTACGTAGCCAATT 1
TTACCGACCTTAAAAACCTT 1
ATTTCCCAAAGGGTTTAAAA 2
TAACCGGACATAGCCAATAA 2

In contrast to the MLST format, the BAPS format requires you to concatenate the sequences from all considered genes into a single sequence and to tell the program about the gene boundaries in a separate file. The file of gene boundaries is built as follows:
- the number of rows equals the number of genes
- on each row, the integers refer to those columns of the data matrix that correspond to the specific gene
- additional zeros are used to fill the rows so that all rows have an equal number of columns

example (“linkage map”: 3 genes the first corresponding to the columns 1-10 in the data matrix and so on. Additional zeros result in a matrix having an equal number of columns for each row):

1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 0
20 21 22 23 24 25 26 27 0 0
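A gene-boundary file of this kind can be generated from the gene lengths alone. The sketch below assumes the genes were concatenated in the order given. `linkage_map` and `format_linkage_map` are hypothetical helper names.

```python
def linkage_map(gene_lengths):
    """Build zero-padded rows of 1-based column indices, one row per gene."""
    width = max(gene_lengths)
    rows = []
    start = 1
    for length in gene_lengths:
        columns = list(range(start, start + length))
        # Pad shorter genes with zeros so every row has the same width.
        rows.append(columns + [0] * (width - length))
        start += length
    return rows


def format_linkage_map(rows):
    """Render the rows as space-separated text, one gene per line."""
    return "\n".join(" ".join(str(value) for value in row) for row in rows)


# Three genes covering 10, 9 and 8 columns of the data matrix.
print(format_linkage_map(linkage_map([10, 9, 8])))
```

For gene lengths 10, 9 and 8 this reproduces the three rows of the example above.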

linked molecular marker data: should be formatted as haploid marker data for the other clustering modules (see previous). For marker data, each “gene” in the previous example should be replaced by a linkage group; the other aspects of the formatting stay the same

Admixture of individuals based on mixture clustering

Binary result file of mixture clustering

Admixture based on pre-defined clustering

BAPS format:

Similar to those used in the clustering of individuals
additional file contains the partition of the individuals:
- same number of rows as there are individuals
- each row holds an index (from 1 to the number of clusters) identifying the cluster to which the individual belongs
- individuals not pre-assigned to any cluster: -1

example (the first two individuals are assumed to form one cluster whose ID label is 1, individual 3 is not pre-assigned to either cluster and so on):

Data/name/index file: see example clustering of individuals
Partition file:
```
1
1
-1
2
2
```
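A partition file can be checked against these rules before it is handed to the program. The sketch below is illustrative only. `parse_partition` is a hypothetical helper, and it assumes that cluster indices must run from 1 to the number of clusters without gaps.

```python
def parse_partition(text, n_individuals):
    """Read a partition file and list the members of each cluster.

    Returns (clusters, unassigned): clusters maps a cluster index to the
    1-based indices of its individuals, and unassigned lists the
    individuals marked with -1.
    """
    labels = [int(token) for token in text.split()]
    if len(labels) != n_individuals:
        raise ValueError(
            f"expected {n_individuals} rows, found {len(labels)}"
        )
    clusters = {}
    unassigned = []
    for individual, label in enumerate(labels, start=1):
        if label == -1:
            unassigned.append(individual)
        elif label >= 1:
            clusters.setdefault(label, []).append(individual)
        else:
            raise ValueError(f"invalid cluster index: {label}")
    # Assumption: cluster indices run from 1 to the number of clusters.
    if sorted(clusters) != list(range(1, len(clusters) + 1)):
        raise ValueError("cluster indices must run from 1 to the number of clusters")
    return clusters, unassigned


clusters, unassigned = parse_partition("1\n1\n-1\n2\n2\n", n_individuals=5)
print(clusters, unassigned)
```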

Genepop format:

Similar to those used in the clustering of individuals
here the populations of individuals in the data (separated by “pop”) are used to define the partition of individuals

How to cite

Tang J, Hanage WP, Fraser C, Corander J. (2009). Identifying currents in the gene pool for bacterial populations using an integrative approach. PLoS Computational Biology, 5(8): e1000455.
Corander, J., Waldmann, P., Marttinen, P. and Sillanpää, M.J. (2004). BAPS 2: enhanced possibilities for the analysis of genetic population structure, Bioinformatics, 20, 2363-2369.
