====== BAPS ======
**[[http://web.abo.fi/fak/mnf//mate/jc/software/baps.html|BAPS]]**\\
[[http://web.abo.fi/fak/mnf//mate/jc/software/BAPS5manual.pdf|manual]]
\\
Version 5.4 (29.04.2010)\\
A program for Bayesian inference of the genetic structure in a population. It assigns individuals to genetic clusters by considering them either as immigrants (mixture analysis) or as descendants of immigrants (admixture analysis).
===== Program information =====
* Windows XP/Vista/7 (32-bit, 64-bit)
* Mac OS X Snow Leopard (64-bit)
* Linux (32-bit)
===== Data type handled =====
* haploid/diploid/(tetraploid)
* DNA
* SNP (sequence/numeric)
* AFLP
* Microsatellite
* Standard (multi-allelic markers)
===== Input Files =====
==== Clustering of individuals: ====
=== BAPS format: ===
* data matrix:
* columns: loci at which the individuals were observed
* rows: individuals
* additional column at the right end of the matrix: contains on each row the index of the individual whose alleles are presented on that row. There can be more than one row per individual (e.g. in diploids)
* alleles: indexed with any non-negative integer
* individuals: indices start with 1 for the first individual and end with the value that corresponds to the total number of individuals
* missing allele: negative integer (e.g. -999 or -9)
* if the populations of the individuals are known: two additional files, one containing the names of the populations, the other containing the indices of the first individual of each sampling population
\\
**example** (cluster 5 diploid individuals. The first individual has alleles 5 and 7 at the first locus and so on. Individuals 1, 2 and 3 were sampled in America and individuals 4 and 5 in Europe):
* data file:
5 2 1
7 2 1
5 8 2
3 9 2
2 5 3
-999 5 3
5 -999 4
2 3 4
3 8 5
2 5 5
* name file:
American
European
* Index file:
1
4
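As a rough illustration of how the three files fit together, the following Python sketch parses the example above. The function names (''parse_baps_matrix'', ''population_of'') are made up for this page and are not part of BAPS; the file contents are embedded as strings to keep the sketch self-contained.

```python
from bisect import bisect_right
from collections import defaultdict

# Contents of the example data file (last column = individual index).
DATA = """\
5 2 1
7 2 1
5 8 2
3 9 2
2 5 3
-999 5 3
5 -999 4
2 3 4
3 8 5
2 5 5
"""
NAMES = ["American", "European"]   # name file
FIRST_INDICES = [1, 4]             # index file


def parse_baps_matrix(text):
    """Return {individual index: [allele rows]}; negative alleles become None."""
    rows = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        *alleles, individual = (int(x) for x in line.split())
        rows[individual].append([a if a >= 0 else None for a in alleles])
    return dict(rows)


def population_of(individual, names, first_indices):
    """Find the population whose first-individual index is the last one <= individual."""
    return names[bisect_right(first_indices, individual) - 1]


individuals = parse_baps_matrix(DATA)
for ind in sorted(individuals):
    print(ind, population_of(ind, NAMES, FIRST_INDICES), individuals[ind])
```

Treating missing alleles as ''None'' is a choice made for this sketch only; BAPS itself just expects a negative integer in the file.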
=== GENEPOP format: ===
* [[GENEPOP]]
* BAPS uses the labels of the first individuals of the populations as names for the populations
\\
==== Clustering of groups of individuals: ====
=== BAPS format: ===
* Like above
* Last column contains the index of the group that is the origin of the alleles on the particular row (instead of specifying the individual)
* the names of the groups can be given in a separate file
\\
**example** (data from four distinct groups)
* data file:
5 2 1
7 2 1
5 8 1
3 9 2
2 5 2
-999 5 3
5 -999 4
2 3 4
3 8 4
2 5 4
* name file:
American
European
African
Asian
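A quick way to sanity-check such a file is to count the rows per group. The sketch below does this for the example above. It assumes that the n-th line of the name file names group n, which the description above does not state explicitly, and ''rows_per_group'' is a hypothetical helper, not a BAPS function.

```python
from collections import Counter

# Contents of the example data file (last column = group index).
DATA = """\
5 2 1
7 2 1
5 8 1
3 9 2
2 5 2
-999 5 3
5 -999 4
2 3 4
3 8 4
2 5 4
"""
# Assumption: line n of the name file belongs to group n.
GROUP_NAMES = ["American", "European", "African", "Asian"]


def rows_per_group(text):
    """Count how many data rows carry each group index."""
    counts = Counter()
    for line in text.splitlines():
        if line.strip():
            counts[int(line.split()[-1])] += 1
    return counts


counts = rows_per_group(DATA)
for group in sorted(counts):
    print(GROUP_NAMES[group - 1], counts[group])
```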
=== GENEPOP format: ===
* like above
\\
==== Trained clustering: ====
Must provide two data files:
* reference individuals whose origins are known
* sampling units (individuals or groups of individuals) that you wish to cluster
\\
* both in [[GENEPOP]] format. Individuals in one population (separated by “pop”) in reference data file correspond to individuals from a single origin.
* in both data files all individuals should be given names. These names will be needed by the program when the output is written
\\
**example** (reference data from two populations (s, r). We wish to cluster three sampling units (unit 1: ind1, …). If no relevant information is available for pre-grouping the data to be clustered, every individual should be its own sampling unit in the input data set):
* reference data file:
--individuals with known origins--
loc1, loc2
pop
s1, 0307 0202
s2, 0303 0201
pop
r1, 0502 0401
r2, 0200 0404
* file specifying the sampling units:
--sampling units--
loc1, loc2
pop
ind1, 0404 0304
pop
ind2, 0307 0202
ind3, 0303 0102
pop
ind4, 0505 0404
\\
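To make the role of the "pop" separators concrete, here is a minimal Python sketch that splits a GENEPOP-style file into its title, locus names and blocks of individuals. It only handles the simple layout used in the example above, and ''parse_genepop'' is a name chosen for this page, not something BAPS provides.

```python
# Contents of the example file specifying the sampling units.
SAMPLING_UNITS = """\
--sampling units--
loc1, loc2
pop
ind1, 0404 0304
pop
ind2, 0307 0202
ind3, 0303 0102
pop
ind4, 0505 0404
"""


def parse_genepop(text):
    """Return (title, locus names, list of blocks); each block is a list of
    (individual name, [genotype strings]) found after one "pop" line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0]
    loci, blocks = [], []
    for line in lines[1:]:
        if line.lower() == "pop":
            blocks.append([])            # a new origin / sampling unit starts
        elif not blocks:
            # Lines before the first "pop" are locus names.
            loci.extend(n.strip() for n in line.split(",") if n.strip())
        else:
            name, genotypes = line.split(",", 1)
            blocks[-1].append((name.strip(), genotypes.split()))
    return title, loci, blocks


title, loci, units = parse_genepop(SAMPLING_UNITS)
for number, unit in enumerate(units, start=1):
    print("sampling unit", number, [name for name, _ in unit])
```

The same function can be applied to the reference file, in which case each block corresponds to one known origin.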
==== Spatial clustering: ====
Same as the first two above, except for the coordinate values that need to be given in a separate file:
* as many rows as there are individuals (spatial clustering of individuals -> sampling coordinates of each individual) or groups (spatial clustering of groups -> sampling coordinates of each group) in the molecular data set.
* missing coordinate: two consecutive zeros
\\
**example:**
* Data file: see first example
* Coordinate file:
172 88
155 96
180 78
0 0
-18 81
\\
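The following sketch reads a coordinate file like the one above and marks the "0 0" rows as missing. ''parse_coordinates'' is a hypothetical helper written for this page; representing a missing coordinate as ''None'' is an assumption of the sketch.

```python
# Contents of the example coordinate file (one row per individual or group).
COORDS = """\
172 88
155 96
180 78
0 0
-18 81
"""


def parse_coordinates(text):
    """Return one (x, y) tuple per row, or None where the row is "0 0"."""
    coords = []
    for line in text.splitlines():
        if not line.strip():
            continue
        x, y = (float(v) for v in line.split())
        coords.append(None if x == 0 and y == 0 else (x, y))
    return coords


coords = parse_coordinates(COORDS)
missing = [i for i, c in enumerate(coords, start=1) if c is None]
print("rows with missing coordinates:", missing)
```

Before running BAPS it is worth checking that the number of coordinate rows equals the number of individuals (or groups) in the data file.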
==== Clustering of linked molecular data (sequence data): ====
=== MLST data format: ===
* for prokaryotic organisms
* first column: identifier where the numbering should go linearly from 1 to number of isolates (unique for each)
* second column: unique ID label for each isolate (for printing results). The header could either be “Isolate” or “Strain”
* third column (optional): provides a species or similar group name for the isolates
* remaining columns: genes for which there are aligned sequences available
* if header is given: columns can be in different order
**example:**
ST Isolate Species Adk GyrB Hsp60 Mdh Pgi RecA
1 1A1 My.Splendidone 1 1 1 1 1 1
2 1B1 A.dent 2 2 2 2 2 2
…
For each chosen gene a corresponding [[FASTA]] file containing the aligned sequences for all included isolates is needed:
* name for each sequence: >“Gene name”-“ST”
* unknown bases: “?” or “-”
* sequences within a single gene should be of equal length
**example:**
>RecA-1
CTAGGGCTTTAACCC--CATTTGCAGTACTGTCATGTCAGTGTACTATTTCAC
>RecA-2
CTAGGGCTTT-ACCCT-CATTTGCAGTACTGCCATGTCACTGTACTAATTCAC
\\
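The naming and length rules for the per-gene FASTA files can be checked before loading the data. The sketch below is one possible check, using short made-up sequences rather than real data; ''read_fasta'' and ''check_gene_alignment'' are hypothetical names, and requiring the ST part to be an integer follows from the first-column rule above.

```python
def read_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_gene_alignment(records, gene):
    """Collect problems: wrong names, repeated STs, unequal sequence lengths."""
    problems, seen, lengths = [], set(), set()
    for name, seq in records:
        prefix, _, st = name.rpartition("-")
        if prefix != gene or not st.isdigit():
            problems.append(f"bad name: {name}")
        elif st in seen:
            problems.append(f"duplicate ST: {name}")
        seen.add(st)
        lengths.add(len(seq))
    if len(lengths) > 1:
        problems.append(f"unequal lengths: {sorted(lengths)}")
    return problems


# Toy alignments (invented for illustration only).
good = read_fasta(">RecA-1\nACGT-A\n>RecA-2\nAC?TTA\n")
bad = read_fasta(">RecA-1\nACGT\n>RecA-1\nACG\n")
print(check_gene_alignment(good, "RecA"))
print(check_gene_alignment(bad, "RecA"))
```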
=== BAPS data format: ===
* haploid marker data (single data row per individual)
* diploid marker data (two rows per individual)
* tetraploid marker data (four rows per individual)
\\
numeric data input format or a direct sequence based format:
* **numeric format:**
* replacing each of A,C,G,T with a unique integer and missing values with a negative integer (-999)
* Individual Index after the sequence separated by a space
* example: a single data row for individual 110 with sequence AACCG-T could look like this:
65 65 67 67 71 -999 84 110
* **sequence format:**
* Individual Index after the sequence separated by a space
* example (diploid):
ATTTGCCTACGTAGCCAATT 1
TTACCGACCTTAAAAACCTT 1
ATTTCCCAAAGGGTTTAAAA 2
TAACCGGACATAGCCAATAA 2
\\
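A sequence-format row can be converted to the numeric format with a few lines of Python. The sketch below uses the character codes of A, C, G and T as the "unique integers", which happens to reproduce the numeric example above (A=65, C=67, G=71, T=84); any other one-to-one mapping to non-negative integers would serve equally well. ''to_numeric_row'' is a hypothetical helper, not part of BAPS.

```python
MISSING = -999  # negative integer used for missing or unknown bases


def to_numeric_row(sequence, individual):
    """Encode one sequence as a BAPS numeric data row ending in the individual index."""
    codes = [ord(base) if base in "ACGT" else MISSING for base in sequence.upper()]
    return " ".join(str(value) for value in codes + [individual])


print(to_numeric_row("AACCG-T", 110))
```

For diploid or tetraploid data the function would simply be called once per row, with the same individual index repeated.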
* In contrast to the MLST format, the BAPS format requires you to concatenate the sequences from all considered genes into a single sequence and to specify the gene boundaries in a separate file. The gene boundary file:
* number of rows equals the number of genes
* at each row, the integers refer to those columns of the data matrix that correspond to the specific gene
* additional zeros are used to fill the rows so that all rows have an equal number of columns
**example** (“linkage map”: 3 genes the first corresponding to the columns 1-10 in the data matrix and so on. Additional zeros result in a matrix having an equal number of columns for each row):
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 0
20 21 22 23 24 25 26 27 0 0
* linked molecular marker data: should be formatted as haploid marker data for the other clustering modules (see previous). For marker data each “gene” in the previous example should be replaced by a linkage group and the other aspects of the formatting are kept equal
\\
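The gene boundary file can be generated from the lengths of the concatenated genes. The sketch below rebuilds the three-gene example above from the lengths 10, 9 and 8; ''gene_boundary_rows'' is a name invented for this page.

```python
def gene_boundary_rows(gene_lengths):
    """Return one row of 1-based column indices per gene, zero-padded to equal width."""
    width = max(gene_lengths)
    rows, start = [], 1
    for length in gene_lengths:
        columns = list(range(start, start + length))
        rows.append(columns + [0] * (width - length))
        start += length
    return rows


for row in gene_boundary_rows([10, 9, 8]):
    print(" ".join(str(value) for value in row))
```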
==== Admixture of individuals based on mixture clustering ====
Binary result file of mixture clustering
==== Admixture based on pre-defined clustering ====
=== BAPS format: ===
* Similar to those used in the clustering of individuals
* additional file contains the partition of the individuals:
* same numbers of rows as there are individuals
* each row contains an index (from 1 to the number of clusters) identifying the cluster to which the individual belongs
* individuals not pre-assigned to any cluster: -1
\\
**example** (The first two individuals are assumed to form one cluster whose ID label is 1, individual 3 is not pre-assigned to any cluster, and so on):
* Data/name/index file: see example clustering of individuals
* Partition file:
1
1
-1
2
2
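For illustration, the partition file above can be read into a cluster-to-individuals mapping as follows. This is only a sketch; ''parse_partition'' is a hypothetical helper, and the row number is taken as the individual index, as described above.

```python
# Contents of the example partition file (row n belongs to individual n).
PARTITION = """\
1
1
-1
2
2
"""


def parse_partition(text):
    """Return ({cluster label: [individual indices]}, [unassigned individuals])."""
    clusters, unassigned = {}, []
    rows = (line for line in text.splitlines() if line.strip())
    for individual, line in enumerate(rows, start=1):
        label = int(line)
        if label == -1:
            unassigned.append(individual)   # not pre-assigned to any cluster
        else:
            clusters.setdefault(label, []).append(individual)
    return clusters, unassigned


clusters, unassigned = parse_partition(PARTITION)
print(clusters, unassigned)
```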
=== Genepop format: ===
* Similar to those used in the clustering of individuals
* here the populations of individuals in the data (separated by “pop”) are used to define the partition of individuals
===== How to cite =====
Tang J, Hanage WP, Fraser C, Corander J. (2009). Identifying currents in the gene pool for bacterial populations using an integrative approach. PLoS Computational Biology, 5(8): e1000455.
\\
Corander, J., Waldmann, P., Marttinen, P. and Sillanpää, M.J. (2004). BAPS 2: enhanced possibilities for the analysis of genetic population structure, Bioinformatics, 20, 2363-2369.