====== BAPS ======
**[[http://web.abo.fi/fak/mnf//mate/jc/software/baps.html|BAPS]]**\\
[[http://web.abo.fi/fak/mnf//mate/jc/software/BAPS5manual.pdf|manual]]

\\
Version 5.4 (29.04.2010)\\
A program for Bayesian inference of the genetic structure in a population. It assigns individuals to genetic clusters by treating them either as immigrants (mixture analysis) or as descendants of immigrants (admixture analysis).


===== Program information =====
  * Windows XP/Vista/7 (32-bit, 64-bit)
  * Mac OS X Snow Leopard (64-bit)
  * Linux (32-bit)


===== Data type handled =====
  * haploid/diploid/(tetraploid)
  * DNA
  * SNP (sequence/numeric)
  * AFLP
  * Microsatellite
  * Standard (multi-allelic markers)


===== Input Files =====


==== Clustering of individuals: ====
===BAPS format:===
  * data matrix:
    * columns: loci at which the individuals were observed
    * rows: individuals
    * additional column at the right end of the matrix: gives, on each row, the index of the individual whose alleles appear on that row. There can be more than one row per individual (e.g. two rows for diploids)
  * alleles: indexed with any non-negative integer
  * individuals: indices start at 1 for the first individual and end at the total number of individuals
  * missing allele: negative integer (e.g. -999 or -9)
  * if the sampling populations of the individuals are known: two additional files, one containing the names of the populations, the other containing the index of the first individual of each sampling population

\\
**example** (cluster 5 diploid individuals. The first individual has alleles 5 and 7 at the first locus and so on. Individuals 1, 2 and 3 were sampled in America and individuals 4 and 5 in Europe):
 
  * data file: <code>
5         2         1
7         2         1
5         8         2
3         9         2
2         5         3
-999      5         3
5         -999      4
2         3         4
3         8         5
2         5         5
</code>

  * name file: <code>
American
European
</code>

  * Index file: <code>
1
4
</code>
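A minimal Python sketch of how the three files above fit together, assuming whitespace-separated integers as in the example. The helper names ''parse_baps_rows'' and ''population_of'' are hypothetical and not part of BAPS; representing missing alleles as ''None'' is likewise a choice made for this sketch only.

```python
from bisect import bisect_right
from collections import defaultdict

# Data file from the example above: two rows per diploid individual.
DATA = """\
5         2         1
7         2         1
5         8         2
3         9         2
2         5         3
-999      5         3
5         -999      4
2         3         4
3         8         5
2         5         5
"""


def parse_baps_rows(text):
    """Group allele rows by the individual index in the last column.

    Negative integers (missing alleles) are stored as None.
    """
    individuals = defaultdict(list)
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        *alleles, index = (int(tok) for tok in tokens)
        individuals[index].append([a if a >= 0 else None for a in alleles])
    return dict(individuals)


def population_of(individual, first_indices, names):
    """Find the population name using the index file and the name file.

    first_indices holds the index of the first individual of each population.
    """
    if individual < first_indices[0]:
        raise ValueError("individual index precedes the first population")
    return names[bisect_right(first_indices, individual) - 1]


individuals = parse_baps_rows(DATA)
print(individuals[1])                                         # rows of individual 1
print(population_of(3, [1, 4], ["American", "European"]))     # American
```

Grouping by the last column rather than by row position means the same helper works for haploid, diploid and tetraploid data.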

=== GENEPOP format: ===
  * [[GENEPOP]]
  * BAPS uses the labels of the first individuals of the populations as names for the populations

\\


==== Clustering of groups of individuals: ====
=== BAPS format: ===
  * Like above
  * Last column contains the index of the group that is the origin of the alleles on the particular row (instead of specifying the individual)
  * the names of the groups can be given in a separate file

\\
**example** (data from four distinct groups)
  * data file:<code>
5     2     1
7     2     1
5     8     1
3     9     2
2     5     2
-999  5     3
5     -999  4
2     3     4
3     8     4
2     5     4
</code>

  * name file:<code>
American
European
African
Asian
</code>

=== GENEPOP format: ===
  * like above

\\

==== Trained clustering: ====
Must provide two data files:
  * reference individuals whose origins are  known
  * sampling units (individuals or groups of individuals) that you wish to cluster
\\
  * both files must be in [[GENEPOP]] format. In the reference data file, the individuals of one population (delimited by “pop”) are individuals from a single origin.
  * in both data files every individual should be given a name. The program needs these names when it writes the output

\\
**example** (reference data from two populations, s and r. We wish to cluster three sampling units: the first contains ind1, the second ind2 and ind3, the third ind4. If there is no relevant information for pre-grouping the data to be clustered, every individual should be its own sampling unit in the input data set):

  * reference data file: <code>
--individuals with known origins--
loc1, loc2
pop
s1, 0307 0202
s2, 0303 0201
pop
r1, 0502 0401
r2, 0200 0404
</code>

  * file specifying the sampling units: <code>
--sampling units--
loc1, loc2
pop 
ind1, 0404 0304
pop
ind2, 0307 0202
ind3, 0303 0102
pop
ind4, 0505 0404
</code>
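The layout of the two files can be checked with a short Python sketch. It only covers the layout shown in the examples above (a title line, locus names, “pop” separators and ''name, genotypes'' rows) and is not a full GENEPOP parser; ''parse_genepop'' is a hypothetical helper name.

```python
SAMPLING_UNITS = """\
--sampling units--
loc1, loc2
pop
ind1, 0404 0304
pop
ind2, 0307 0202
ind3, 0303 0102
pop
ind4, 0505 0404
"""


def parse_genepop(text):
    """Return (title, loci, populations) for the layout shown above.

    populations is a list with one entry per "pop" block; each entry is a
    list of (individual name, list of genotype strings) pairs.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0]
    loci = []
    i = 1
    # Locus names come before the first "pop" line, comma-separated
    # on one line or spread over several lines.
    while i < len(lines) and lines[i].lower() != "pop":
        loci.extend(n.strip() for n in lines[i].split(",") if n.strip())
        i += 1
    populations = []
    for line in lines[i:]:
        if line.lower() == "pop":
            populations.append([])
            continue
        name, genotypes = line.split(",", 1)
        if not name.strip():
            raise ValueError("every individual needs a name")
        populations[-1].append((name.strip(), genotypes.split()))
    return title, loci, populations


title, loci, units = parse_genepop(SAMPLING_UNITS)
for number, unit in enumerate(units, start=1):
    print(number, [name for name, _ in unit])
```

Each “pop” block becomes one sampling unit here; in the reference data file the same blocks would be the known origins.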

\\


==== Spatial clustering: ====
Same input as for the first two clustering modes above (individuals or groups of individuals), plus a separate file with the coordinate values:
  * as many rows as there are individuals (spatial clustering of individuals -> sampling coordinates of each individual) or groups (spatial clustering of groups -> sampling coordinates of each group) in the molecular data set.
  * missing coordinate: two consecutive zeros

\\
**example:**
  * Data file: see first example
  * Coordinate file: <code>
172  88
155  96
180  78
0    0
-18  81
</code>
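A small Python sketch of reading such a coordinate file, assuming two whitespace-separated numbers per row as in the example. ''parse_coordinates'' is a hypothetical helper, and returning ''None'' for a missing coordinate is a convention of this sketch only.

```python
COORDINATES = """\
172  88
155  96
180  78
0    0
-18  81
"""


def parse_coordinates(text):
    """Return one (x, y) tuple per row, or None where both values are zero."""
    coords = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        x, y = (float(tok) for tok in tokens)
        coords.append(None if x == 0 and y == 0 else (x, y))
    return coords


coords = parse_coordinates(COORDINATES)
n_individuals = 5  # number of individuals in the first example data file
if len(coords) != n_individuals:
    raise ValueError("the coordinate file needs one row per individual or group")
print(coords)
```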

\\


==== Clustering of linked molecular data (sequence data): ====
=== MLST data format: ===
  * for prokaryotic organisms
  * first column: identifier, numbered consecutively from 1 to the number of isolates (unique for each isolate)
  * second column: unique ID label for each isolate (used when printing results). The header can be either “Isolate” or “Strain”
  * third column (optional): provides a species or similar group name for the isolates
  * remaining columns: genes for which there are aligned sequences available
  * if header is given: columns can be in different order

**example:** <code>
ST   Isolate     Species            Adk      GyrB     Hsp60    Mdh      Pgi      RecA
1    1A1         My.Splendidone     1        1        1        1        1        1
2    1B1         A.dent             2        2        2        2        2        2
…
</code>

For each chosen gene a corresponding [[FASTA]] file containing the aligned sequences for all included isolates is needed:
  * name for each sequence: >“Gene name”-“ST”
  * unknown bases: “?” or “-”
  * sequences within a single gene should be of equal length 

**example:** <code>
>RecA-1
CTAGGGCTTTAACCC--CATTTGCAGTACTGTCATGTCAGTGTACTATTTCAC
>RecA-2
CTAGGGCTTT-ACCCT-CATTTGCAGTACTGCCATGTCACTGTACTAATTCAC
</code>
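The naming and length rules for the per-gene FASTA files can be checked before running BAPS. The following Python sketch is one way to do so; ''read_fasta'' and ''check_gene_fasta'' are hypothetical helpers, and the header names in the sample text are illustrative.

```python
RECA_FASTA = """\
>RecA-1
CTAGGGCTTTAACCC--CATTTGCAGTACTGTCATGTCAGTGTACTATTTCAC
>RecA-2
CTAGGGCTTT-ACCCT-CATTTGCAGTACTGCCATGTCACTGTACTAATTCAC
"""


def read_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line  # sequences may span several lines
        else:
            raise ValueError("sequence data before the first '>' header")
    return [(name, seq) for name, seq in records]


def check_gene_fasta(records, gene, n_isolates):
    """Return a list of problems; an empty list means the file looks consistent."""
    problems = []
    names = sorted(name for name, _ in records)
    expected = sorted(f"{gene}-{st}" for st in range(1, n_isolates + 1))
    if names != expected:
        problems.append(f"expected names {expected}, found {names}")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    return problems


print(check_gene_fasta(read_fasta(RECA_FASTA), "RecA", 2))  # []
```

Keeping the records as a list rather than a dictionary means a repeated header is reported instead of silently overwriting an earlier sequence.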

\\
=== BAPS data format: ===
  * haploid marker data (single data row per individual)
  * diploid marker data (two rows per individual)
  * tetraploid marker data (four rows per individual)

\\
The data can be given in a numeric input format or in a direct sequence-based format:
  * **numeric format:** 
    * replace each of A, C, G, T with a unique integer and each missing value with a negative integer (e.g. -999)
    * the individual index follows the sequence, separated by a space
    * example: a single data row for individual 110 with sequence AACCG-T could look like this: <code>
65 65 67 67 71 -999 84 110
</code>

  * **sequence format:** 
    * Individual Index after the sequence separated by a space
    * example (diploid): <code>
ATTTGCCTACGTAGCCAATT 1
TTACCGACCTTAAAAACCTT 1
ATTTCCCAAAGGGTTTAAAA 2
TAACCGGACATAGCCAATAA 2
</code>
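The two formats carry the same information, so a row in sequence format can be turned into a row in numeric format. The Python sketch below does this; the choice of integers (65, 67, 71, 84) simply mirrors the numeric example above and is an assumption, since any unique integer per base would satisfy the rule. ''to_numeric_row'' and ''sequence_to_numeric'' are hypothetical helper names.

```python
# Integers taken from the numeric example above; any unique
# non-negative integer per base would do.
BASE_CODES = {"A": 65, "C": 67, "G": 71, "T": 84}
MISSING = -999  # negative integer marking a missing value


def to_numeric_row(sequence, individual):
    """Encode one sequence and append the individual index."""
    codes = [BASE_CODES.get(base, MISSING) for base in sequence.upper()]
    return " ".join(str(value) for value in codes + [individual])


def sequence_to_numeric(text):
    """Convert rows of the form 'SEQUENCE index' to numeric rows."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        sequence, index = tokens
        rows.append(to_numeric_row(sequence, int(index)))
    return "\n".join(rows)


print(to_numeric_row("AACCG-T", 110))  # 65 65 67 67 71 -999 84 110
```

Every character other than A, C, G or T is encoded as missing in this sketch.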

\\
  * In contrast to the MLST format, the BAPS format requires you to concatenate the sequences from all considered genes into a single sequence and to tell the program about the gene boundaries in a separate file. In the gene boundary file:
    * the number of rows equals the number of genes
    * on each row, the integers refer to the columns of the data matrix that correspond to that gene
    * additional zeros pad the shorter rows so that all rows have an equal number of columns

**example** (“linkage map”: 3 genes, the first corresponding to columns 1-10 of the data matrix, and so on. The additional zeros give every row the same number of columns): <code>
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 0
20 21 22 23 24 25 26 27 0 0
</code>
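When the aligned length of each gene is known, the gene boundary file can be generated rather than typed by hand. A minimal Python sketch follows, assuming the genes are concatenated in the order given; ''linkage_map'' is a hypothetical helper name.

```python
def linkage_map(gene_lengths):
    """Build the rows of the gene boundary file.

    gene_lengths lists the number of data matrix columns per gene, in the
    order in which the genes were concatenated.
    """
    width = max(gene_lengths)
    rows = []
    start = 1  # column indices in the data matrix start at 1
    for length in gene_lengths:
        columns = list(range(start, start + length))
        columns += [0] * (width - length)  # pad shorter rows with zeros
        rows.append(" ".join(str(c) for c in columns))
        start += length
    return "\n".join(rows)


# Three genes of 10, 9 and 8 columns reproduce the example above.
print(linkage_map([10, 9, 8]))
```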

  * linked molecular marker data: should be formatted as haploid marker data for the other clustering modules (see previous). For marker data each “gene” in the previous example should be replaced by a linkage group and the other aspects of the formatting are kept equal

\\

==== Admixture of individuals based on mixture clustering ====
Binary result file of mixture clustering


==== Admixture based on pre-defined clustering ====
=== BAPS format: ===
  * Input files similar to those used in the clustering of individuals
  * an additional file contains the partition of the individuals:
    * same number of rows as there are individuals
    * each row holds an index (from 1 to the number of clusters) identifying the cluster to which the individual belongs
    * individuals not pre-assigned to any cluster: -1

\\
**example** (The first two individuals are assumed to form one cluster whose ID label is 1, individual 3 is not pre-assigned to either cluster, and so on):
  * Data/name/index file: see example clustering of individuals
  * Partition file: <code>
1
1
-1
2
2
</code>
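A partition file of this shape can be written from a mapping of individuals to clusters. The Python sketch below assumes such a mapping is available; ''partition_rows'' is a hypothetical helper name.

```python
def partition_rows(n_individuals, assignments):
    """Return one row per individual: its cluster index, or -1 if unassigned.

    assignments maps individual indices (starting at 1) to cluster indices
    (starting at 1); individuals left out of the mapping get -1.
    """
    unknown = set(assignments) - set(range(1, n_individuals + 1))
    if unknown:
        raise ValueError(f"unknown individual indices: {sorted(unknown)}")
    rows = []
    for individual in range(1, n_individuals + 1):
        cluster = assignments.get(individual, -1)
        if cluster != -1 and cluster < 1:
            raise ValueError("cluster indices start at 1")
        rows.append(str(cluster))
    return rows


# Individuals 1 and 2 in cluster 1, individual 3 unassigned, 4 and 5 in cluster 2.
print("\n".join(partition_rows(5, {1: 1, 2: 1, 4: 2, 5: 2})))
```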

=== Genepop format: ===
  * Similar to those used in the clustering of individuals
  * here the populations of individuals in the data (separated by “pop”) are used to define the partition of individuals


===== How to cite =====
Tang, J., Hanage, W.P., Fraser, C. and Corander, J. (2009). Identifying currents in the gene pool for bacterial populations using an integrative approach. PLoS Computational Biology, 5(8), e1000455.
\\
Corander, J., Waldmann, P., Marttinen, P. and Sillanpää, M.J. (2004). BAPS 2: enhanced possibilities for the analysis of genetic population structure. Bioinformatics, 20, 2363-2369.