====== BAPS ======

**[[http://web.abo.fi/fak/mnf//mate/jc/software/baps.html|BAPS]]**\\
[[http://web.abo.fi/fak/mnf//mate/jc/software/BAPS5manual.pdf|manual]]\\
Version 5.4 (29.04.2010)\\
A program for Bayesian inference of the genetic structure in a population. It assigns individuals to genetic clusters by considering them either as immigrants (mixture analysis) or as descendants of immigrants (admixture analysis).

===== Program information =====

  * Windows XP/Vista/7 (32-bit, 64-bit)
  * Mac OS X Snow Leopard (64-bit)
  * Linux (32-bit)

===== Data type handled =====

  * haploid/diploid/(tetraploid)
  * DNA
  * SNP (sequence/numeric)
  * AFLP
  * Microsatellite
  * Standard (multi-allelic markers)

===== Input Files =====

==== Clustering of individuals: ====

=== BAPS format: ===

  * data matrix:
    * columns: loci at which the individuals were observed
    * rows: individuals
    * additional column at the right end of the matrix: contains on each row the index of the individual whose alleles are presented on that row. There can be more than one row per individual (e.g. in diploids)
  * alleles: indexed with any non-negative integer
  * individuals: indices start at 1 for the first individual and end at the total number of individuals
  * missing allele: negative integer (e.g. -999 or -9)
  * if the populations of the individuals are known: two additional files, one containing the names of the populations, the other containing the indices of the first individual of each sampling population

\\ **example** (cluster 5 diploid individuals. The first individual has alleles 5 and 7 at the first locus and so on.
Individuals 1, 2 and 3 were sampled in America and individuals 4 and 5 in Europe):

  * data file:
<code>
5 2 1
7 2 1
5 8 2
3 9 2
2 5 3
-999 5 3
5 -999 4
2 3 4
3 8 5
2 5 5
</code>
  * name file:
<code>
American
European
</code>
  * index file:
<code>
1
4
</code>

=== GENEPOP format: ===

  * [[GENEPOP]]
  * BAPS uses the labels of the first individuals of the populations as names for the populations

==== Clustering of groups of individuals: ====

=== BAPS format: ===

  * as above
  * the last column contains the index of the group from which the alleles on that row originate (instead of the index of the individual)
  * the names of the groups can be given in a separate file

\\ **example** (data from four distinct groups)

  * data file:
<code>
5 2 1
7 2 1
5 8 1
3 9 2
2 5 2
-999 5 3
5 -999 4
2 3 4
3 8 4
2 5 4
</code>
  * name file:
<code>
American
European
African
Asian
</code>

=== GENEPOP format: ===

  * as above

==== Trained clustering: ====

Two data files must be provided:

  * reference individuals whose origins are known
  * sampling units (individuals or groups of individuals) that you wish to cluster

  * both files must be in [[GENEPOP]] format. The individuals in one population (separated by “pop”) in the reference data file correspond to individuals from a single origin
  * in both data files all individuals should be given names. The program needs these names when it writes the output

\\ **example** (reference data from two populations (s, r). We wish to cluster three sampling units (unit 1: ind1, …).
If no relevant information is available for such pre-grouping of the data to be clustered, every individual should form its own sampling unit in the input data set):

  * reference data file:
<code>
--individuals with known origins--
loc1, loc2
pop
s1, 0307 0202
s2, 0303 0201
pop
r1, 0502 0401
r2, 0200 0404
</code>
  * file specifying the sampling units:
<code>
--sampling units--
loc1, loc2
pop
ind1, 0404 0304
pop
ind2, 0307 0202
ind3, 0303 0102
pop
ind4, 0505 0404
</code>

==== Spatial clustering: ====

Same as the first two modes above, except that coordinate values must be given in a separate file:

  * as many rows as there are individuals (spatial clustering of individuals -> sampling coordinates of each individual) or groups (spatial clustering of groups -> sampling coordinates of each group) in the molecular data set
  * missing coordinate: two consecutive zeros

\\ **example:**

  * data file: see first example
  * coordinate file:
<code>
172 88
155 96
180 78
0 0
-18 81
</code>

==== Clustering of linked molecular data (sequence data): ====

=== MLST data format: ===

  * for prokaryotic organisms
  * first column: identifier, numbered consecutively from 1 to the number of isolates (unique for each isolate)
  * second column: unique ID label for each isolate (for printing results).
The header can be either “Isolate” or “Strain”
  * third column (optional): a species or similar group name for the isolates
  * remaining columns: genes for which aligned sequences are available
  * if a header is given, the columns can be in a different order

**example:**
<code>
ST Isolate Species Adk GyrB Hsp60 Mdh Pgi RecA
1 1A1 My.Splendidone 1 1 1 1 1 1
2 1B1 A.dent 2 2 2 2 2 2
…
</code>

For each chosen gene, a corresponding [[FASTA]] file containing the aligned sequences of all included isolates is needed:

  * name for each sequence: >“Gene name”-“ST”
  * unknown bases: “?” or “-”
  * sequences within a single gene should be of equal length

**example:**
<code>
>RecA-2
CTAGGGCTTTAACCC--CATTTGCAGTACTGTCATGTCAGTGTACTATTTCAC
>RecA-2
CTAGGGCTTT-ACCCT-CATTTGCAGTACTGCCATGTCACTGTACTAATTCAC
</code>

=== BAPS data format: ===

  * haploid marker data (single data row per individual)
  * diploid marker data (two rows per individual)
  * tetraploid marker data (four rows per individual)

\\ Either a numeric input format or a direct sequence-based format can be used:

  * **numeric format:**
    * replace each of A, C, G, T with a unique integer and missing values with a negative integer (-999)
    * the individual index follows the sequence, separated by a space
    * example: a single data row for individual 110 with sequence AACCG-T could look like this: 65 65 67 67 71 -999 84 110
  * **sequence format:**
    * the individual index follows the sequence, separated by a space
    * example (diploid):
<code>
ATTTGCCTACGTAGCCAATT 1
TTACCGACCTTAAAAACCTT 1
ATTTCCCAAAGGGTTTAAAA 2
TAACCGGACATAGCCAATAA 2
</code>
  * In contrast to the MLST format, under the BAPS format the sequences from all considered genes must be concatenated into a single sequence, and the gene boundaries must be given to the program in a separate file.
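The numeric format described above can be illustrated with a short sketch. The integers in the example row (65, 67, 71, 84) happen to coincide with the ASCII codes of A, C, G and T, so the sketch assumes that mapping; the format itself only asks for a unique integer per base. The function name ''encode_row'' and the constant ''MISSING'' are made up for this illustration and are not part of BAPS.

```python
MISSING = -999  # assumption: the negative integer used for missing values


def encode_row(sequence, individual_index, missing=MISSING):
    """Build one numeric-format data row from an aligned sequence.

    Each of A, C, G, T is replaced by its ASCII code (an assumed mapping
    that matches the example row); any other symbol, such as "-" or "?",
    is written as the missing value. The individual index is appended
    after the sequence, separated by a space.
    """
    codes = []
    for base in sequence.upper():
        if base in "ACGT":
            codes.append(ord(base))  # A=65, C=67, G=71, T=84
        else:
            codes.append(missing)
    codes.append(individual_index)
    return " ".join(str(code) for code in codes)


# Reproduces the example row for individual 110 with sequence AACCG-T
print(encode_row("AACCG-T", 110))
```

For diploid or tetraploid data, the function would be called once per data row, with the same individual index on each row of that individual.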
Separate file of gene boundaries:

  * the number of rows equals the number of genes
  * on each row, the integers refer to the columns of the data matrix that correspond to the specific gene
  * additional zeros are used to pad the rows to an equal number of columns

**example** (“linkage map”: 3 genes, the first corresponding to columns 1-10 in the data matrix and so on. The additional zeros give every row the same number of columns):
<code>
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 0
20 21 22 23 24 25 26 27 0 0
</code>

  * linked molecular marker data: should be formatted as haploid marker data for the other clustering modules (see previous). For marker data, each “gene” in the previous example should be replaced by a linkage group; all other aspects of the formatting stay the same

==== Admixture of individuals based on mixture clustering ====

Binary result file of the mixture clustering

==== Admixture based on pre-defined clustering ====

=== BAPS format: ===

  * similar to the files used in the clustering of individuals
  * an additional file contains the partition of the individuals:
    * as many rows as there are individuals
    * each row contains an index (from 1 to the number of clusters) identifying the cluster to which the individual belongs
    * individuals not pre-assigned to any cluster: -1

\\ **example** (The first two individuals are assumed to form one cluster whose ID label is 1, individual 3 is not pre-assigned to either cluster and so on):

  * data/name/index file: see the example for clustering of individuals
  * partition file:
<code>
1
1
-1
2
2
</code>

=== GENEPOP format: ===

  * similar to the files used in the clustering of individuals
  * here the populations of individuals in the data (separated by “pop”) are used to define the partition of individuals

===== How to cite =====

Tang J, Hanage WP, Fraser C, Corander J. (2009). Identifying currents in the gene pool for bacterial populations using an integrative approach. PLoS Computational Biology, 5(8): e1000455.
\\ Corander J, Waldmann P, Marttinen P, Sillanpää MJ. (2004). BAPS 2: enhanced possibilities for the analysis of genetic population structure. Bioinformatics, 20: 2363-2369.