BAPS

BAPS
manual


Version 5.4 (29.04.2010)
A program for Bayesian inference of the genetic structure in a population. It assigns individuals to genetic clusters by considering them either as immigrants (mixture analysis) or as descendants of immigrants (admixture analysis).

Program information

  • Windows XP/Vista/7 (32-bit, 64-bit)
  • Mac OS X Snow Leopard (64-bit)
  • Linux (32-bit)

Data type handled

  • haploid/diploid/(tetraploid)
  • DNA
  • SNP (sequence/numeric)
  • AFLP
  • Microsatellite
  • Standard (multi-allelic markers)

Input Files

Clustering of individuals:

BAPS format:

  • data matrix:
    • columns: loci at which the individuals were observed
    • rows: individuals
    • additional column at the right end of the matrix: contains on each row the index of the individual whose alleles are given on that row. There can be more than one row per individual (e.g. two rows for diploids)
  • alleles: indexed with any non-negative integer
  • individuals: indices start at 1 for the first individual and end at the total number of individuals
  • missing allele: negative integer (e.g. -999 or -9)
  • if the sampling populations of the individuals are known: two additional files, one containing the names of the populations, the other containing the index of the first individual of each sampling population


example (clustering 5 diploid individuals: the first individual has alleles 5 and 7 at the first locus, and so on. Individuals 1, 2 and 3 were sampled in America, and individuals 4 and 5 in Europe):

  • data file:
    5         2         1
    7         2         1
    5         8         2
    3         9         2
    2         5         3
    -999      5         3
    5         -999      4
    2         3         4
    3         8         5
    2         5         5
  • name file:
    American
    European
  • Index file:
    1
    4
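
The layout above can be read with a few lines of Python. This is only a minimal sketch and not part of BAPS itself: the function names `parse_baps_rows` and `population_of` are hypothetical, and the code assumes whitespace-separated integer columns exactly as in the example.

```python
import bisect
from collections import defaultdict


def parse_baps_rows(text):
    """Group allele rows by the individual index in the last column.

    Negative integers mark missing alleles and are stored as None.
    """
    individuals = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        *alleles, index = (int(f) for f in fields)
        individuals[index].append([a if a >= 0 else None for a in alleles])
    return dict(individuals)


def population_of(individual, names, first_indices):
    """Look up the population name from the name file and the index file."""
    position = bisect.bisect_right(first_indices, individual) - 1
    return names[position]


example = """\
5         2         1
7         2         1
5         8         2
3         9         2
2         5         3
-999      5         3
5         -999      4
2         3         4
3         8         5
2         5         5
"""

data = parse_baps_rows(example)
names = ["American", "European"]   # contents of the name file
first_indices = [1, 4]             # contents of the index file

# Alleles of individual 1 at the first locus (one value per data row).
first_locus_ind1 = [row[0] for row in data[1]]
```

With the example files, `first_locus_ind1` holds the alleles 5 and 7, and `population_of(4, names, first_indices)` gives "European", in line with the description of the example.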

GENEPOP format:

  • BAPS uses the labels of the first individuals of the populations as names for the populations


Clustering of groups of individuals:

BAPS format:

  • Like above
  • The last column contains the index of the group from which the alleles on that row originate (instead of the index of the individual)
  • the names of the groups can be given in a separate file


example (data from four distinct groups)

  • data file:
    5     2     1
    7     2     1
    5     8     1
    3     9     2
    2     5     2
    -999  5     3
    5     -999  4
    2     3     4
    3     8     4
    2     5     4
  • name file:
    American
    European
    African
    Asian

GENEPOP format:

  • like above


Trained clustering:

Must provide two data files:

  • reference individuals whose origins are known
  • sampling units (individuals or groups of individuals) that you wish to cluster


  • both in GENEPOP format. Individuals in one population (separated by “pop”) in reference data file correspond to individuals from a single origin.
  • in both data files all individuals should be given names. These names will be needed by the program when the output is written


example (reference data from two populations, s and r. We wish to cluster three sampling units (unit 1: ind1, …). If there is no relevant information for pre-grouping the data to be clustered, every individual should be its own sampling unit in the input data set):

  • reference data file:
    --individuals with known origins--
    loc1, loc2
    pop
    s1, 0307 0202
    s2, 0303 0201
    pop
    r1, 0502 0401
    r2, 0200 0404
  • file specifying the sampling units:
    --sampling units--
    loc1, loc2
    pop 
    ind1, 0404 0304
    pop
    ind2, 0307 0202
    ind3, 0303 0102
    pop
    ind4, 0505 0404
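
To check how such a file splits into sampling units, the example can be parsed as follows. This is a rough sketch under stated assumptions, not BAPS code: the function name `parse_genepop` is hypothetical, and the parser only handles the layout shown here (title on the first line, comma-separated locus names on the second line, then "pop" separators followed by "name, genotypes" rows).

```python
def parse_genepop(text):
    """Split a GENEPOP-style text into title, locus names and populations.

    Each population is a list of (individual name, genotype strings).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0]
    loci = [name.strip() for name in lines[1].split(",")]
    populations = []
    for line in lines[2:]:
        if line.lower() == "pop":
            populations.append([])  # a new population / sampling unit starts
            continue
        name, genotypes = line.split(",", 1)
        populations[-1].append((name.strip(), genotypes.split()))
    return title, loci, populations


sampling_units_text = """\
--sampling units--
loc1, loc2
pop
ind1, 0404 0304
pop
ind2, 0307 0202
ind3, 0303 0102
pop
ind4, 0505 0404
"""

title, loci, populations = parse_genepop(sampling_units_text)
# Names of the individuals in each sampling unit.
units = [[name for name, _ in pop] for pop in populations]
```

Here `units` contains three sampling units, the second of which consists of ind2 and ind3.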


Spatial clustering:

Same input as for the clustering of individuals and the clustering of groups above, except that the coordinate values must be given in a separate file:

  • as many rows as there are individuals (spatial clustering of individuals → sampling coordinates of each individual) or groups (spatial clustering of groups → sampling coordinates of each group) in the molecular data set.
  • missing coordinate: two consecutive zeros


example:

  • Data file: see first example
  • Coordinate file:
    172  88
    155  96
    180  78
    0    0
    -18  81
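
A coordinate file of this kind could be read as sketched below. The function name `parse_coordinates` is hypothetical, and the sketch assumes two whitespace-separated numbers per row, with "0 0" marking a missing coordinate as described above.

```python
def parse_coordinates(text):
    """Return one (x, y) tuple per row, or None for a missing coordinate."""
    coordinates = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        x, y = (float(f) for f in fields)
        # Two consecutive zeros encode a missing coordinate.
        coordinates.append(None if x == 0 and y == 0 else (x, y))
    return coordinates


coordinate_text = """\
172  88
155  96
180  78
0    0
-18  81
"""

coordinates = parse_coordinates(coordinate_text)
missing = [i + 1 for i, c in enumerate(coordinates) if c is None]
```

For the example file, `missing` lists only individual 4. The number of entries in `coordinates` should equal the number of individuals (or groups) in the molecular data set.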


Clustering of linked molecular data (sequence data):

MLST data format:

  • for prokaryotic organisms
  • first column: a unique numeric identifier for each isolate, numbered consecutively from 1 to the number of isolates
  • second column: unique ID label for each isolate (for printing results). The header could either be “Isolate” or “Strain”
  • third column (optional): provides a species or similar group name for the isolates
  • remaining columns: genes for which there are aligned sequences available
  • if header is given: columns can be in different order

example:

ST   Isolate     Species            Adk      GyrB     Hsp60    Mdh      Pgi      RecA
1    1A1         My.Splendidone     1        1        1        1        1        1
2    1B1         A.dent             2        2        2        2        2        2
…

For each chosen gene a corresponding FASTA file containing the aligned sequences for all included isolates is needed:

  • name for each sequence: >“Gene name”-“ST”
  • unknown bases: “?” or “-”
  • sequences within a single gene should be of equal length

example:

>RecA-2
CTAGGGCTTTAACCC--CATTTGCAGTACTGTCATGTCAGTGTACTATTTCAC
>RecA-2
CTAGGGCTTT-ACCCT-CATTTGCAGTACTGCCATGTCACTGTACTAATTCAC
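
Before running an analysis it can be useful to verify the naming and equal-length requirements listed above. The following is a minimal sketch, not a BAPS routine: `read_gene_fasta` and `is_aligned` are hypothetical helper names, and the short sequences are made up for illustration only.

```python
def read_gene_fasta(text):
    """Parse FASTA text whose headers follow the >GeneName-ST convention."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Split at the last hyphen: gene name on the left, ST on the right.
            gene, st = line[1:].rsplit("-", 1)
            records.append([gene, int(st), ""])
        else:
            records[-1][2] += line  # sequences may span several lines
    return [tuple(record) for record in records]


def is_aligned(records):
    """True if all sequences of the gene have the same length."""
    return len({len(sequence) for _, _, sequence in records}) == 1


# Made-up sequences, for illustration only.
fasta_text = """\
>RecA-1
CTAG--CA
>RecA-2
CTAGT-CA
"""

records = read_gene_fasta(fasta_text)
aligned = is_aligned(records)
```

A file in which one sequence is shorter than the others makes `is_aligned` return False, which indicates that the alignment needs to be fixed first.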


BAPS data format:

  • haploid marker data (single data row per individual)
  • diploid marker data (two rows per individual)
  • tetraploid marker data (four rows per individual)


The data can be given either in a numeric input format or in a direct sequence-based format:

  • numeric format:
    • replace each of A, C, G, T with a unique integer and each missing value with a negative integer (e.g. -999)
    • the individual index follows the sequence, separated by a space
    • example: a single data row for individual 110 with sequence AACCG-T could look like this:
      65 65 67 67 71 -999 84 110
  • sequence format:
    • Individual Index after the sequence separated by a space
    • example (diploid):
      ATTTGCCTACGTAGCCAATT 1
      TTACCGACCTTAAAAACCTT 1
      ATTTCCCAAAGGGTTTAAAA 2
      TAACCGGACATAGCCAATAA 2
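
The numeric example above happens to use the ASCII codes of the bases (A = 65, C = 67, G = 71, T = 84), although any unique integers would do. A conversion from the sequence format to the numeric format could then be sketched as follows; the function name `encode_row` is hypothetical, and treating every character other than A, C, G, T as missing is an assumption of this sketch.

```python
MISSING = -999  # negative integer used for missing values


def encode_row(sequence, individual):
    """Turn a sequence into a numeric data row ending with the individual index."""
    codes = [ord(base) if base in "ACGT" else MISSING
             for base in sequence.upper()]
    return " ".join(str(value) for value in codes + [individual])


row = encode_row("AACCG-T", 110)
```

For the sequence AACCG-T of individual 110 this reproduces the data row shown in the numeric format example.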


  • In contrast to the MLST format, the BAPS format requires you to concatenate the sequences from all considered genes into a single sequence and to give the gene boundaries in a separate file. Separate file of gene boundaries:
    • the number of rows equals the number of genes
    • on each row, the integers refer to the columns of the data matrix that correspond to that gene
    • additional zeros are used to pad the rows to an equal number of columns

example (“linkage map” for 3 genes, the first corresponding to columns 1-10 of the data matrix, and so on. The additional zeros give every row of the matrix the same number of columns):

1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 0
20 21 22 23 24 25 26 27 0 0
  • linked molecular marker data: should be formatted as haploid marker data for the other clustering modules (see previous). For marker data each “gene” in the previous example should be replaced by a linkage group and the other aspects of the formatting are kept equal
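
The gene boundary file ("linkage map") described above can be generated from the lengths of the concatenated genes or linkage groups. This is a small sketch under the assumption that the genes occupy consecutive columns in the order given; the function name `linkage_map` is hypothetical.

```python
def linkage_map(gene_lengths):
    """Build zero-padded rows of column indices, one row per gene."""
    width = max(gene_lengths)
    rows = []
    start = 1  # column indices of the data matrix start at 1
    for length in gene_lengths:
        columns = list(range(start, start + length))
        rows.append(columns + [0] * (width - length))  # pad with zeros
        start += length
    return rows


rows = linkage_map([10, 9, 8])
lines = [" ".join(str(value) for value in row) for row in rows]
```

For gene lengths 10, 9 and 8 the resulting `lines` match the three rows of the linkage map example.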


Admixture of individuals based on mixture clustering

Binary result file of mixture clustering

Admixture based on pre-defined clustering

BAPS format:

  • Similar to those used in the clustering of individuals
  • additional file contains the partition of the individuals:
    • same numbers of rows as there are individuals
    • each row contains an index (from 1 to the number of clusters) identifying the cluster to which the individual belongs
    • individuals not pre-assigned to any cluster: -1


example (the first two individuals are assumed to form one cluster whose ID label is 1, individual 3 is not pre-assigned to any cluster, and so on):

  • Data/name/index file: see example clustering of individuals
  • Partition file:
    1
    1
    -1
    2
    2
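
A partition file can be checked against the number of individuals before it is used. The sketch below is illustrative only: `read_partition` is a hypothetical name, and the checks (one row per individual, each label either -1 or a positive integer) are this sketch's reading of the rules listed above.

```python
def read_partition(text, n_individuals):
    """Return (clusters, unassigned) from the rows of a partition file.

    clusters maps a cluster index to the list of individual indices in it;
    unassigned lists the individuals marked with -1.
    """
    labels = [int(value) for value in text.split()]
    if len(labels) != n_individuals:
        raise ValueError("partition file needs one row per individual")
    clusters, unassigned = {}, []
    for individual, label in enumerate(labels, start=1):
        if label == -1:
            unassigned.append(individual)
        elif label >= 1:
            clusters.setdefault(label, []).append(individual)
        else:
            raise ValueError(f"invalid cluster index {label}")
    return clusters, unassigned


partition_text = "1\n1\n-1\n2\n2\n"
clusters, unassigned = read_partition(partition_text, n_individuals=5)
```

For the example partition, individuals 1 and 2 end up in cluster 1, individuals 4 and 5 in cluster 2, and individual 3 is left unassigned.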

Genepop format:

  • Similar to those used in the clustering of individuals
  • here the populations of individuals in the data (separated by “pop”) are used to define the partition of individuals

How to cite

Tang J, Hanage WP, Fraser C, Corander J. (2009). Identifying currents in the gene pool for bacterial populations using an integrative approach. PLoS Computational Biology, 5(8): e1000455.
Corander, J., Waldmann, P., Marttinen, P. and Sillanpää, M.J. (2004). BAPS 2: enhanced possibilities for the analysis of genetic population structure, Bioinformatics, 20, 2363-2369.

baps.txt · Last modified: 2013/02/20 13:24 by heidi