population genetics data format
I will investigate whether a new population genetics data format can be developed to facilitate the transfer of data among population genetics software packages.
data formats
file formats
database formats
software
converter:
new data format
- in XML
- root element: <PGD>
1. version:
| data type | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| block | subtags | standard | DNA | SNP | Microsatellite | RFLP (AFLP) | Allele frequency | ||
| header (attribute: title) | number_samples | x | x | x | x | x | x | ||
| data_type | x | x | x | x | x | x | |||
| number_loci | number | x | x | x | x | x | |||
| loci (attribute: name) | locations | o | o | o | o | o | |||
| length | o | o | o | o | o | ||||
| missing | o | o | o | o | o | ||||
| gap | o | o | o | o | o | ||||
| gametic_phase | o | o | o | o | o | ||||
| recessive_alleles | o | o | o | o | o | ||||
| sample (attribute: name) | size | x | x | x | x | x | x | ||
| id (attribute: name) | frequency | x | x | x | x | x | x | ||
| loci (attribute: name) (when unaligned) | position | o | o | o | |||||
| genotype/haplotype (a) | x(a) | x(a) | x(a) | ||||||
| genotype/haplotype (a) (attribute: name) (when aligned) | x | x(a) | x(a) | x | x(a) | ||||
| structure (attribute: name) | number_groups | o | o | o | o | o | |||
| group (attribute: name) (sample name, sample name) | o | o | o | o | o | ||||
| distance_matrix (attribute: name) | size | o | o | o | o | o | |||
| labels | o | o | o | o | o | ||||
| matrix | o | o | o | o | o | ||||
x: obligatory
o: optional
a: alternative
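As a rough illustration of the first version, the sketch below builds a small document with the `<PGD>` root element using Python's standard library. The tag names are taken from the table above, but the nesting, the example values, and the use of a plain `haplotype` tag (the table writes "genotype/haplotype", which is not a valid XML name) are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

def build_pgd_v1():
    """Build a minimal version-1 style document (nesting is assumed)."""
    root = ET.Element("PGD")

    # header block with its obligatory subtags
    header = ET.SubElement(root, "header", title="example data set")
    ET.SubElement(header, "number_samples").text = "1"
    ET.SubElement(header, "data_type").text = "DNA"

    # one sample holding two haplotypes (hypothetical example data)
    sample = ET.SubElement(root, "sample", name="pop1")
    ET.SubElement(sample, "size").text = "2"
    for name, freq, seq in [("h1", 1, "ACGT"), ("h2", 1, "ACGA")]:
        entry = ET.SubElement(sample, "id", name=name)
        ET.SubElement(entry, "frequency").text = str(freq)
        ET.SubElement(entry, "haplotype").text = seq
    return root

root = build_pgd_v1()
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Parsing `xml_text` back with `ET.fromstring` gives the same tree, which is a quick way to check that a generated file is well-formed.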
2. version:
| data type | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| block | subtags | standard | DNA | SNP | Microsatellite | RFLP (AFLP) | Allele frequency | ||
| header (attribute: title) | number_population | x | x | x | x | x | x | ||
| data_type | x | x | x | x | x | x | |||
| number_loci* | number | x | x | x | x | x | |||
| loci (attribute: name) | locations | o | o | o | o | o | |||
| length | o | o | o | o | o | ||||
| missing | o | o | o | o | o | ||||
| gap | o | o | o | o | o | ||||
| gametic_phase | o | o | o | o | o | ||||
| recessive_alleles | o | o | o | o | o | ||||
| population (attribute: name) | size | x | x | x | x | x | x | ||
| geographic_coord | o | o | o | o | o | o | |||
| id (attribute: name) (a3) | frequency | x | x | x | x | x | x | ||
| loci (attribute: name) (when unaligned) | location | o | o | o | |||||
| begin | o | o | o | ||||||
| length | o | o | o | ||||||
| genotype/haplotype (a1) | x (a1) | x (a1) | x (a1) | ||||||
| genotype/haplotype (a1) (attribute: name) (when aligned) | x | x (a1) | x (a1) | x | x (a1) | ||||
| ind (attribute: name) (a3) | geographic_coord | o | o | o | o | o | o | ||
| loci (attribute: name) (when unaligned) | location | o | o | o | |||||
| begin | o | o | o | ||||||
| length | o | o | o | ||||||
| genotype/haplotype (a2) | x (a2) | x (a2) | x (a2) | ||||||
| genotype/haplotype (a2) (attribute: name) (when aligned) | x | x (a2) | x (a2) | x | x (a2) | ||||
| structure (attribute: name) | number_groups | o | o | o | o | o | |||
| group (attribute: name) (sample name, sample name) | o | o | o | o | o | ||||
| distance_matrix (attribute: name) | size | o | o | o | o | o | |||
| labels | o | o | o | o | o | ||||
| matrix | o | o | o | o | o | ||||
* when the same for all populations
x: obligatory
o: optional
a: alternative to
3. version:
| data type | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| block | subtags | standard | DNA | SNP | Microsatellite | RFLP (AFLP) | Allele frequency | ||
| header (attribute: title) | number_populations | x | x | x | x | x | x | ||
| data_type | x | x | x | x | x | x | |||
| number_loci * | number | x | x | x | x | x | |||
| loci (attribute: name) | location | o | o | o | o | o | |||
| length | o | o | o | o | o | ||||
| gene_copies * | x | x (a1) | x (a1) | x | x (a1) | ||||
| aligned | x | x | x | x | x | ||||
| genotypic_data | x | x | x | x | x | ||||
| missing | x | x | x | x | x | ||||
| gap | x | x | x | x | x | ||||
| gametic_phase | o | o | o | o | o | ||||
| recessive_alleles | o | o | o | o | o | ||||
| population (attribute: name) | size | x | x | x | x | x | x | ||
| geographic_coord | o | o | o | o | o | o | |||
| linguistic_group | o | o | o | o | o | o | |||
| id (attribute: name) (a3) | frequency | x | x | x | x | x | x | ||
| genotype/haplotype (attribute: name) | x | x | x | x | x | ||||
| ind (attribute: name) (a3) | geographic_coord | o | o | o | o | o | |||
| linguistic_group | o | o | o | o | o | ||||
| loci (attribute: name) (when unaligned) | location | o | o | o | |||||
| begin | o | o | o | ||||||
| length | o | o | o | ||||||
| gene_copies | x (a1) | x (a1) | x (a1) | ||||||
| genotype/haplotype (a2) | x (a2) | x (a2) | x (a2) | ||||||
| genotype/haplotype (a2) (attribute: name) (when aligned) | x | x (a2) | x (a2) | x | x (a2) | ||||
| structure (attribute: name) | number_groups | o | o | o | o | o | o | ||
| group (attribute: name) (sample name, sample name, …) | o | o | o | o | o | o | |||
| distance_matrix (attribute: name) | size | o | o | o | o | o | o | ||
| labels | o | o | o | o | o | o | |||
| matrix (number (line break) number, number (line break)…) | o | o | o | o | o | o | |||
* when the same for all populations
x: obligatory
o: optional
a: alternative to
4. version:
| data type | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| block | subtags | standard | DNA | SNP | Microsatellite | RFLP (AFLP) | Allele frequency | ||
| header (attribute: title) | number_populations | x | x | x | x | x | x | ||
| data_type | x | x | x | x | x | x | |||
| number_loci * | number | x | x | x | x | x | |||
| loci (attribute: name) | location | o | o | o | o | o | |||
| length | o | o | o | o | o | ||||
| number_strains * | x | x (a1) | x (a1) | x | x (a1) | ||||
| aligned | x | x | x | x | x | ||||
| genotypic_data | x | x | x | x | x | ||||
| missing | x | x | x | x | x | ||||
| gap | x | x | x | x | x | ||||
| gametic_phase | o | o | o | o | o | ||||
| recessive_alleles | o | o | o | o | o | ||||
| population (attribute: name) aligned (a2) | size | x | x | x | x | x | x | ||
| geographic_coord * | o (a4) | o (a4) | o (a4) | o (a4) | o (a4) | o | |||
| linguistic_group * | o (a5) | o (a5) | o (a5) | o (a5) | o (a5) | o | |||
| number_strains * | o (a1) | o (a1) | o (a1) | o (a1) | o (a1) | ||||
| id (attribute: name) (a3) | frequency | x | x | x | x | x | x | ||
| genotype/haplotype (attribute: name) (loci, loci,…) | x | x | x | x | x | ||||
| ind (attribute: name) (a3) | geographic_coord | o (a4) | o (a4) | o (a4) | o (a4) | o | |||
| linguistic_group | o (a5) | o (a5) | o (a5) | o (a5) | o (a5) | ||||
| number_strains | x (a1) | x (a1) | x (a1) | x (a1) | x (a1) | ||||
| genotype/haplotype (attribute: name) (loci, loci,…) | x | x | x | x | x | ||||
| population (attribute: name) unaligned (a2) | size | x | x | x | |||||
| geographic_coord * | o (a6) | o (a6) | o (a6) | ||||||
| linguistic_group * | o (a7) | o (a7) | o (a7) | ||||||
| loci_location (attribute: name) * | o (a8) | o (a8) | o (a8) | ||||||
| number_strains * | o (a9) | o (a9) | o (a9) | ||||||
| ind (attribute: name) | geographic_coord | o (a6)*3 | o (a6)*3 | o (a6)*3 | |||||
| linguistic_group | o (a7)*3 | o (a7)*3 | o (a7)*3 | ||||||
| loci_location (attribute: name)*2 | o (a8)*3 | o (a8)*3 | o (a8)*3 | ||||||
| number_strains | x (a9) | x (a9) | x (a9) | ||||||
| strain (attribute: name) | begin | x | x | x | |||||
| length | x | x | x | ||||||
| genotype/haplotype | x | x | x | ||||||
| structure (attribute: name) (o) | number_groups | x | x | x | x | x | x | ||
| group (attribute: name) (sample name, sample name, …) | x | x | x | x | x | x | |||
| distance_matrix (attribute: name) (o) | size | x | x | x | x | x | x | ||
| labels | x | x | x | x | x | x | |||
| matrix (number (line break) number, number (line break)…) | x | x | x | x | x | x | |||
* when the same for all populations/individuals
*2 if it exists –> align DNA over one individual; if it does not exist –> align DNA over one population
*3 if defined for one individual, it has to be defined for all individuals
x: obligatory
o: optional
a: alternative to
- stylesheet: stylesheet_data-format.pdf
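The *3 rule (an optional subtag defined for one individual has to be defined for all of them) is easy to check mechanically. The following is a minimal sketch, assuming that `ind` elements are direct children of a `population` element and that coordinates are written as plain text; the helper name `defined_for_all_or_none` and the example values are hypothetical.

```python
import xml.etree.ElementTree as ET

def defined_for_all_or_none(population, subtag):
    """Return True if `subtag` is present in every <ind> or in none."""
    flags = [ind.find(subtag) is not None
             for ind in population.findall("ind")]
    return all(flags) or not any(flags)

# Both individuals carry coordinates: the rule is satisfied.
consistent = ET.fromstring(
    '<population name="pop1">'
    '<ind name="a"><geographic_coord>46.9 7.4</geographic_coord></ind>'
    '<ind name="b"><geographic_coord>47.4 8.5</geographic_coord></ind>'
    '</population>'
)

# Only one individual carries coordinates: the rule is violated.
inconsistent = ET.fromstring(
    '<population name="pop2">'
    '<ind name="a"><geographic_coord>46.9 7.4</geographic_coord></ind>'
    '<ind name="b"></ind>'
    '</population>'
)

print(defined_for_all_or_none(consistent, "geographic_coord"))
print(defined_for_all_or_none(inconsistent, "geographic_coord"))
```

The same check could be applied to `linguistic_group` or any other subtag marked with *3.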
5. version:
| data type | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| block | subtags | standard | DNA | SNP | Microsatellite | RFLP (AFLP) | Allele frequency | ||
| header (attribute: title) | nb_pop | x | x | x | x | x | x | ||
| data_type | x | x | x | x | x | x | |||
| loci (o) * | nb | x | x | x | x | x | |||
| locus (attribute: id) | location | x | x | x | x | x | |||
| length | o | o | o | o | o | ||||
| links | o | o | o | o | o | ||||
| nb_reads * | x (a1) | x (a1) | x (a1) | x (a1) | x (a1) | ||||
| aligned | x | x | x | x | x | ||||
| geno_data | x | x | x | x | x | ||||
| missing | x | x | x | x | x | ||||
| gap | x | x | x | x | x | ||||
| gametic_phase | o | o | o | o | o | ||||
| recessive_alleles | o | o | o | o | o | ||||
| loci (o) | nb | x | x | x | x | x | |||
| locus (attribute: id) | location | x | x | x | x | x | |||
| length | o | o | o | o | o | ||||
| links | o | o | o | o | o | ||||
| population (attribute: name) aligned, same data type (a2) | size | x | x | x | x | x | x | ||
| geo_coord * | o (a4) | o (a4) | o (a4) | o (a4) | o (a4) | o | |||
| linguistic_group * | o (a5) | o (a5) | o (a5) | o (a5) | o (a5) | o | |||
| nb_reads * | x (a1) | x (a1) | x (a1) | x (a1) | x (a1) | ||||
| loci (locus, locus,…) * | x | x | x | x | x | ||||
| id (attribute: name) (a3) | freq | x | x | x | x | x | x | ||
| geno/hap (attribute: name) (locus, locus,…) | x | x | x | x | x | ||||
| ind (attribute: name) (a3) | geo_coord | o (a4) | o (a4) | o (a4) | o (a4) | o | |||
| linguistic_group | o (a5) | o (a5) | o (a5) | o (a5) | o (a5) | ||||
| nb_reads | x (a1) | x (a1) | x (a1) | x (a1) | x (a1) | ||||
| geno/hap (attribute: name) (locus, locus,…) | x | x | x | x | x | ||||
| population (attribute: name) unaligned (a2) | size | x | x | x | |||||
| geo_coord * | o (a6) | o (a6) | o (a6) | ||||||
| linguistic_group * | o (a7) | o (a7) | o (a7) | ||||||
| locus (attribute: id)* | o (a8) | o (a8) | o (a8) | ||||||
| nb_reads * | o (a9) | o (a9) | o (a9) | ||||||
| ind (attribute: name) | geo_coord | o (a6)*3 | o (a6)*3 | o (a6)*3 | |||||
| linguistic_group | o (a7)*3 | o (a7)*3 | o (a7)*3 | ||||||
| locus (attribute: id)*2 | o (a8)*3 | o (a8)*3 | o (a8)*3 | ||||||
| nb_reads | x (a9) | x (a9) | x (a9) | ||||||
| read (attribute: name) | start | x | x | x | |||||
| length | x | x | x | ||||||
| geno/hap | x | x | x | ||||||
| structure (attribute: name) (o) | number_groups | x | x | x | x | x | x | ||
| group (attribute: name) (sample name, sample name, …) | x | x | x | x | x | x | |||
| distance_matrix (attribute: name) (o) | size | x | x | x | x | x | x | ||
| labels | x | x | x | x | x | x | |||
| matrix (number (line break) number, number (line break)…) | x | x | x | x | x | x | |||
* when the same for all populations/individuals
*2 if it exists –> align DNA over one individual; if it does not exist –> align DNA over one population
*3 if defined for one individual, it has to be defined for all individuals
x: obligatory
o: optional
a: alternative to
6. version:
| data type | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| block | subtags | standard | DNA | SNP | Microsatellite | RFLP (AFLP) | Allele frequency | ||
| header (attribute: title) | organism | x | x | x | x | x | x | ||
| numPop | x | x | x | x | x | x | |||
| numReads * | x | x | x | x | x | ||||
| aligned | x | x | x | x | x | ||||
| missing | x | x | x | x | x | ||||
| gap | x | x | x | x | x | ||||
| gameticPhase | o | o | o | o | o | ||||
| recessiveData | o | o | o | o | o | ||||
| loci (o) | lociNum | x | x | x | x | x | |||
| lociDataType * | x (a10) | x (a10) | x (a10) | x (a10) | x (a10) | x | |||
| locus (attribute: id) | locusDataType | a10 | a10 | a10 | a10 | a10 | |||
| locusChromosom | o | o | o | o | o | ||||
| locusLocation | o | o | o | o | o | ||||
| locusGenic | o | o | o | o | o | ||||
| locusLength | o | o | o | o | o | ||||
| locusLinks | o | o | o | o | o | ||||
| locusComments | o | o | o | o | o | ||||
| population (attribute: name) aligned, same data type (a2) | popSize | x | x | x | x | x | x | ||
| popGeogCoord * | o (a4) | o (a4) | o (a4) | o (a4) | o (a4) | o | |||
| popLingGroup * | o (a5) | o (a5) | o (a5) | o (a5) | o (a5) | o | |||
| popNumReads * | x (a1) | x (a1) | x (a1) | x (a1) | x (a1) | ||||
| popLoci (locus, locus,…) * | x | x | x | x | x | ||||
| ind (attribute: name) (a3) | indGeogCoord | o (a4) | o (a4) | o (a4) | o (a4) | o | |||
| indLingGroup | o (a5) | o (a5) | o (a5) | o (a5) | o (a5) | ||||
| indNumReads | a1 | a1 | a1 | a1 | a1 | ||||
| indFreq (absolute Freq) | o | o | o | o | o | x | |||
| data (locus, locus,…) | x | x | x | x | x | ||||
| population (attribute: name) aligned, diff data type (a2) | popSize | x | x | x | x | x | |||
| popGeogCoord * | o (a4) | o (a4) | o (a4) | o (a4) | o (a4) | ||||
| popLingGroup * | o (a5) | o (a5) | o (a5) | o (a5) | o (a5) | ||||
| popNumReads * | x (a1) | x (a1) | x (a1) | x (a1) | x (a1) | ||||
| ind (attribute: name) (a3) | indGeogCoord | o (a4) | o (a4) | o (a4) | o (a4) | o | |||
| indLingGroup | o (a5) | o (a5) | o (a5) | o (a5) | o (a5) | ||||
| indLoci (locus, locus,…) | x | x | x | x | x | ||||
| indNumReads | a1 | a1 | a1 | a1 | a1 | ||||
| indFreq (absolute Freq) | o | o | o | o | o | x | |||
| data (locus, locus,…) | x | x | x | x | x | ||||
| population (attribute: name) unaligned (a2) | popSize | x | |||||||
| popGeogCoord * | o (a6) | ||||||||
| popLingGroup * | o (a7) | ||||||||
| popNumReads * | x (a9) | ||||||||
| popLocus * | o (a8) | ||||||||
| ind (attribute: name) | indGeogCoord | o (a6)*3 | |||||||
| indLingGroup | o (a7)*3 | ||||||||
| indLocus *2 | o (a8)*3 | ||||||||
| indFreq (absolute Freq) | o | ||||||||
| indNumReads | a9 | ||||||||
| read | start | x (if unalig.) | |||||||
| length | o | ||||||||
| data | x | ||||||||
| structure (attribute: name) (o) | numGroups | x | x | x | x | x | x | ||
| group (attribute: name) (sample name, sample name, …)*4 | x | x | x | x | x | x | |||
| distanceMatrix (attribute: name) (o) | matrixSize | x | x | x | x | x | x | ||
| matrixLabels (name, name,…) | x | x | x | x | x | x | |||
| matrix (number (line break) number, number (line break)…) | x | x | x | x | x | x | |||
* when the same for all populations/individuals
*2 if it exists –> align DNA over one individual; if it does not exist –> align DNA over one population
*3 if defined for one individual, it has to be defined for all individuals
*4 label/sample name within “” if it consists of more than one word
x: obligatory
o: optional
a: alternative to
- stylesheet: stylesheet_data-format3.pdf
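To show how the text content of `matrixLabels` and `matrix` from the sixth version might be read, here is a minimal Python sketch. It assumes that the matrix is stored as a lower triangle including the diagonal (row i holds i comma-separated numbers, one row per line) and that multi-word labels are wrapped in plain double quotes as described in *4; both function names are hypothetical.

```python
import csv
import io

def parse_labels(text):
    """Split 'name, name, ...' while keeping quoted multi-word labels intact."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [label.strip() for label in next(reader)]

def parse_lower_triangle(text):
    """Expand a lower-triangular matrix (one row per line) into a full symmetric one."""
    rows = [[float(value) for value in line.split(",")]
            for line in text.strip().splitlines() if line.strip()]
    size = len(rows)
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i + 1} should hold {i + 1} values, found {len(row)}")
    full = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            full[i][j] = value
            full[j][i] = value  # mirror into the upper triangle
    return full

labels = parse_labels('pop1, "North Group", pop3')
matrix = parse_lower_triangle("0\n0.1, 0\n0.4, 0.3, 0")
print(labels)
print(matrix)
```

A reader could additionally compare `len(matrix)` and `len(labels)` against `matrixSize` to catch inconsistent files early.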
