PGD - Population Genetics Data format

Version 1.1

PGD (Population Genetics Data) is a file format designed to contain population genetics data. The aim of this format is to facilitate data transfer between population genetics software packages. PGD plays an important role in the new data format converter PGDSpider.

PGD is written in XML and is therefore independent of any particular computer system and extensible for future needs. The XML structure can easily be processed by computer programs. An additional XSLT style sheet makes it possible to display the data in a readable, well-organized way. This XSLT style sheet is included in the PGDSpider download.

The PGDSpider distribution also includes an XML Schema (PGD_schema.xsd), which defines the structure of the PGD file. The purpose of an XML Schema is to define the legal building blocks of an XML document and the allowable contents (W3Schools, 2008). The provided XML Schema can be used to validate a PGD file.

Data types handled

PGD is able to handle the following data types:


PGD format description

The PGD format is written in XML (eXtensible Markup Language) and can be created and edited in any text editor (file extension *.xml). An XML document has an ordered, labelled tree structure with the following rules:


The PGD file format has a block structure, and the information is saved hierarchically. The format is therefore very modular: general information can be saved at a higher level than information specific to one individual. This is convenient because general information needs to be written only once.
A short description of the different blocks can be found below:


Root Element:

The root element named “PGD” encapsulates all other elements of the XML file.
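As a quick illustration of the root element rule, the following minimal Python sketch parses a document with the standard library and checks that its root tag is PGD. The function name load_pgd_root and the heavily reduced sample document are hypothetical and serve only this example. The check is not a substitute for validating a file against the provided XML Schema.

```python
import xml.etree.ElementTree as ET


def load_pgd_root(xml_text):
    """Parse XML text and return the root element if it is named PGD."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    if root.tag != "PGD":
        raise ValueError(f"expected root element 'PGD', found '{root.tag}'")
    return root


# Hypothetical, heavily reduced document used only for illustration.
sample = '<PGD><header title="demo"/></PGD>'
root = load_pgd_root(sample)

# List the blocks found directly below the root element.
child_tags = [child.tag for child in root]
print(child_tags)
```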


Header block:

The header block contains general information about the data. The tag is named “header” and can contain an attribute named “title=” that defines the title of the data. The header block has the following sub tags: organism, numPop, ploidy, missing, gap, gameticPhase and recessiveData.


dataDescription block:

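The dataDescription block contains specifications about the different loci. The tag is named “dataDescription” and contains the following sub tags: numLoci, dataType and locus.

The <locus> tag has the following sub tags: locusDataType, locusChromosome, locusLocation, locusGenic, locusLength, locusAncestralState, locusLinks and locusComments.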


Population block:

The population block contains information about a population and its individuals, together with their data. This block can be repeated as many times as there are populations in the sample. Its structure differs depending on whether the data are aligned and whether all data are of the same data type. The tag is named “population” and can contain an attribute named “name=” which defines the name of the population. The population block has the following sub tags: popSize, popGeogCoord, popLingGroup, popPloidy, popLoci and ind.

The <ind> tag has the following sub tags: indGeogCoord, indLingGroup, indLoci, indPloidy, indFreq and data.

The <read> tag (NGS data) has the following sub tags: start, length, data and quality.
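The population block could be read along the following lines. This is a minimal sketch, not the PGDSpider implementation. It assumes that <ind> elements are direct children of <population> and that the <data> tag holds comma-separated locus entries, as the "(locus data, locus data, …)" notation in the schema table suggests. The function name read_populations, the sample document and its genotype values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag and attribute names follow the schema table;
# the nesting of <ind> inside <population> and all values are assumptions.
sample = """
<PGD>
  <population name="pop1">
    <popSize>2</popSize>
    <ind name="ind1"><data>A/A, C/T</data></ind>
    <ind name="ind2"><data>A/G, T/T</data></ind>
  </population>
</PGD>
"""


def read_populations(root):
    """Return {population name: {individual name: [locus data, ...]}}."""
    result = {}
    for pop in root.findall("population"):
        individuals = {}
        for ind in pop.findall("ind"):
            raw = ind.findtext("data", default="")
            # Assumed convention: locus entries are separated by commas.
            loci = [value.strip() for value in raw.split(",") if value.strip()]
            individuals[ind.get("name")] = loci
        result[pop.get("name")] = individuals
    return result


populations = read_populations(ET.fromstring(sample))
print(populations)
```

Returning plain dictionaries keeps the sketch independent of any particular downstream package. NGS data stored in <read> tags is not handled here.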


Structure block:

The structure block is optional. It contains information about the structure of the population (grouping). The tag is named “structure” and can contain an attribute named “name=” which defines the name of the structure. The structure block has the following sub tags: numGroups and group.
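A grouping of this kind could be read as sketched below. The sketch assumes that each <group> element lists its population names as comma-separated text, following the "(pop name, pop name, …)" notation in the schema table. The function name read_structure and the sample names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; group and population names are invented.
sample = """
<PGD>
  <structure name="regions">
    <numGroups>2</numGroups>
    <group name="north">pop1, pop2</group>
    <group name="south">pop3</group>
  </structure>
</PGD>
"""


def read_structure(root):
    """Return {group name: [population name, ...]}, or {} if no structure block."""
    structure = root.find("structure")
    if structure is None:  # the structure block is optional
        return {}
    groups = {}
    for group in structure.findall("group"):
        # Assumed convention: population names are separated by commas.
        names = [n.strip() for n in (group.text or "").split(",") if n.strip()]
        groups[group.get("name")] = names
    declared = structure.findtext("numGroups")
    if declared is not None and int(declared) != len(groups):
        raise ValueError("numGroups does not match the number of group tags")
    return groups


groups = read_structure(ET.fromstring(sample))
print(groups)
```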


DistanceMatrix block:

The distanceMatrix block is optional. It contains information about the genetic distances between the individuals. The tag is named “distanceMatrix” and can contain an attribute named “name=” which defines the name of the distance matrix. The distanceMatrix block has the following sub tags: matrixSize, matrixLabels and matrix.
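The following sketch shows one way to read such a block. It assumes the layout hinted at in the schema table: labels as comma-separated text, and matrix rows separated by line breaks with comma-separated numbers in each row. Whether the diagonal is stored is not stated in this description, so the sketch returns the rows as written and does not interpret them further. The function name read_distance_matrix and all sample values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; labels and distances are invented, and storing the
# diagonal zeros is an assumption made only for this example.
sample = """
<PGD>
  <distanceMatrix name="demo distances">
    <matrixSize>3</matrixSize>
    <matrixLabels>ind1, ind2, ind3</matrixLabels>
    <matrix>
0
0.12, 0
0.30, 0.25, 0
    </matrix>
  </distanceMatrix>
</PGD>
"""


def read_distance_matrix(root):
    """Return (labels, rows) or None if there is no distanceMatrix block."""
    block = root.find("distanceMatrix")
    if block is None:  # the distanceMatrix block is optional
        return None
    label_text = block.findtext("matrixLabels", default="")
    labels = [s.strip() for s in label_text.split(",") if s.strip()]
    rows = []
    for line in block.findtext("matrix", default="").splitlines():
        cells = [c.strip() for c in line.split(",") if c.strip()]
        if cells:  # skip blank lines around the matrix text
            rows.append([float(c) for c in cells])
    size = block.findtext("matrixSize")
    if size is not None and int(size) != len(labels):
        raise ValueError("matrixSize does not match the number of labels")
    return labels, rows


labels, rows = read_distance_matrix(ET.fromstring(sample))
print(labels)
print(rows)
```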


Schema of the PGD format

Specifications



Schema

Data type columns, left to right: NGS, Microsat, RFLP, SNP, AFLP, standard | NGS | Allele frequency
Each block is listed with its sub tags indented below it, in the form: sub tag | marks

header (attribute: title)
    organism | o o o
    numPop | x x x
    ploidy * (–> mixed/1/2/…) | x (a4) x (a4)
    missing | x x
    gap | o o
    gameticPhase (–> known/unknown) | o o
    recessiveData (–> yes/no) | o o

dataDescription (o)
    numLoci | x x
    dataType * (–> mixed/DNA/NGS/Microsat/RFLP/AFLP/Standard/Frequency/…) | x (a1) x (a1) x

locus (attribute: id)
    locusDataType (–> DNA/NGS/Microsat/RFLP/AFLP/Standard/Frequency/…) | a1 a1
    locusChromosome (–> number/X/Y/W/Z/mtDNA/…) | o o
    locusLocation | o o
    locusGenic (–> coding/noncoding) | o o
    locusLength | o o
    locusAncestralState | o o
    locusLinks (–> URL) | o o
    locusComments | o o

population (attribute: name)
    popSize | x x x
    popGeogCoord * (lon, lat) | o (a2) o (a2) o
    popLingGroup * | o (a3) o (a3) o
    popPloidy * (–> mixed/1/2/…) | a4 a4
    popLoci *3 (locus name, locus name, …) –> all loci of the same data type | o o

ind (attribute: name)
    indGeogCoord (lon, lat) | o (a2) o (a2)
    indLingGroup | o (a3) o (a3)
    indLoci *4 (locus name, locus name, …) –> all loci of the same data type | o o
    indPloidy (–> 1/2/…) | a4 a4
    indFreq (absolute Freq) | o o x
    data *5 (locus data, locus data, …) | x x

read *6 (attribute: id)
    start *6 | x
    length *6 | o
    data *6 | x
    quality *6 | o

structure (attribute: name) (o)
    numGroups | x x x
    group (attribute: name) (pop name, pop name, …) | x x x

distanceMatrix (attribute: name) (o)
    matrixSize | x x x
    matrixLabels (name, name, …) | x x x
    matrix (number (line break) number, number (line break) …) | x x x
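To make the entries of the table above more concrete, here is a minimal Python sketch that builds a small PGD document. It is an illustration under stated assumptions, not a reference implementation. It assumes that header, dataDescription and population are direct children of the root, that <ind> sits inside <population>, and that numPop, ploidy, missing, numLoci, dataType, popSize and data are the entries to fill in for non-NGS data. The helper names, the missing-data symbol "?" and the genotype strings are invented. A file produced this way should still be validated against the provided XML Schema.

```python
import xml.etree.ElementTree as ET


def add_text_child(parent, tag, text):
    """Append a child element holding the given value as text."""
    child = ET.SubElement(parent, tag)
    child.text = str(text)
    return child


def build_minimal_pgd(title, populations, missing="?", ploidy=2,
                      data_type="Microsat"):
    """Build a PGD tree from {pop name: {ind name: [locus data, ...]}}."""
    root = ET.Element("PGD")

    header = ET.SubElement(root, "header", {"title": title})
    add_text_child(header, "numPop", len(populations))
    add_text_child(header, "ploidy", ploidy)
    add_text_child(header, "missing", missing)  # "?" is an invented symbol

    # Number of loci taken from the longest individual record.
    num_loci = max((len(loci) for inds in populations.values()
                    for loci in inds.values()), default=0)
    description = ET.SubElement(root, "dataDescription")
    add_text_child(description, "numLoci", num_loci)
    add_text_child(description, "dataType", data_type)

    for pop_name, individuals in populations.items():
        pop = ET.SubElement(root, "population", {"name": pop_name})
        add_text_child(pop, "popSize", len(individuals))
        for ind_name, loci in individuals.items():
            ind = ET.SubElement(pop, "ind", {"name": ind_name})
            # Assumed convention: locus entries joined by commas.
            add_text_child(ind, "data", ", ".join(loci))
    return root


# Invented example data: one population, one individual, two loci.
doc = build_minimal_pgd("demo", {"pop1": {"ind1": ["120/124", "88/88"]}})
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```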

Legend

* if for all populations/individuals the same
*2 if it exists –> align DNA over one individual; if it does not exist –> align DNA over one population
*3 data of the same data type (loci) in all individuals
*4 data of different data types (aligned within each locus)
*5 non-NGS data
*6 NGS data (Next Generation Sequencing)

x: obligatory
o: optional
a: alternative to


PGD file examples:


PGD style sheet

The style sheet allows you to display PGD files in a more readable form:


old versions