PGD - Population Genetics Data format

Version 1.1

PGD (Population Genetics Data) is a file format designed to contain population genetics data. The aim of this format is to facilitate data transfer among population genetics software packages. PGD plays an important role in the new data format converter PGDSpider.

PGD is written in XML and is therefore independent of any particular computer system and extensible for future needs. The XML structure can easily be processed by computer programs. An additional XSLT style sheet makes it possible to display the data in an understandable and comprehensive way. This XSLT style sheet is delivered within the PGDSpider download.

The PGDSpider distribution also includes an XML Schema (PGD_schema.xsd), which defines the structure of the PGD file. The purpose of an XML Schema is to define the legal building blocks of an XML document and the allowable contents (W3Schools, 2008). The provided XML Schema can be used to validate a PGD file.

Data type handled

PGD is able to handle the following data types:

DNA
NGS (Next-Generation Sequencing data)
Microsat (coded as number of repeats!)
RFLP
SNP
AFLP
Standard
Frequency (Allele Frequency)
etc.

PGD format description

The PGD format is written in XML (eXtensible Markup Language) and can be created and edited in any text editor (file extension *.xml). An XML document has an ordered, labelled tree structure and must follow these rules:

An XML declaration needs to be included at the beginning of the file: <?xml version="1.0" encoding="iso-8859-1"?>
If a style sheet is used, it must be referenced after the declaration with the absolute or relative file path to the XSL style sheet: <?xml-stylesheet type="text/xsl" href="stylesheet_PGD.xsl"?>
A root element is needed. This element is “the parent” of all other elements and contains them. In the PGD file format the root element is named: <PGD>
All XML elements need to have a start and a closing tag and have to be properly nested
XML tags are case sensitive
Attribute values have to be within quotes
The characters “<” and “&” are strictly illegal within element text. They have to be replaced with the entity references “&lt;” (for “<”) and “&amp;” (for “&”).
Comments have to be written within “<!--” and “-->”: <!-- This is a comment -->
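
The rules above can be illustrated with a minimal sketch in Python (standard library only). The title, the organism text and the placement of <header> directly under <PGD> are illustrative assumptions based on the block descriptions in this document, not content taken from a real PGD file.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Illustrative text containing characters that must be escaped in XML.
organism = escape("Mus <sp.> & relatives")

# Hypothetical minimal PGD skeleton: declaration, root element, one nested block.
pgd_text = (
    '<?xml version="1.0" encoding="iso-8859-1"?>\n'
    '<PGD>\n'
    '  <!-- This is a comment -->\n'
    '  <header title="example">\n'
    f'    <organism>{organism}</organism>\n'
    '  </header>\n'
    '</PGD>\n'
)

# Parsing from bytes lets the parser honour the declared encoding.
root = ET.fromstring(pgd_text.encode("iso-8859-1"))
print(root.tag)
print(root.find("header/organism").text)
```

The parser turns the entity references back into the original characters, so the escaped form only exists in the file itself.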

The PGD file format has a block structure, and the information is stored hierarchically. The format is therefore very modular: general information can be stored at a higher level than information specific to one individual. This is convenient because general information needs to be written only once.
A short description of the different blocks can be found below:

Root Element:

The root element named “PGD” encapsulates all other elements of the XML file.

Header block:

The header block contains general information about the data. The tag is named “header” and can contain an attribute named “title=” that defines the title of the data. The header block has the following sub tags:

<organism> (optional):
- Value: String
- Indicates which organism the data come from
<numPop> (mandatory):
- Value: Integer
- Gives the number of populations listed in the file
<ploidy> (mandatory):
- Value: “mixed” or any Integer
- Specifies the ploidy level of the data
- It contains the value “mixed” if the ploidy level is not the same in every population or individual.
<missing> (mandatory):
- Value: Character
- Character which codes missing values
<gap> (optional):
- Value: Character
- Character which codes gaps
<gameticPhase> (optional):
- Value: “known” or “unknown”
- Defines whether the gametic phase of the genotypes is known
<recessiveData> (optional):
- Value: “no” or “yes”
- Defines whether the genotypic data include a recessive allele
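
As a rough illustration of the header block, the following Python sketch reads the sub tags into a dictionary and checks the three mandatory ones. The sample values and the helper name read_header are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up header block; all values are placeholders.
header_xml = """
<header title="example data">
  <organism>Homo sapiens</organism>
  <numPop>2</numPop>
  <ploidy>2</ploidy>
  <missing>?</missing>
  <gameticPhase>unknown</gameticPhase>
</header>
"""

MANDATORY = ("numPop", "ploidy", "missing")

def read_header(header):
    """Return the header sub tags as a dict and check the mandatory ones."""
    values = {child.tag: (child.text or "").strip() for child in header}
    absent = [tag for tag in MANDATORY if tag not in values]
    if absent:
        raise ValueError("missing mandatory header tags: " + ", ".join(absent))
    values["numPop"] = int(values["numPop"])
    # <ploidy> is either the literal "mixed" or an integer.
    if values["ploidy"] != "mixed":
        values["ploidy"] = int(values["ploidy"])
    return values

header = ET.fromstring(header_xml)
info = read_header(header)
print(header.get("title"), info)
```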

dataDescription block:

The dataDescription block contains specifications about the different loci. The tag is named “dataDescription” and contains the following sub tags:

<numLoci> (mandatory):
- Value: Integer
- Gives the number of loci studied
<dataType> (mandatory):
- Value: “mixed”, “DNA”, “NGS”, “Microsat”, “RFLP”, “SNP”, “AFLP”, “Standard”, “Frequency”, etc.
- Defines the data type of the data
- It has the value “mixed” if the data are of different data types
<locus> with attribute “id=” (optional):
- Defines the different loci in the file
- Can occur multiple times (once per locus in the data)
- The “id” attribute gives the name of the locus

The <locus> tag has the following sub tags:

<locusDataType> (optional):
- Value: “DNA”, “NGS”, “Microsat”, “RFLP”, “SNP”, “AFLP”, “Standard”, “Frequency”, etc.
- Only required if the <dataType> tag contains the value “mixed”
- Defines the data type of the locus
<locusChromosome> (optional):
- Value: Integer, “X”, “Y”, “W”, “Z”, “mtDNA”, etc.
- Gives the chromosome the locus comes from
<locusLocation> (optional):
- Value: Integer
- Gives the location/position of the locus on the chromosome
<locusGenic> (optional):
- Value: “coding” or “noncoding”
- Defines if the locus codes for a gene or not
<locusLength> (optional):
- Value: Integer
- Gives the length of the locus in number of bases
<locusAncestralState> (optional):
- Value: String
- Gives the ancestral state of the locus
<locusLinks> (optional):
- Value: String
- Here you can put internet links to locus information
<locusComments> (optional):
- Value: String
- Here you can put comments about the locus
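
The relationship between <dataType> and <locusDataType> can be sketched as follows in Python. The locus names, the values and the helper locus_types are made up for illustration; the sketch assumes that each <locus> carries a <locusDataType> whenever <dataType> is “mixed”.

```python
import xml.etree.ElementTree as ET

# Made-up dataDescription block with two loci of different data types.
desc_xml = """
<dataDescription>
  <numLoci>2</numLoci>
  <dataType>mixed</dataType>
  <locus id="locA">
    <locusDataType>Microsat</locusDataType>
    <locusChromosome>X</locusChromosome>
  </locus>
  <locus id="locB">
    <locusDataType>DNA</locusDataType>
    <locusLength>350</locusLength>
  </locus>
</dataDescription>
"""

def locus_types(desc):
    """Map each locus id to its data type."""
    overall = desc.findtext("dataType").strip()
    types = {}
    for locus in desc.findall("locus"):
        if overall == "mixed":
            # With mixed data, each locus has to state its own type.
            types[locus.get("id")] = locus.findtext("locusDataType").strip()
        else:
            types[locus.get("id")] = overall
    return types

desc = ET.fromstring(desc_xml)
types = locus_types(desc)
print(types)
```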

Population block:

The population block contains information about a population and its individuals, together with their data. This block can be repeated (once for each population in the sample). Its structure depends on whether the data are aligned and on whether all data are of the same data type. The tag is named “population” and can contain an attribute named “name=” which defines the name of the population. The population block has the following sub tags:

<popSize> (mandatory):
- Value: Integer
- Defines the number of individuals in the population
<popGeogCoord> (optional):
- Value: longitude, latitude
- Defines the geographic coordinates of the population
<popLingGroup> (optional):
- Value: String
- Defines the linguistic group which the population belongs to
<popPloidy> (optional):
- Value: “mixed” or any Integer
- Only required if the <ploidy> tag in the header block contains the value “mixed”
- Specifies the ploidy level of the data in this population
- It contains the value “mixed” if the ploidy level is different between different individuals.
<popLoci> (optional):
- Value: String, String, …
- Used if all individuals in this population have the same loci
- Defines the names of the loci in the data for this population, separated by “,”
- The loci have to be of the same data type
<ind> with attribute “name=” (mandatory):
- Defines the different individuals in this population
- Can occur multiple times (once per individual in this population)
- The “name” attribute gives the name of the individual

The <ind> tag has the following sub tags:

<indGeogCoord> (optional):
- Value: longitude, latitude
- Defines the geographic coordinates of the individual
<indLingGroup> (optional):
- Value: String
- Defines the linguistic group which the individual belongs to
<indLoci> (mandatory, if different data types):
- Value: String, String, …
- Only used if the data in this population are of different data types
- Defines the names of the loci with the same data type in this individual, separated by “,”
- The loci must be of the same data type
<indPloidy> (optional):
- Value: Integer
- Only required if the <popPloidy> tag in the population block contains the value “mixed”
- Specifies the ploidy level of the data in this individual
<indFreq> (optional, but mandatory for the “Frequency” data type):
- Value: Integer
- Defines the absolute frequency of this genotype in the population
<data> (mandatory, if non-NGS data):
- Value: locus data, locus data, …
- Can occur multiple times (as many as different reads in this individual)
- Contains the data of one read of each specified locus (same order as the locus names), separated by “,”
<read> with attribute “id=” (mandatory, if NGS data (DNA with several reads)):
- Defines the different reads in this individual
- Can occur multiple times (once per read in this individual)
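
For non-NGS data, the following Python sketch shows one way to pair each <data> entry with the locus names from <popLoci> and to resolve the ploidy level from the most specific tag that is not “mixed”. The sample population, the microsatellite values and the helper names are hypothetical, and the sketch assumes that <popPloidy> and <indPloidy> are present whenever the level above them is “mixed”.

```python
import xml.etree.ElementTree as ET

# Made-up population with one diploid individual and two microsatellite loci.
pop_xml = """
<population name="pop1">
  <popSize>1</popSize>
  <popPloidy>mixed</popPloidy>
  <popLoci>locA, locB</popLoci>
  <ind name="ind1">
    <indPloidy>2</indPloidy>
    <data>12, 7</data>
    <data>14, 7</data>
  </ind>
</population>
"""

def split_list(text):
    """Split a comma-separated value and trim white space around items."""
    return [item.strip() for item in text.split(",")]

def ploidy_of(ind, pop, header_ploidy):
    """Use the most specific ploidy level that is not 'mixed'."""
    level = header_ploidy
    if level == "mixed":
        level = pop.findtext("popPloidy").strip()
    if level == "mixed":
        level = ind.findtext("indPloidy").strip()
    return int(level)

pop = ET.fromstring(pop_xml)
loci = split_list(pop.findtext("popLoci"))
ind = pop.find("ind")
# One dict per <data> element, keyed by locus name (same order as <popLoci>).
rows = [dict(zip(loci, split_list(d.text))) for d in ind.findall("data")]
ploidy = ploidy_of(ind, pop, header_ploidy="mixed")
print(ind.get("name"), ploidy, rows)
```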

The <read> tag has the following sub tags:

<start> (mandatory):
- Value: Integer
- Defines the start point of the sequence
<length> (optional):
- Value: Integer
- Gives the length of the sequence
<data> (mandatory):
- Value: locus data
- Contains the data of one read of the specified locus
<quality> (optional):
- Value: white space separated Integers
- Contains the quality scores of the read
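
A possible way to read the <read> elements of an NGS individual is sketched below in Python. The sequences, positions and quality scores are invented, and falling back to the sequence length when <length> is absent is an assumption of this sketch, not something the format prescribes.

```python
import xml.etree.ElementTree as ET

# Made-up individual with two reads; the second omits the optional tags.
ind_xml = """
<ind name="ind1">
  <read id="r1">
    <start>101</start>
    <length>8</length>
    <data>ACGTTGCA</data>
    <quality>30 30 28 31 25 30 30 29</quality>
  </read>
  <read id="r2">
    <start>105</start>
    <data>TGCAAC</data>
  </read>
</ind>
"""

def parse_read(read):
    """Convert one <read> element into a plain dict."""
    seq = read.findtext("data").strip()
    quality = read.findtext("quality")
    length = read.findtext("length")
    return {
        "id": read.get("id"),
        "start": int(read.findtext("start")),
        # <length> is optional; fall back to the sequence length (assumption).
        "length": int(length) if length is not None else len(seq),
        "data": seq,
        # <quality> is optional; str.split() handles any white space.
        "quality": [int(q) for q in quality.split()] if quality else None,
    }

reads = [parse_read(r) for r in ET.fromstring(ind_xml).findall("read")]
for r in reads:
    print(r["id"], r["start"], r["length"], r["data"])
```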

Structure block:

The structure block is optional. It contains information about the structure of the populations (grouping). The tag is named “structure” and can contain an attribute named “name=” which defines the name of the structure. The structure block has the following sub tags:

<numGroups> (mandatory):
- Value: Integer
- Defines the number of groups
<group> with attribute “name=” (mandatory):
- Value: String, String, …
- Defines the populations which belong to this group. The population names are separated by “,”
- Can occur multiple times (once per group)
- The “name” attribute gives the name of the group
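
The grouping can be read into a simple mapping from group name to population names, as in this Python sketch. The group and population names are placeholders, and the consistency check against <numGroups> is an addition of this sketch.

```python
import xml.etree.ElementTree as ET

# Made-up structure block with two groups.
structure_xml = """
<structure name="regions">
  <numGroups>2</numGroups>
  <group name="north">pop1, pop2</group>
  <group name="south">pop3</group>
</structure>
"""

structure = ET.fromstring(structure_xml)

# Map each group name to the list of population names it contains.
groups = {
    group.get("name"): [pop.strip() for pop in group.text.split(",")]
    for group in structure.findall("group")
}

# Check that the declared number of groups matches what was found.
declared = int(structure.findtext("numGroups"))
if declared != len(groups):
    raise ValueError(f"numGroups is {declared}, but {len(groups)} groups were found")
print(groups)
```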

distanceMatrix block:

The distanceMatrix block is optional. It contains the genetic distances between individuals. The tag is named “distanceMatrix” and can contain an attribute named “name=” which defines the name of the distance matrix. The distanceMatrix block has the following sub tags:

<matrixSize> (mandatory):
- Value: Integer
- Defines the number of individuals compared to each other
<matrixLabels> (mandatory):
- Value: String, String, …
- Defines the labels of the distance matrix separated by a “,”
<matrix> (mandatory):
- Value: Integer (line break) Integer, Integer (line break) …
- Gives the genetic distances of each specified individual to each other (same order as in the <matrixLabels> tag)
- Data have to be given as a lower triangle including the diagonal. Lines are separated by a line break and values by a “,”
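
The lower-triangle layout can be expanded into a full symmetric matrix, as in the following Python sketch. The labels and distances are invented, the helper read_matrix is hypothetical, and values are parsed as floats to be lenient although the description above lists Integer values.

```python
import xml.etree.ElementTree as ET

# Made-up distance matrix for three individuals (lower triangle with diagonal).
matrix_xml = """
<distanceMatrix name="example">
  <matrixSize>3</matrixSize>
  <matrixLabels>ind1, ind2, ind3</matrixLabels>
  <matrix>
    0
    4, 0
    7, 2, 0
  </matrix>
</distanceMatrix>
"""

def read_matrix(block):
    """Return the labels and the full symmetric matrix as nested lists."""
    size = int(block.findtext("matrixSize"))
    labels = [s.strip() for s in block.findtext("matrixLabels").split(",")]
    lines = [ln for ln in block.findtext("matrix").splitlines() if ln.strip()]
    if len(labels) != size or len(lines) != size:
        raise ValueError("matrixSize does not match labels or rows")
    full = [[0.0] * size for _ in range(size)]
    for i, line in enumerate(lines):
        values = [float(v) for v in line.split(",")]
        # Row i of a lower triangle with diagonal holds i + 1 values.
        if len(values) != i + 1:
            raise ValueError(f"row {i + 1} should hold {i + 1} values")
        for j, value in enumerate(values):
            full[i][j] = full[j][i] = value
    return labels, full

labels, full = read_matrix(ET.fromstring(matrix_xml))
print(labels)
print(full)
```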

Schema of the PGD format

Specifications

root element: <PGD>
header/loci block:
- obligatory
- only one per file
population block:
- obligatory
- can exist multiple times
structure/ distanceMatrix block:
- optional
- only one per file

Microsat data are number of repeats
distanceMatrix: lower triangle with diagonal
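
The occurrence rules above can be checked with a few lines of Python, sketched here under the assumption that all blocks are direct children of <PGD>. The helper check_blocks and the two tiny test documents are hypothetical. The dataDescription block is left out of the check because the schema table in this document marks it with (o).

```python
import xml.etree.ElementTree as ET

# (minimum, maximum) occurrences per block; None means no upper limit.
BLOCK_COUNTS = {
    "header": (1, 1),
    "population": (1, None),
    "structure": (0, 1),
    "distanceMatrix": (0, 1),
}

def check_blocks(root):
    """Return a list of problems; an empty list means the counts are fine."""
    problems = []
    if root.tag != "PGD":
        problems.append(f"root element is <{root.tag}>, expected <PGD>")
    for tag, (low, high) in BLOCK_COUNTS.items():
        count = len(root.findall(tag))
        if count < low or (high is not None and count > high):
            problems.append(f"<{tag}> occurs {count} time(s)")
    return problems

ok = ET.fromstring("<PGD><header/><population/><population/></PGD>")
bad = ET.fromstring("<PGD><header/><structure/><structure/></PGD>")
print(check_blocks(ok))
print(check_blocks(bad))
```

A check like this only covers block counts; validating against the XML Schema shipped with PGDSpider would cover the full structure.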

Schema

				data type
block	subtags			NGS, Microsat, RFLP, SNP, AFLP, Standard	NGS	Allele frequency

header (attribute: title)	organism			o	o	o
	numPop			x	x	x
	ploidy * (-> mixed/1/2/…)			x (a4)	x (a4)
	missing			x	x
	gap			o	o
	gameticPhase (-> known/unknown)			o	o
	recessiveData (-> yes/no)			o	o
dataDescription (o)	numLoci			x	x
	dataType * (-> mixed/DNA/NGS/Microsat/RFLP/AFLP/Standard/Frequency/…)			x (a1)	x (a1)	x
	locus (attribute: id)	locusDataType (-> DNA/NGS/Microsat/RFLP/AFLP/Standard/Frequency/…)		a1	a1
		locusChromosome (-> number/X/Y/W/Z/mtDNA/…)		o	o
		locusLocation		o	o
		locusGenic (-> coding/noncoding)		o	o
		locusLength		o	o
		locusAncestralState		o	o
		locusLinks (-> URL)		o	o
		locusComments		o	o
population (attribute: name)	popSize			x	x	x
	popGeogCoord * (lon, lat)			o (a2)	o (a2)	o
	popLingGroup *			o (a3)	o (a3)	o
	popPloidy * (-> mixed/1/2/…)			a4	a4
	popLoci *³ (locus name, locus name, …) -> all loci of same data type*			o	o
	ind (attribute: name)	indGeogCoord (lon, lat)		o (a2)	o (a2)
		indLingGroup		o (a3)	o (a3)
		indLoci *⁴ (locus name, locus name, …) -> all loci of same data type*		o	o
		indPloidy (-> 1/2/…)		a4	a4
		indFreq (absolute Freq)		o	o	x
		data *⁵ (locus data, locus data, …)*		x	x
		read *⁶ (attribute: id)	start *⁶		x
			length *⁶		o
			data *⁶		x
			quality *⁶		o
structure (attribute: name) (o)	numGroups			x	x	x
	group (attribute: name) (pop name, pop name, …)			x	x	x
distanceMatrix (attribute: name) (o)	matrixSize			x	x	x
	matrixLabels (name, name, …)			x	x	x
	matrix (number (line break) number, number (line break) …)			x	x	x

Legend

* if the same for all populations/individuals
*² if it exists -> align DNA over one individual; if it does not exist -> align DNA over one population
*³ data of the same data type (loci) in all individuals
*⁴ data of different data types (aligned within each locus)
*⁵ non-NGS data
*⁶ NGS data (Next-Generation Sequencing)

x: obligatory
o: optional
a: alternative to

PGD file examples:

Data of two loci with Standard data type from four diploid populations: PGD_standard
Data of two loci with different data types (Standard and DNA) from two diploid populations: PGD_diffDataTypes
NGS data of two loci from three haploid populations: PGD_NGS

PGD style sheet

The style sheet allows you to display the PGD files in a nicer way:

stylesheet: stylesheet_pgd.pdf

old versions

Master's thesis, Heidi Lischer
