Master thesis meetings
- 21.11.2007: converter: data formats
- 14.12.2007: data formats/universal format, R-lequin: Arlequin output
- 21.01.2008: file format
- 12.02.2008: file format
- 03.03.2008: file format/ converter
- 14.03.2008: Arlequin output
- 17.03.2008: file format
- …
- 06.06.2008: converter
e-mails
10.02.2008:
Dear Howard,
Howard Cann wrote:
Dear Laurent,
Until now, the HGDP-CEPH diversity panel database has stored and
displayed marker genotypes generated on the panel population samples.
It is time that we receive sequences from panel users who are
resequencing in the panel in order to study human variation, estimate
diversity indices, describe human demography/history, etc.
Sounds good!
What file formats should we be ready to receive?
Some flat file format would be good to have, but I think there is no general agreement on how these large resequencing files should be formatted. I have an MSc student with whom we are beginning to think about such a format. We are investigating the possibility of an XML-coded file that would be efficient for resequencing data, but we are still in the first development phase.
How should we suggest that contributors code their sequence data (nt
letters or numbers), code missing bases, etc.?
I guess it would be most useful if sequences would be grouped by 
population, with info on:
- Sequence region (chromosomal region)
- Sequence begin
- Sequence length
- Population where it was sequenced
- Individual in which it was sequenced
- Geographic coordinates of the population or of the individual
- Linguistic group or language family of the individual or population
- Tag indicating whether sequence phase has been inferred, with a pointer to the other complementary sequence
Nucleotides should be coded as ACGT, with ? for missing data (which makes intuitive sense), and otherwise the common letters for ambiguous nucleotide assignment.
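The fields listed above could be held in a record along the following lines. This is only a minimal sketch, not an agreed format: the class name `SequenceRecord`, all attribute names, and the choice to derive the length from the sequence string are assumptions made for illustration. "Common letters for ambiguous nucleotide assignment" is read here as the standard IUPAC ambiguity codes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# The four bases plus the standard IUPAC ambiguity letters.
IUPAC_CODES = set("ACGT") | set("RYSWKMBDHVN")
MISSING = "?"  # missing-data symbol proposed in the e-mail


@dataclass
class SequenceRecord:
    """Hypothetical record; field names are illustrative only."""
    region: str        # chromosomal region
    begin: int         # position where the sequence begins
    sequence: str      # nucleotides coded as ACGT, '?' for missing data
    population: str    # population where it was sequenced
    individual: str    # individual in which it was sequenced
    coordinates: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    language_family: Optional[str] = None
    phase_inferred: bool = False             # True if phase was inferred
    complementary_id: Optional[str] = None   # pointer to the other sequence

    @property
    def length(self) -> int:
        # Sequence length, derived here rather than stored separately.
        return len(self.sequence)

    def invalid_symbols(self) -> set:
        # Symbols that are neither IUPAC codes nor the missing-data mark.
        return set(self.sequence.upper()) - IUPAC_CODES - {MISSING}


rec = SequenceRecord("chr1", 1000, "ACGT?RACGN", "PopulationA", "Ind001",
                     phase_inferred=True, complementary_id="Ind001_b")
bad = SequenceRecord("chr1", 1, "ACXT-", "PopulationA", "Ind002")
```

A contributed file could then be checked on receipt by rejecting any record for which `invalid_symbols()` is not empty.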
What other questions should I be asking you in order to set up a
sequence db that will be useful to scientists in the field of human
population genetics?
Some information on whether it is coding sequence or not, with the start of the coding region, would be nice to have. A link to another database (e.g. Ensembl) where additional information can be found would be nice as well.
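As a rough illustration of what an XML-coded entry carrying this information might look like, the sketch below serialises one sequence with Python's standard `xml.etree.ElementTree`. No schema has been settled yet, so every element and attribute name here (`sequence`, `codingStart`, `xref`, `data`) is hypothetical, as is the placeholder identifier.

```python
import xml.etree.ElementTree as ET


def record_to_xml(region, begin, sequence, population, individual,
                  coding=False, coding_start=None, ensembl_id=None):
    """Build one <sequence> element; all names are illustrative only."""
    elem = ET.Element("sequence", {
        "region": region,
        "begin": str(begin),
        "length": str(len(sequence)),
        "population": population,
        "individual": individual,
        "coding": "yes" if coding else "no",
    })
    # Record the start of the coding region only for coding sequences.
    if coding and coding_start is not None:
        elem.set("codingStart", str(coding_start))
    # Optional link to an external database such as Ensembl.
    if ensembl_id is not None:
        ET.SubElement(elem, "xref", {"db": "ensembl", "id": ensembl_id})
    ET.SubElement(elem, "data").text = sequence
    return elem


elem = record_to_xml("chr2", 5000, "ACGT??AC", "PopulationA", "Ind001",
                     coding=True, coding_start=5003,
                     ensembl_id="HYPOTHETICAL_ID")
xml_text = ET.tostring(elem, encoding="unicode")
print(xml_text)
```

Keeping the metadata in attributes and the nucleotides in a single text node is just one possible layout. A more compact encoding may be needed for large resequencing files.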
I think that CEPH should be concerned with managing and maintaining
the sequences in the db and not with computing various parameters of
polymorphism, diversity etc. from them, which most of the panel users
are capable of doing.
Yes, you are right, but some summary statistics could be useful to
compute.
It would also be nice to be able to extract, say, all sequences or
polymorphisms in a given chromosomal region.
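Such an extraction could work roughly as follows. This is a sketch under stated assumptions: records are plain dictionaries with hypothetical keys (`region`, `begin`, `sequence`, `individual`), and positions are treated as half-open intervals, which the e-mail does not specify.

```python
def overlaps(seq_begin, seq_length, query_begin, query_end):
    """True if [seq_begin, seq_begin + seq_length) meets [query_begin, query_end)."""
    seq_end = seq_begin + seq_length
    return seq_begin < query_end and query_begin < seq_end


def extract_region(records, region, query_begin, query_end):
    """Return (individual, subsequence) pairs falling inside the query window."""
    hits = []
    for rec in records:
        if rec["region"] != region:
            continue
        begin, seq = rec["begin"], rec["sequence"]
        if overlaps(begin, len(seq), query_begin, query_end):
            # Clip the stored sequence to the part inside the window.
            lo = max(query_begin, begin) - begin
            hi = min(query_end, begin + len(seq)) - begin
            hits.append((rec["individual"], seq[lo:hi]))
    return hits


records = [
    {"region": "chr1", "begin": 100, "sequence": "ACGTACGTAC", "individual": "Ind001"},
    {"region": "chr1", "begin": 200, "sequence": "TTTT", "individual": "Ind002"},
    {"region": "chr2", "begin": 100, "sequence": "GGGG", "individual": "Ind003"},
]
hits = extract_region(records, "chr1", 105, 203)
```

A linear scan like this is fine for small sets. A real database would presumably index sequences by region and start position instead.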
Cheers laurent
09.06.2008
Hi Heidi,
please have a look at the following paper and program…
http://www.blackwell-synergy.com/links/doi/10.1111/j.1471-8286.2007.02036.x
It would be worth looking at…
cheers laurent
