
Master thesis meetings

21.11.2007: converter: data formats
14.12.2007: data formats/universal format, R-lequin: Arlequin output
21.01.2008: file format
12.02.2008: file format
03.03.2008: file format/converter
14.03.2008: Arlequin output
17.03.2008: file format
…
06.06.2008: converter

e-mails

10.02.2008:

Dear Howard,

Howard Cann wrote:
> Dear Laurent,
>
> Until now, the HGDP-CEPH diversity panel database has stored and
> displayed marker genotypes generated on the panel population samples. It
> is time that we receive sequences from panel users who are 
> resequencing in the panel in order to study human variation, estimate 
> diversity indices, describe human demography/history, etc. 

Sounds good!

> What file formats should we be ready to receive? 

Some flat file format would be good to have, but I think there is no 
general agreement on how these large resequencing files should be 
formatted. I have an MSc student with whom we are beginning to think 
about such a format. We are investigating the possibility of an 
XML-coded file that would be efficient for resequencing data, but we 
are still in the first development phase.
> How should we suggest to contributors to code their sequence data (nt 
> letters or numbers), to code missing bases.......?  Etc. 
I guess it would be most useful if sequences were grouped by 
population, with info on:

Sequence region (chromosomal region)
Sequence begin
Sequence length
Population where it was sequenced
Individual in which it was sequenced
Geographic coordinate of the population or of the individual
Linguistic group or language family of the individual or population
Tag indicating whether sequence phase has been inferred, with pointer to 
the other complementary sequence
Nucleotides should be coded as ACGT, with ? for missing data (which makes 
intuitive sense), and otherwise the common letters for ambiguous 
nucleotide assignment
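
A minimal sketch of how the fields listed above could be held in one record and checked against the suggested coding, assuming Python. The class name, field names and helper functions are hypothetical and are not an agreed format. "Common letters for ambiguous nucleotide assignment" is taken here to mean the IUPAC ambiguity codes, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Allowed symbols: plain bases, IUPAC ambiguity codes (assumed), and "?" for missing data.
UNAMBIGUOUS = set("ACGT")
AMBIGUOUS = set("RYSWKMBDHVN")
MISSING = "?"
ALLOWED = UNAMBIGUOUS | AMBIGUOUS | {MISSING}


@dataclass
class SequenceRecord:
    """One resequenced fragment; all field names are illustrative only."""
    region: str                                 # chromosomal region
    begin: int                                  # sequence begin
    length: int                                 # sequence length
    population: str                             # population where it was sequenced
    individual: str                             # individual in which it was sequenced
    coordinates: Optional[Tuple[float, float]]  # (latitude, longitude), if known
    language_family: Optional[str]              # linguistic group or language family
    phase_inferred: bool                        # True if the phase was inferred
    complement_id: Optional[str]                # pointer to the complementary sequence
    sequence: str                               # ACGT, "?" or ambiguity letters


def invalid_symbols(record: SequenceRecord) -> list:
    """Return the sorted symbols in the sequence that are not allowed."""
    return sorted(set(record.sequence.upper()) - ALLOWED)


def is_consistent(record: SequenceRecord) -> bool:
    """Check that the declared length matches and every symbol is allowed."""
    return len(record.sequence) == record.length and not invalid_symbols(record)
```

Grouping by population would then amount to sorting or bucketing such records on their `population` field.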

> What other questions should I be asking you in order to set up a 
> sequence db that will be useful to scientists in the field of human 
> population genetics. 
Some information on whether it is coding sequence or not, with the start
of the coding region, would be nice to have. A link to another
database (e.g. Ensembl) where additional information can be found would
be nice as well.
> I think that CEPH should be concerned with managing and maintaining 
> the sequences in the db and not with computing various parameters of 
> polymorphism, diversity etc. from them, which most of the panel users 
> are capable of doing.
Yes, you are right, but some summary statistics could be useful to
compute.

It would also be nice to be able to extract, say, all sequences or 
polymorphisms in a given chromosomal region.
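
As a rough illustration of such an extraction, here is a sketch assuming Python. It assumes each stored sequence is described by a region name, a begin position and a length, with 1-based inclusive coordinates. The dictionary keys and function names are hypothetical.

```python
def overlaps(begin, length, query_start, query_end):
    """True if the interval [begin, begin + length - 1] intersects [query_start, query_end]."""
    end = begin + length - 1
    return begin <= query_end and end >= query_start


def in_region(records, region, query_start, query_end):
    """Return the records on `region` that overlap the queried interval."""
    return [
        r for r in records
        if r["region"] == region
        and overlaps(r["begin"], r["length"], query_start, query_end)
    ]


# Example with made-up records (illustrative values only).
records = [
    {"id": "a", "region": "chr1", "begin": 100, "length": 50},
    {"id": "b", "region": "chr1", "begin": 200, "length": 10},
    {"id": "c", "region": "chr2", "begin": 100, "length": 50},
]
hits = in_region(records, "chr1", 140, 199)
```

A real database would more likely do this with an indexed query than a linear scan; the sketch only shows the overlap test.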

Cheers

laurent

09.06.2008:

Hi Heidi,

please have a look at the following paper and program...

http://www.blackwell-synergy.com/links/doi/10.1111/j.1471-8286.2007.02036.x

It would be worth looking at...

cheers
laurent