10.02.08
10.02.08 [2008/06/10 09:15] – created heidi | 10.02.08 [2008/07/22 13:31] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
====== e-mail: 10.02.08 ====== | ====== e-mail: 10.02.08 ====== | ||
+ | Dear Howard,\\ | ||
+ | |||
+ | Howard Cann wrote: | ||
+ | > Dear Laurent, | ||
+ | > | ||
+ | > Until now, the HGDP-CEPH diversity panel database has stored and | ||
+ | > displayed marker genotypes generated on the panel population samples. | ||
+ | > It is time that we receive sequences from panel users who are | ||
+ | > resequencing in the panel in order to study human variation, estimate | ||
+ | > diversity indices, describe human demography/ | ||
+ | |||
+ | Sounds good! | ||
+ | |||
+ | > What file formats should we be ready to receive? | ||
+ | |||
+ | Some flat file format would be good to have, but I think there is no | ||
+ | general agreement on how these large resequencing files should be | ||
+ | formatted. I have an MSc student with whom we are beginning to think | ||
+ | about such a format. We are investigating the possibility of an | ||
+ | XML-coded file that would be efficient for resequencing data, but we | ||
+ | are still in the first development phase. | ||
+ | |||
+ | > How should we suggest to contributors to code their sequence data (nt | ||
+ | > letters or numbers), to code missing bases.......? | ||
+ | |||
+ | I guess it would be most useful if sequences were grouped by | ||
+ | population, with info on:\\ | ||
+ | |||
+ | Sequence region (chromosomal region) | ||
+ | Sequence begin | ||
+ | Sequence length | ||
+ | Population where it was sequenced | ||
+ | Individual in which it was sequenced | ||
+ | Geographic coordinate of the population or of the individual | ||
+ | Linguistic group or language family of the individual or population | ||
+ | Tag indicating whether sequence phase has been inferred, with a pointer | ||
+ | to the other, complementary sequence | ||
+ | Nucleotides should be coded as ACGT, with ? for missing data (which | ||
+ | makes intuitive sense), and otherwise the common letters for ambiguous | ||
+ | nucleotide assignment | ||
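
The fields listed above could be captured in a small record type. The following is only a minimal sketch in Python, not an agreed format: the class name, the field names, and the identifiers `POP_A` and `IND_001` are hypothetical, and reading "common letters for ambiguous nucleotide assignment" as the IUPAC ambiguity codes is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: the "common letters" for ambiguous nucleotides are taken to be
# the IUPAC ambiguity codes; the e-mail itself does not name a standard.
AMBIGUITY_CODES = set("RYSWKMBDHVN")
ALLOWED_SYMBOLS = set("ACGT") | {"?"} | AMBIGUITY_CODES


@dataclass
class SequenceRecord:
    """One resequenced fragment with the metadata listed in the e-mail."""
    region: str                      # chromosomal region, e.g. "chr1"
    begin: int                       # start position of the sequence
    sequence: str                    # ACGT, "?" for missing, ambiguity codes
    population: str                  # population in which it was sequenced
    individual: str                  # individual in which it was sequenced
    coordinates: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    language_family: Optional[str] = None
    phase_inferred: bool = False     # True if the phase was inferred
    partner_id: Optional[str] = None  # pointer to the complementary sequence

    def __post_init__(self) -> None:
        # Normalise case, then reject any symbol outside the allowed alphabet.
        self.sequence = self.sequence.upper()
        invalid = sorted(set(self.sequence) - ALLOWED_SYMBOLS)
        if invalid:
            raise ValueError(f"unexpected symbols in sequence: {invalid}")

    @property
    def length(self) -> int:
        # Sequence length is derived from the sequence rather than stored.
        return len(self.sequence)

    def missing_count(self) -> int:
        # Number of positions coded as missing data ("?").
        return self.sequence.count("?")


# Hypothetical example record.
record = SequenceRecord(
    region="chr1", begin=1000, sequence="acg?t",
    population="POP_A", individual="IND_001",
)
```

Deriving the length from the sequence avoids the two getting out of step; a flat-file or XML serialisation could be layered on top of such a record later.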
+ | |||
+ | > What other questions should I be asking you in order to set up a | ||
+ | > sequence db that will be useful to scientists in the field of human | ||
+ | > population genetics. | ||
+ | |||
+ | Some information on whether the sequence is coding or not, with the | ||
+ | start of the coding region, would be nice to have. A link to another | ||
+ | database (e.g. Ensembl) where additional information can be found would | ||
+ | be nice as well. | ||
+ | |||
+ | > I think that CEPH should be concerned with managing and maintaining | ||
+ | > the sequences in the db and not with computing various parameters of | ||
+ | > polymorphism, | ||
+ | > are capable of doing. | ||
+ | |||
+ | Yes, you are right, but some summary statistics could be useful to | ||
+ | compute.\\ | ||
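
As an illustration of the kind of summary statistic meant here, the sketch below computes nucleotide diversity (the average number of pairwise differences per site) for a set of aligned sequences. The choice of statistic and the function names are assumptions, not something settled in this exchange, and the handling of missing data is deliberately simplified: positions where either sequence is not A, C, G or T are not counted as differences, and the denominator is not adjusted for them.

```python
from itertools import combinations
from typing import Sequence


def pairwise_differences(a: str, b: str) -> int:
    """Count sites where two aligned sequences carry different called bases."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1 for x, y in zip(a, b)
        if x in "ACGT" and y in "ACGT" and x != y
    )


def nucleotide_diversity(sequences: Sequence[str]) -> float:
    """Average pairwise differences per site over all pairs of sequences."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are needed")
    length = len(sequences[0])
    if length == 0:
        raise ValueError("sequences must not be empty")
    pairs = list(combinations(sequences, 2))
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / len(pairs) / length


# Hypothetical toy alignment of three sequences of four sites each.
pi = nucleotide_diversity(["ACGT", "ACGA", "ACGT"])
```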
+ | |||
+ | It would also be nice to be able to extract, say, all sequences or | ||
+ | polymorphisms in a given chromosomal region.\\ | ||
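
Such a region query could look roughly like the sketch below. It is only an illustration under stated assumptions: the `StoredSequence` type and both function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the polymorphism scan considers only called bases (A, C, G, T), ignoring missing and ambiguous symbols.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class StoredSequence(NamedTuple):
    region: str    # chromosomal region, e.g. "chr1"
    begin: int     # assumed 1-based start position
    sequence: str  # ACGT, "?" for missing data


def sequences_in_region(records: Iterable[StoredSequence], region: str,
                        start: int, end: int) -> List[StoredSequence]:
    """Return records on `region` that overlap the interval [start, end]."""
    hits = []
    for rec in records:
        rec_end = rec.begin + len(rec.sequence) - 1
        if rec.region == region and rec.begin <= end and rec_end >= start:
            hits.append(rec)
    return hits


def polymorphic_positions(records: Iterable[StoredSequence]) -> List[int]:
    """Positions at which more than one distinct called base is observed."""
    observed = defaultdict(set)
    for rec in records:
        for offset, base in enumerate(rec.sequence):
            if base in "ACGT":
                observed[rec.begin + offset].add(base)
    return sorted(pos for pos, bases in observed.items() if len(bases) > 1)


# Hypothetical toy data set.
stored = [
    StoredSequence("chr1", 100, "ACGT"),
    StoredSequence("chr1", 100, "ACCT"),
    StoredSequence("chr2", 100, "AAAA"),
    StoredSequence("chr1", 500, "TTTT"),
]
hits = sequences_in_region(stored, "chr1", 101, 103)
sites = polymorphic_positions(hits)
```

A linear scan like this is fine for small collections; a real database would more likely index sequences by region and start position.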
+ | |||
+ | Cheers | ||
+ | laurent | ||
10.02.08.1213082158.txt.gz · Last modified: 2008/07/22 13:29 (external edit)