--- 10.02.08 [2008/06/10 09:15] – created heidi
+++ 10.02.08 [2008/07/22 13:31] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
 ====== e-mail: 10.02.08 ======
+Dear Howard,\\
+Howard Cann wrote:
+> Dear Laurent,
+>
+> Until now, the HGDP-CEPH diversity panel database has stored and
+> displayed marker genotypes generated on the panel population samples.
+> It is time that we receive sequences from panel users who are
+> resequencing in the panel in order to study human variation, estimate
+> diversity indices, describe human demography/history, etc.
+Sounds good!
+> What file formats should we be ready to receive?
+Some flat file format would be good to have, but I think there is no
+general agreement on how these large resequencing files should be
+formatted. I have an MSc student with whom we are beginning to think
+about such a format. We are investigating the possibility of an
+XML-coded file that would be efficient for resequencing data, but we
+are still in the first development phase.
+> How should we suggest that contributors code their sequence data (nt
+> letters or numbers), code missing bases, etc.?
+I guess it would be most useful if sequences were grouped by
+population, with info on:\\
+Sequence region (chromosomal region)
+Sequence begin
+Sequence length
+Population where it was sequenced
+Individual in which it was sequenced
+Geographic coordinate of the population or of the individual
+Linguistic group or language family of the individual or population
+Tag indicating whether sequence phase has been inferred, with a pointer
+to the other complementary sequence
+Nucleotides should be coded as ACGT, with ? for missing data (which
+makes intuitive sense) and the common letters for ambiguous nucleotide
+assignments
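The fields and coding rules listed above could be captured in a small record type. The following is only a minimal sketch in Python, not an agreed format: the class name `SequenceRecord`, all field names, and the example values are hypothetical, and it assumes that the "common letters" for ambiguous nucleotides are the IUPAC ambiguity codes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: ambiguity letters follow the IUPAC nucleotide codes.
UNAMBIGUOUS = set("ACGT")
AMBIGUOUS = set("RYSWKMBDHVN")
MISSING = "?"  # missing data, as suggested in the e-mail
ALLOWED = UNAMBIGUOUS | AMBIGUOUS | {MISSING}


@dataclass
class SequenceRecord:
    """One resequenced fragment with the metadata listed in the e-mail (hypothetical layout)."""
    region: str                  # chromosomal region, e.g. "chr2" (naming is an assumption)
    begin: int                   # position where the sequence begins
    sequence: str                # ACGT, ambiguity letters, '?' for missing bases
    population: str              # population in which it was sequenced
    individual: str              # individual in which it was sequenced
    coordinates: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    language_family: Optional[str] = None
    phase_inferred: bool = False          # True if phase was inferred
    complement_id: Optional[str] = None   # pointer to the complementary sequence

    def __post_init__(self) -> None:
        # Normalise to upper case and reject symbols outside the allowed alphabet.
        self.sequence = self.sequence.upper()
        bad = sorted(set(self.sequence) - ALLOWED)
        if bad:
            raise ValueError(f"unsupported symbols: {bad}")

    @property
    def length(self) -> int:
        # Sequence length is derived from the data rather than stored separately.
        return len(self.sequence)

    def missing_fraction(self) -> float:
        # Share of positions coded as missing ('?').
        return self.sequence.count(MISSING) / self.length if self.length else 0.0


# Usage with made-up values:
rec = SequenceRecord(region="chr2", begin=1000, sequence="acgt?rac",
                     population="PopA", individual="IND001")
print(rec.length, rec.missing_fraction())
```

Deriving the length from the sequence, rather than storing it as a separate field, avoids the two values getting out of sync; a flat-file or XML layout could still write it out explicitly.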
+> What other questions should I be asking you in order to set up a
+> sequence db that will be useful to scientists in the field of human
+> population genetics?
+Some information on whether it is coding sequence or not, with the start
+of the coding region, would be nice to have. A link to another
+database (e.g. Ensembl) where additional information can be found would
+be nice as well.
+> I think that CEPH should be concerned with managing and maintaining
+> the sequences in the db and not with computing various parameters of
+> polymorphism, diversity etc. from them, which most of the panel users
+> are capable of doing.
+Yes, you are right, but some summary statistics could be useful to
+compute.\\
+It would also be nice to be able to extract, say, all sequences or
+polymorphisms in a given chromosomal region.\\
+Cheers
+laurent
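The region-based extraction mentioned in the reply could look roughly like the sketch below. It is an illustration only: the `Stored` tuple, the `extract_region` function, and the 1-based inclusive coordinates are assumptions, not part of any existing HGDP-CEPH database interface.

```python
from typing import Iterable, List, NamedTuple


class Stored(NamedTuple):
    """Minimal stored sequence (hypothetical layout)."""
    region: str      # chromosome name, e.g. "chr2" (naming is an assumption)
    begin: int       # 1-based position of the first base
    sequence: str
    individual: str


def extract_region(records: Iterable[Stored], region: str,
                   start: int, end: int) -> List[Stored]:
    """Return the parts of stored sequences overlapping [start, end] (1-based, inclusive)."""
    hits = []
    for rec in records:
        rec_end = rec.begin + len(rec.sequence) - 1
        # Skip sequences on another chromosome or entirely outside the window.
        if rec.region != region or rec_end < start or rec.begin > end:
            continue
        lo = max(start, rec.begin)
        hi = min(end, rec_end)
        # Convert genomic positions to string offsets within this record.
        piece = rec.sequence[lo - rec.begin: hi - rec.begin + 1]
        hits.append(Stored(region, lo, piece, rec.individual))
    return hits


# Usage with made-up records:
records = [
    Stored("chr2", 100, "ACGTACGTAC", "IND001"),
    Stored("chr2", 200, "TTTT", "IND002"),
    Stored("chr3", 100, "GGGG", "IND003"),
]
for hit in extract_region(records, "chr2", 105, 203):
    print(hit)
```

A linear scan is enough for illustration; a real database would more likely index sequences by region and start position.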