====== SOLEXA ======

e-mail:

Dear all,

On Tue, 11 Dec 2007 11:33:35 +0100, Jean-Louis C. Blouin wrote:
> To add in the technical ideas, would it be possible (in the excel
> results) to have an indication of the overall quality of sequencing. I
> know that we are dealing with hundreds of sequences. But for example
> having a (or couple) score saying that the substitution that is seen
> in 20 out of 400 sequences was of xx score would be very useful.

I have now pretty much completed a first analysis of the 160 exons, sequenced on a Solexa machine, that Jean-Louis sent me. A few of the analyses failed (memory exhausted, and maybe a few other computational problems), but I guess this is to be expected from a preliminary analysis. Among the ones that did work, there are probably a few with questionable results, particularly those that show a high rate of insertions.

Here is a sketch of the procedure I used.

The input data consists of:
- the reference DNA sequence of each exon, usually flanked by 20 nt from the surrounding introns
- the roughly 4.5 million 35-nt Solexa reads
- the NCBI reference human genome

For each exon, the following procedure is applied:
- locate the position of the exon in the reference genome using fetchGWI[1]
- shred the exon into overlapping 12-mers, and use the tagger[1] program to select all Solexa reads that contain a matching 12-mer. This gives a collection C of Solexa reads (oriented to match the strand of the exon)
- align all the reads from C to the exon using a semi-global alignment program named align0[2] (semi-global, because end gaps are not penalized)
- search all the reads from C for perfect matches in the reference genome using fetchGWI[1], and build a concatenated sequence from the reads that have a perfect match outside the position of the exon under consideration
- align all the reads from C to the concatenated sequence generated in the previous step, and discard the ones that match that concatenated sequence better than they match the exon
- use the remaining alignments to produce a multiple sequence alignment (MSA) of the remaining Solexa reads onto the exon, and write the summary results in an Excel-compatible format

[1] http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1894650
[2] Myers and Miller, CABIOS (1988) 4:11-17

I now have a ~5 MB tar file containing the MSAs in multiple-FASTA format and the .CSV spreadsheet files for 147 exons. I'm happy to try to send it by email to those who would like to see them (or to make it available through http / anon-ftp, but I'm not sure that is appropriate).

One thing that still puzzles me is the average coverage per nucleotide, which varies roughly from 200 to 1500, as can be seen from the attached file. The question is: is this due to the way the sequencing material was generated, or is it evidence of CNV?

Questions welcome...

Cheers,

Christian
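
For readers who want to experiment with the read selection and alignment steps of the procedure above, here is a minimal Python sketch. It is not the fetchGWI, tagger or align0 code. The function names (shred, select_reads, semiglobal_score) are hypothetical, and the scoring values are arbitrary assumptions. The alignment is a plain quadratic dynamic program with a linear gap cost, swapped in for the linear-space algorithm of Myers and Miller. It only illustrates what "end gaps are not penalized" means.

```python
from typing import List, Set

K = 12  # k-mer length mentioned in the e-mail

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (upper-case ACGTN)."""
    return seq.translate(_COMPLEMENT)[::-1]


def shred(seq: str, k: int = K) -> Set[str]:
    """Return the set of all overlapping k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def select_reads(exon: str, reads: List[str], k: int = K) -> List[str]:
    """Keep reads sharing at least one k-mer with the exon.

    Reads that only match on the opposite strand are reverse-complemented,
    so every returned read is oriented like the exon.
    """
    exon_kmers = shred(exon, k)
    selected = []
    for read in reads:
        if shred(read, k) & exon_kmers:
            selected.append(read)
        else:
            rc = reverse_complement(read)
            if shred(rc, k) & exon_kmers:
                selected.append(rc)
    return selected


def semiglobal_score(read: str, ref: str,
                     match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score of read against ref with free end gaps.

    Assumed scoring scheme (not align0's): linear gap cost, and gaps at
    either end of either sequence cost nothing.
    """
    m, n = len(read), len(ref)
    prev = [0] * (n + 1)          # row 0: leading gaps are free
    best = 0
    for i in range(1, m + 1):
        cur = [0] * (n + 1)       # column 0: leading gaps are free
        for j in range(1, n + 1):
            s = match if read[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        best = max(best, cur[n])  # read overhangs the end of ref
        prev = cur
    # read ends inside ref: trailing gap in the read is free
    return max(best, max(prev))
```

A candidate read would then be kept only if its score against the exon is at least as good as its score against the concatenated off-target sequence, for example semiglobal_score(read, exon) >= semiglobal_score(read, off_target), where off_target is a hypothetical variable holding that concatenated sequence.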