SOLEXA

e-mail:
Dear all,

On Tue, 11 Dec 2007 11:33:35 +0100, Jean-Louis C. Blouin wrote:
> To add to the technical ideas, would it be possible (in the Excel
> results) to have an indication of the overall quality of sequencing? I
> know that we are dealing with hundreds of sequences. But, for example,
> having a score (or a couple of scores) saying that the substitution seen
> in 20 out of 400 sequences had a score of xx would be very useful.

I have now more or less completed a first analysis of the 160 exons
that Jean-Louis sent me, which were sequenced on a Solexa machine.

A few of the analyses failed (memory exhausted, and maybe a few
other computational problems), but I guess this is to be expected from a
preliminary analysis.  Among the ones that did work, there are
probably a few with questionable results, particularly those that
show a high rate of insertions.

Here is a sketch of the procedure I used.  The input data consists of:
- the reference DNA sequence of each exon, usually flanked by 20 nt 
from the surrounding introns
- the roughly 4.5 million Solexa 35 nt reads
- the NCBI reference human genome

For each exon, the following procedure is applied:
- locate the position of the exon in the reference genome using
fetchGWI[1]
- shred it into overlapping 12-mers, and use the tagger[1] program to
select all Solexa reads that contain a matching 12-mer.  This gives a
collection C of Solexa reads (oriented to match the strand of the exon)
- align all the reads from C to the exon using a semi-global
alignment program named align0[2] (semi-global because end gaps
are not penalized)
- search all the reads from C for perfect matches in the
reference genome using fetchGWI[1], and create a concatenated sequence
from the ones that have a perfect match outside the position of the
exon under consideration on the reference genome
- align all the reads from C to the concatenated sequence
generated in the previous step, and discard the ones that match
that concatenated sequence better than they match the exon
- use the remaining global alignments to produce a multiple sequence
alignment of the remaining Solexa reads onto the exon, and produce the
summary results in an Excel-compatible format
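
To make the 12-mer selection step more concrete, here is a minimal
Python sketch.  It is not the tagger program itself, only an
illustration of the idea; the function names (reverse_complement,
shred, select_reads) are made up for this example, and it assumes the
reads fit in memory as plain strings over A, C, G, T and N.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def shred(seq, k=12):
    """Return the set of all overlapping k-mers of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def select_reads(exon, reads, k=12):
    """Keep the reads that share at least one k-mer with the exon.

    Reads that only match on the opposite strand are reverse-complemented,
    so every returned read is oriented like the exon.
    """
    exon_kmers = shred(exon, k)
    selected = []
    for read in reads:
        read = read.upper()
        if shred(read, k) & exon_kmers:
            selected.append(read)
        else:
            flipped = reverse_complement(read)
            if shred(flipped, k) & exon_kmers:
                selected.append(flipped)
    return selected
```

With 35 nt reads, a single shared 12-mer is a deliberately loose
criterion: it keeps reads carrying substitutions or small indels, at
the cost of also keeping reads from similar regions elsewhere in the
genome, which is why the later filtering steps are needed.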

[1] http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1894650
[2] Myers and Miller, CABIOS (1988) 4:11-17
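
For those not familiar with semi-global alignment, here is a small
Python sketch of the scoring idea (end gaps cost nothing, internal
gaps do).  It is not align0: it uses a simple linear gap penalty and
only returns the score, and the scoring values (match=1, mismatch=-1,
gap=-2) and the function name are assumptions chosen for illustration.

```python
def semiglobal_score(read, ref, match=1, mismatch=-1, gap=-2):
    """Best alignment score of read against ref with free end gaps.

    Row 0 and column 0 are initialised to zero, so unaligned sequence
    at the start of either string is free.  The result is the maximum
    over the last row and the last column, so unaligned sequence at the
    end of either string is free as well.
    """
    n, m = len(read), len(ref)
    prev = [0] * (m + 1)          # row 0: skipping a prefix of ref is free
    best = prev[m]                # last column of row 0
    for i in range(1, n + 1):
        curr = [0] * (m + 1)      # column 0: skipping a prefix of read is free
        for j in range(1, m + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + step,   # align read[i-1] with ref[j-1]
                          prev[j] + gap,        # gap in ref
                          curr[j - 1] + gap)    # gap in read
        best = max(best, curr[m])  # read runs past the end of ref
        prev = curr
    return max(best, max(prev))    # ref runs past the end of read


# A read is kept only if it scores at least as well on the exon as on
# the concatenated "elsewhere" sequence (both names are hypothetical).
def keep_read(read, exon, elsewhere):
    return semiglobal_score(read, exon) >= semiglobal_score(read, elsewhere)
```

The keep_read helper mirrors the discard step described above under
the same toy scoring; how ties are handled in the real pipeline is not
specified here, so keeping them is just one possible choice.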

I now have a ~5 MB tar file containing the MSAs in multi-FASTA format
and the .csv spreadsheet files for 147 exons.  I'm happy to try to send
it by email to those who would like to see it (or make it available
through HTTP / anonymous FTP, but I'm not sure that is appropriate).

One thing that is still puzzling me is the average coverage per
nucleotide, which varies from roughly 200 to 1500, as can be seen from
the attached file.  The question is: is this due to the way the
sequencing material was generated, or is it evidence of copy number
variation (CNV)?
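
To be explicit about what "average coverage per nucleotide" means
here, the following Python sketch shows one way to compute it.  It
assumes each retained read has been reduced to a half-open interval
(start, end) of exon positions it covers; that representation and the
function names are assumptions for illustration, not the actual code
behind the attached file.

```python
def coverage_per_nucleotide(exon_length, intervals):
    """Return the read depth at each exon position.

    intervals holds half-open (start, end) pairs in exon coordinates;
    parts of a read that hang off either end of the exon are clipped.
    """
    diff = [0] * (exon_length + 1)
    for start, end in intervals:
        start = max(start, 0)
        end = min(end, exon_length)
        if start < end:
            diff[start] += 1      # depth goes up where the read starts
            diff[end] -= 1        # and back down just past its end
    depth = []
    running = 0
    for i in range(exon_length):
        running += diff[i]
        depth.append(running)
    return depth


def mean_coverage(depth):
    """Average depth over all positions (0.0 for an empty exon)."""
    return sum(depth) / len(depth) if depth else 0.0
```

Comparing the mean depth between exons, rather than the raw read
counts, removes the effect of exon length, so the 200 to 1500 spread
is not simply due to some exons being longer than others.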

Questions welcome...

Cheers,
                    Christian