Differences

This shows you the differences between two versions of the page.

--- fastq [2011/09/05 15:12] – heidi
+++ fastq [2011/09/19 16:19] (current) – heidi
@@ Line 10: / Line 10: @@
   * text based
   * no standard file extension, but .fq, .fastq, and .txt are commonly used.
@@ Line 17: / Line 21: @@
   * A FASTQ file normally uses four lines per sequence:
     * Line 1: begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line)
-    * Line 2: is the raw sequence letters
+    * Line 2: is the raw sequence letters (IUPAC nucleotide codes, including ambiguity codes: ACTGNURYSWKMBDHV)
     * Line 3: begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
-    * Line 4: encodes the quality values for the sequence in Line 2 and must contain the same number of symbols as letters in the sequence.
+    * Line 4: encodes the [[http://phd.chnebu.ch/index.php/Next-Generation_Sequencing_data|quality values]] for the sequence in Line 2 and must contain the same number of symbols as letters in the sequence.
+  * The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged because it complicates parsing: the marker characters "@" and "+" can also occur in the quality string.
+\\
+==== Example: ====
+<code>
+@SEQ_ID
+GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
++
+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
+</code>
+  * FASTQ files from the NCBI/EBI Sequence Read Archive often include a description:
+<code>
+@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
+GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
++SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
+IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
+</code>
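The four-line layout described above can be sketched as a small reader. This is a minimal illustration and not a reference parser: it assumes unwrapped records (exactly four lines each), and the names `FastqRecord` and `parse_fastq` are hypothetical, chosen only for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str   # text after '@' up to the first space
    description: str  # optional remainder of line 1 ('' if absent)
    sequence: str     # line 2: raw sequence letters
    quality: str      # line 4: one symbol per sequence letter


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from unwrapped (four-line) FASTQ text."""
    it = iter(lines)
    for raw in it:
        header = raw.rstrip("\r\n")
        if not header:
            continue  # assumption: tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")
            separator = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@"):
            raise ValueError("line 1 must begin with '@'")
        if not separator.startswith("+"):
            raise ValueError("line 3 must begin with '+'")
        # Line 3 may be a bare '+' or repeat the title from line 1.
        if separator[1:] and separator[1:] != header[1:]:
            raise ValueError("line 3 does not repeat the line 1 title")
        if len(quality) != len(sequence):
            raise ValueError("quality and sequence lengths differ")
        identifier, _, description = header[1:].partition(" ")
        yield FastqRecord(identifier, description, sequence, quality)


# Illustrative input: a shortened version of the first example record.
sample = """@SEQ_ID
GATTTGGGG
+
!''*((((*
"""
for record in parse_fastq(sample.splitlines()):
    print(record.identifier, record.sequence, record.quality)
```

Checking line positions rather than scanning for "@" or "+" sidesteps the ambiguity noted above, since both characters may appear in a quality string. The trade-off is that wrapped files are rejected or misread, so this sketch is only suitable for the common four-line form.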
 \\
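As a rough illustration of how the symbols on Line 4 map to numbers, the sketch below assumes the Sanger convention, in which each character's ASCII code minus 33 is a Phred quality score. The page itself does not state the offset, and other FASTQ variants use a different one, so `offset` is left as a parameter. The function names are hypothetical.

```python
from typing import List


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer scores.

    Assumption: Sanger-style encoding, score = ord(symbol) - 33.
    Pass a different offset for other FASTQ variants.
    """
    scores = [ord(symbol) - offset for symbol in quality]
    if any(score < 0 for score in scores):
        raise ValueError("symbol below the chosen offset; wrong encoding?")
    return scores


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# '!' is the lowest symbol in the Sanger range and 'I' a typical high one.
print(decode_qualities("!I9"))
```

Under this assumption the leading "!" in the first example record decodes to a score of 0, the lowest possible value.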