Preprocessing and Storing High-Throughput Sequencing Data
Świercz Aleksandra 1,2,*, Bosak Bartosz 3, Chłopkowski Marek 1, Hoffa Arkadiusz 1, Kasprzak Marta 1,2, Kurowski Krzysztof 3, Piontek Tomasz 3, Błażewicz Jacek 1,2
1 Institute of Computing Science
Poznan University of Technology, ul. Piotrowo 2, Pozna ́ n, Poland2 Institute of Bioorganic Chemistry
Polish Academy of Sciences, ul. Noskowskiego 12/14, Pozna ́ n, Poland3 Poznań Supercomputing and Networking Center
ul. Noskowskiego 10, Pozna ́ n, Poland∗E-mail: aswiercz@cs.put.poznan.pl
DOI: 10.12921/cmst.2014.20.01.9-20
Abstract:
DNA sequencing is a process of recognizing DNA sequences of genomes. The process consists in reading short sequences, that are subsequences of a genome, and merging them into longer sequences, preferably the whole genome. In the first phase even billions of short sequences are read at once. To simplify and speed up the second phase, we develop a pipeline of preprocessing the initial set of short sequences that is removing low quality reads and duplicated reads. We also propose a method for preliminary joining overlapping sequences, which resulted in decreasing the cardinality of initial sets to 13.9% and 27.8%. We also examine possible ways to store the huge amount of experimental data. We compare different compression methods, from which the best appeared to be DSRC, developed for special type of text files containing short sequences and their quality. We test the parameters for TCP data transferring to find the best transfer rate.
Key words:
compression, DNA sequencing, preprocessing, transfer optimization
References:
[1] T. Jiang, M. Li, DNA sequencing and string learning, Math-
ematical Systems Theory 29, 387-405 (1996).
[2] J.C. Dohm, C. Lottaz, T. Borodina, H. Himmelbauer,
SHARCGS, a fast and highly accurate short-read assembly
algorithm for de novo genomic sequencing, Genome Res 17,
1697-1706 (2007).
[3] D.W. Bryant, W.K. Wong Jr., T.C. Mockler, QSRA: a qua-
lity-value guided de novo short read assembler, BMC Bioin-
formatics 10, 69 (2009).
[4] R.L. Warren, G.G. Sutton, S.J. Jones, R.A. Holt, Assembling
millions of short DNA sequences using SSAKE, Bioinfor-
matics 23, 500-501 (2007).
[5] R. Li, W. Fan, G. Tian, H. Zhu, L. He, et al., The sequence
and de novo assembly of the giant panda genome. Nature
463, 311-317 (2010).
[6] J.T. Simpson, K. Wong, S.D. Jackman, J.E. Schein,
S.J. Jones, et al., ABySS: a parallel assembler for short read
sequence data, Genome Res 19, 1117-1123 (2009).
[7] Li R, Zhu H, Ruan J, Qian W, Fang X, et al., De novo as-
sembly of human genomes with massively parallel short read
sequencing, Genome Res 20, 265-272 (2010).
[8] D.R. Zerbino, E. Birney, Velvet: algorithms for de novo short
read assembly using de Bruijn graphs, Genome Res 18, 821-
829 (2008).
[9] J. Blazewicz, M. Figlerowicz, P. Gawron, M. Kasprzak,
E. Kirton, D. Platt, A. Swiercz, L. Szajkowski, Whole
genome assembly from 454 sequencing output via modified
DNA graph concept, Computational Biology and Chemistry
33, 224-230 (2009).
[10] T.F. Smith, M.S. Waterman, Identification of common mole-
cular subsequences, Journal of Molecular Biology 147, 195-
197 (1981).
[11] http://www.bioinformatics.nl/tools/crab_fasta.html.
[12] P.J.A. Cock, C.J. Fields, N. Goto, M.L. Heuer, P.M. Rice.
The Sanger FASTQ file format for sequences with quality
scores, and the Solexa/Illumina FASTQ variants, Nucleic
Acids Res 38, 1767-1771 (2010).
[13] D. Merrill, A. Grimshaw, Revisiting sorting for GPGPU
stream architectures, Technical Report CS2010-03, Univer-
sity of Virginia, Department of Computer Science, Char-
lottesville, VA, USA, (2010).
[14] Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra).
[15] Escherichia coli str. K-12 substr. MG1655, complete
genome (http://www.ncbi.nlm.nih.gov/nuccore/
NC_000913.2).
[16] WormBase, ftp://ftp.wormbase.org/pub/wormbase/releases/
WS230/species/c_elegans/ c_elegans.WS230.genomic.fa.gz.
[17] K. Yook, T.W. Harris, T. Bieri, et al., WormBase 2012: More
genomes, more data, new website, Nucleic Acids Research
40, D735-D741 (2012).
[18] B. Langmead, S. Salzberg, Fast gapped-read alignment with
Bowtie 2, Nature Methods 9, 357-359 (2012).
[19] L. Ilie, F. Fazayeli, S. Ilie, HiTEC: accurate error correc-
tion in high-throughput sequencing data, Bioinformatics 27,
295-302 (2011).
[20] W.C. Kao, A.H. Chan, Y.S. Song, ECHO: a reference-
free short-read error correction algorithm, Genome Res 21,
1181-1192 (2011).
[21] J.T. Simpson, R. Durbin, Efficient de novo assembly of large
genomes using compressed data structures, Genome Res 22,
549-556 (2012).
[22] J. Ziv, A. Lempel, A Universal Algorithm for Sequential
Data Compression, IEEE Transactions on Information The-
ory 23, 337-343 (1977).
[23] J. Ziv, A. Lempel, Compression of individual sequences
via variable rate coding, IEEE Transactions on Information
Theory 24, 530-535 (1978).
[24] Deflate and inflate algorithms (http://www.gzip.org/
algorithm.txt ).
[25] J-L Gailly, GNU gzip, 2013 (http://www.gnu.org/software/
gzip/manual/gzip.pdf).
[26] P. Deutsch, J-L. Gailly, ZLIB Compressed Data Format
Specification version 3.3, 1996
(http://tools.ietf.org/pdf/rfc1950.pdf).
[27] P. Deutsch, DEFLATE Compressed Data Format Specifica-
tion version 1.3, 1996 (http://tools.ietf.org/pdf/rfc1951.pdf).
[28] P. Deutsch, GZIP file format specification version 4.3, 1996
(http://tools.ietf.org/pdf/rfc1952.pdf).
[29] M. Burrows, D.J. Wheeler, A block sorting lossless data
compression algorithm, Technical Report 124, Digital
Equipment Corporation. (1994).
[30] D.A. Huffman, A Method for the Construction of Minimum-
Redundancy Codes, Proceedings of the I.R.E., 1098-1102
(1952).
[31] 7-zip file archiver (http://www.7-zip.org/).
[32] S. Deorowicz, S. Grabowski, Compression of DNA sequence
reads in FASTQ format, Bioinformatics 27, 860-862 (2011).
[33] E. Grassi, F.D. Gregorio, I Molineris. KungFQ: A Simple
and Powerful Approach to Compress fastq Files, IEEE/ACM
Transactions on Computational Biology and Bioinformatics
9, pp. 1837-1842 (2012).
[34] M. Brzezniak, N. Meyer, R. Mikołajczak, G. Jankowski, M.
Jankowski. Popular Backup/Archival Service and its Appli-
cation for the Archival of the Network Traffic in the PI-
ONIER Academic Network, CMST Special Issue, 109-118
(2010).
[35] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Du-
mitrescu, I. Raicu, I. Foster, The Globus Striped GridFTP
Framework and Server, SC’05, ACM Press, (2005).
[36] J. Semke, M. Jamshid, M. Matthew, Automatic TCP buffer
tuning. ACM SIGCOMM Computer Communication Re-
view 28.4, 315-323 (1998).
[37] Enabling High Performance Data Transfers, at Pittsburg Su-
percomputing Center, (http://www.psc.edu/index.php/
networking/641-tcp-tune#Linux).
[38] GT 5.0.2 GridFTP (http://www.globus.org/toolkit/docs/
5.0/5.0.2/data/gridftp/).
DNA sequencing is a process of recognizing DNA sequences of genomes. The process consists in reading short sequences, that are subsequences of a genome, and merging them into longer sequences, preferably the whole genome. In the first phase even billions of short sequences are read at once. To simplify and speed up the second phase, we develop a pipeline of preprocessing the initial set of short sequences that is removing low quality reads and duplicated reads. We also propose a method for preliminary joining overlapping sequences, which resulted in decreasing the cardinality of initial sets to 13.9% and 27.8%. We also examine possible ways to store the huge amount of experimental data. We compare different compression methods, from which the best appeared to be DSRC, developed for special type of text files containing short sequences and their quality. We test the parameters for TCP data transferring to find the best transfer rate.
Key words:
compression, DNA sequencing, preprocessing, transfer optimization
References:
[1] T. Jiang, M. Li, DNA sequencing and string learning, Math-
ematical Systems Theory 29, 387-405 (1996).
[2] J.C. Dohm, C. Lottaz, T. Borodina, H. Himmelbauer,
SHARCGS, a fast and highly accurate short-read assembly
algorithm for de novo genomic sequencing, Genome Res 17,
1697-1706 (2007).
[3] D.W. Bryant, W.K. Wong Jr., T.C. Mockler, QSRA: a qua-
lity-value guided de novo short read assembler, BMC Bioin-
formatics 10, 69 (2009).
[4] R.L. Warren, G.G. Sutton, S.J. Jones, R.A. Holt, Assembling
millions of short DNA sequences using SSAKE, Bioinfor-
matics 23, 500-501 (2007).
[5] R. Li, W. Fan, G. Tian, H. Zhu, L. He, et al., The sequence
and de novo assembly of the giant panda genome. Nature
463, 311-317 (2010).
[6] J.T. Simpson, K. Wong, S.D. Jackman, J.E. Schein,
S.J. Jones, et al., ABySS: a parallel assembler for short read
sequence data, Genome Res 19, 1117-1123 (2009).
[7] Li R, Zhu H, Ruan J, Qian W, Fang X, et al., De novo as-
sembly of human genomes with massively parallel short read
sequencing, Genome Res 20, 265-272 (2010).
[8] D.R. Zerbino, E. Birney, Velvet: algorithms for de novo short
read assembly using de Bruijn graphs, Genome Res 18, 821-
829 (2008).
[9] J. Blazewicz, M. Figlerowicz, P. Gawron, M. Kasprzak,
E. Kirton, D. Platt, A. Swiercz, L. Szajkowski, Whole
genome assembly from 454 sequencing output via modified
DNA graph concept, Computational Biology and Chemistry
33, 224-230 (2009).
[10] T.F. Smith, M.S. Waterman, Identification of common mole-
cular subsequences, Journal of Molecular Biology 147, 195-
197 (1981).
[11] http://www.bioinformatics.nl/tools/crab_fasta.html.
[12] P.J.A. Cock, C.J. Fields, N. Goto, M.L. Heuer, P.M. Rice.
The Sanger FASTQ file format for sequences with quality
scores, and the Solexa/Illumina FASTQ variants, Nucleic
Acids Res 38, 1767-1771 (2010).
[13] D. Merrill, A. Grimshaw, Revisiting sorting for GPGPU
stream architectures, Technical Report CS2010-03, Univer-
sity of Virginia, Department of Computer Science, Char-
lottesville, VA, USA, (2010).
[14] Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra).
[15] Escherichia coli str. K-12 substr. MG1655, complete
genome (http://www.ncbi.nlm.nih.gov/nuccore/
NC_000913.2).
[16] WormBase, ftp://ftp.wormbase.org/pub/wormbase/releases/
WS230/species/c_elegans/ c_elegans.WS230.genomic.fa.gz.
[17] K. Yook, T.W. Harris, T. Bieri, et al., WormBase 2012: More
genomes, more data, new website, Nucleic Acids Research
40, D735-D741 (2012).
[18] B. Langmead, S. Salzberg, Fast gapped-read alignment with
Bowtie 2, Nature Methods 9, 357-359 (2012).
[19] L. Ilie, F. Fazayeli, S. Ilie, HiTEC: accurate error correc-
tion in high-throughput sequencing data, Bioinformatics 27,
295-302 (2011).
[20] W.C. Kao, A.H. Chan, Y.S. Song, ECHO: a reference-
free short-read error correction algorithm, Genome Res 21,
1181-1192 (2011).
[21] J.T. Simpson, R. Durbin, Efficient de novo assembly of large
genomes using compressed data structures, Genome Res 22,
549-556 (2012).
[22] J. Ziv, A. Lempel, A Universal Algorithm for Sequential
Data Compression, IEEE Transactions on Information The-
ory 23, 337-343 (1977).
[23] J. Ziv, A. Lempel, Compression of individual sequences
via variable rate coding, IEEE Transactions on Information
Theory 24, 530-535 (1978).
[24] Deflate and inflate algorithms (http://www.gzip.org/
algorithm.txt ).
[25] J-L Gailly, GNU gzip, 2013 (http://www.gnu.org/software/
gzip/manual/gzip.pdf).
[26] P. Deutsch, J-L. Gailly, ZLIB Compressed Data Format
Specification version 3.3, 1996
(http://tools.ietf.org/pdf/rfc1950.pdf).
[27] P. Deutsch, DEFLATE Compressed Data Format Specifica-
tion version 1.3, 1996 (http://tools.ietf.org/pdf/rfc1951.pdf).
[28] P. Deutsch, GZIP file format specification version 4.3, 1996
(http://tools.ietf.org/pdf/rfc1952.pdf).
[29] M. Burrows, D.J. Wheeler, A block sorting lossless data
compression algorithm, Technical Report 124, Digital
Equipment Corporation. (1994).
[30] D.A. Huffman, A Method for the Construction of Minimum-
Redundancy Codes, Proceedings of the I.R.E., 1098-1102
(1952).
[31] 7-zip file archiver (http://www.7-zip.org/).
[32] S. Deorowicz, S. Grabowski, Compression of DNA sequence
reads in FASTQ format, Bioinformatics 27, 860-862 (2011).
[33] E. Grassi, F.D. Gregorio, I Molineris. KungFQ: A Simple
and Powerful Approach to Compress fastq Files, IEEE/ACM
Transactions on Computational Biology and Bioinformatics
9, pp. 1837-1842 (2012).
[34] M. Brzezniak, N. Meyer, R. Mikołajczak, G. Jankowski, M.
Jankowski. Popular Backup/Archival Service and its Appli-
cation for the Archival of the Network Traffic in the PI-
ONIER Academic Network, CMST Special Issue, 109-118
(2010).
[35] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Du-
mitrescu, I. Raicu, I. Foster, The Globus Striped GridFTP
Framework and Server, SC’05, ACM Press, (2005).
[36] J. Semke, M. Jamshid, M. Matthew, Automatic TCP buffer
tuning. ACM SIGCOMM Computer Communication Re-
view 28.4, 315-323 (1998).
[37] Enabling High Performance Data Transfers, at Pittsburg Su-
percomputing Center, (http://www.psc.edu/index.php/
networking/641-tcp-tune#Linux).
[38] GT 5.0.2 GridFTP (http://www.globus.org/toolkit/docs/
5.0/5.0.2/data/gridftp/).