Saturday, September 27, 2014

Blue takes on bioinformatics' data errors

Data error correction is something of a guessing game. Cyclic redundancy checks and other advanced algorithms seemingly tamed the erroneous tiger at the chip and disk level long ago. Does one have to rethink, revisit and refresh these and other approaches if incremental success is to be found in such applications as gene research, which pile up the data but can be prone to garbage in?

As the researchers say, gene sequencing costs have gone down drastically, but accuracy of data has only improved slowly. This looks like a job for better error correction.
Some Microsoft-backed researchers have come up with a somewhat oddly named software library ("Blue") for such a purpose. It has "proved to be generally more accurate than other published algorithms, resulting in more accurately aligned reads and the assembly of longer contigs containing fewer errors." They write:

One significant feature of Blue is that its k-mer consensus table does not have to be derived from the set of reads being corrected. This decoupling makes it possible to correct one dataset, such as small set of 454 mate-pair reads, with the consensus derived from another dataset, such as Illumina reads derived from the same DNA sample. Such cross-correction can greatly improve the quality of small (and expensive) sets of long reads, leading to even better assemblies and higher quality finished genomes.
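
To make the decoupling idea concrete, here is a toy sketch in Python. It is not Blue's code or its algorithm; it only illustrates the general notion of building a k-mer count table from one set of reads and using it to repair a read from a different set. The function names, the `min_count` threshold and the single-base substitution strategy are all assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"

def build_kmer_table(reads, k):
    """Count every k-mer seen in the 'consensus' reads (hypothetical helper)."""
    table = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            table[read[i:i + k]] += 1
    return table

def correct_read(read, table, k, min_count=2):
    """Repair weakly supported k-mers by trying single-base substitutions.

    The table may come from a different dataset than `read`, which is the
    point of the decoupling described above. This is an illustrative
    simplification, not the method from the paper.
    """
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if table[kmer] >= min_count:
            continue  # this window already agrees with the consensus
        best_count, best_pos, best_base = 0, None, None
        for pos in range(k):
            for base in BASES:
                if base == kmer[pos]:
                    continue
                count = table[kmer[:pos] + base + kmer[pos + 1:]]
                if count > best_count:
                    best_count, best_pos, best_base = count, pos, base
        # Apply the best-supported substitution, if any clears the threshold.
        if best_pos is not None and best_count >= min_count:
            bases[i + best_pos] = best_base
    return "".join(bases)

# Made-up data: three clean short reads supply the consensus,
# and one read from a "different" dataset carries a single bad base.
short_reads = ["ATGCCGTAAGCTTGAC"] * 3
table = build_kmer_table(short_reads, k=5)
noisy_long_read = "ATGCCGTCAGCTTGAC"
print(correct_read(noisy_long_read, table, k=5))
```

A real corrector has to cope with insertions, deletions, repeats and uneven coverage, none of which this sketch attempts.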

In a paper published in Bioinformatics, Paul Greenfield, Research Group Leader, CSIRO, Division Computational Informatics, said test results show that Blue is significantly faster than other available tools, especially on Windows, and is also "more accurate as it recursively evaluates possible alternative corrections in the context of the read being corrected."

The mass of data related to the genome (not just the genome, by any means) continues to call for cutting-edge thinking. That thinking may come into play far beyond the old gene pool. - Jack Vaughan

Related
http://blogs.msdn.com/b/msr_er/archive/2014/09/02/a-new-tool-to-correct-dna-sequencing-errors-using-consensus-and-context.aspx

http://www.csiro.au/Outcomes/ICT-and-Services/Software/Blue.aspx

http://bioinformatics.oxfordjournals.org/content/early/2014/06/11/bioinformatics.btu368