Data Data Data

Tuesday, September 30, 2014

Data for the social good fellow

Saturday, September 27, 2014

Blue takes on bioinformatics' data errors

Data error correction is something of a guessing game. Cyclic redundancy checks and other advanced algorithms seemingly tamed the erroneous tiger at the chip and disk level long ago. Does one have to rethink, revisit and refresh these and other approaches if more than incremental success is to be found in applications such as gene research, which pile up the data but can be prone to garbage in?

As the researchers say, gene sequencing costs have gone down drastically, but accuracy of data has only improved slowly. This looks like a job for better error correction.
Some Microsoft-backed researchers have come up with a somewhat oddly named ("Blue") software library for such a purpose. It has "proved to be generally more accurate than other published algorithms, resulting in more accurately aligned reads and the assembly of longer contigs containing fewer errors." They write:

One significant feature of Blue is that its k-mer consensus table does not have to be derived from the set of reads being corrected. This decoupling makes it possible to correct one dataset, such as small set of 454 mate-pair reads, with the consensus derived from another dataset, such as Illumina reads derived from the same DNA sample. Such cross-correction can greatly improve the quality of small (and expensive) sets of long reads, leading to even better assemblies and higher quality finished genomes.
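To make the decoupling idea concrete, here is a toy sketch in Python. It is not Blue's algorithm: the function names (`kmer_counts`, `correct_read`), the `min_count` threshold, and the greedy single-base substitution are all assumptions made for illustration. It only shows that the k-mer table can be built from one set of reads and then used to correct a read from another set.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across a set of reads (a stand-in for a consensus table)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, min_count=2):
    """Greedy toy correction: when a k-mer is rare in the table, try
    substituting its last base with the one that yields the best-supported k-mer.
    Limitation: errors in the first k-1 bases of the read are not repaired."""
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if counts[kmer] >= min_count:
            continue  # this k-mer is well supported; nothing to do
        prefix = kmer[:-1]
        best = max("ACGT", key=lambda b: counts[prefix + b])
        if counts[prefix + best] >= min_count:
            bases[i + k - 1] = best
    return "".join(bases)

# Hypothetical data: the table comes from one dataset, the corrected read from another.
short_reads = ["ACGTTGCATGCCGATAGCTA"] * 3   # e.g. plentiful short reads
table = kmer_counts(short_reads, k=5)
long_read = "ACGTTGCATGTCGATAGCTA"           # one substitution error at position 10
print(correct_read(long_read, table, k=5))   # ACGTTGCATGCCGATAGCTA
```

The point of the sketch is the separation of concerns: `kmer_counts` never sees `long_read`, so the table could just as well come from a different sequencing run of the same sample.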

In a paper published in Bioinformatics, Paul Greenfield, Research Group Leader, CSIRO, Division Computational Informatics, said test results show that Blue is significantly faster than other available tools—especially on Windows—and is also "more accurate as it recursively evaluates possible alternative corrections in the context of the read being corrected."

The mass of data related to the genome (not just the genome, by any means) continues to call for cutting-edge thinking. That thinking may come into play far beyond the old gene pool. - Jack Vaughan

Related
http://blogs.msdn.com/b/msr_er/archive/2014/09/02/a-new-tool-to-correct-dna-sequencing-errors-using-consensus-and-context.aspx

http://www.csiro.au/Outcomes/ICT-and-Services/Software/Blue.aspx

http://bioinformatics.oxfordjournals.org/content/early/2014/06/11/bioinformatics.btu368

Sunday, July 27, 2014

Two takes on Mongo

Take 1 - MongoDB arose out of the general 2000s movement that focused on agility in Web application development (with never-ending-changes-to-schemas replacing etched-in-stone do-not-touch-schemas). The proliferation of data formats calls for something like MongoDB, more than a few adventurous developers decided. A document database, it also rides the success of JSON (over XML), and deals with the bare necessities of state management. It also scales quite easily to humongous size – hence the fanciful name MongoDB. It is a style of data management that is behind the big data tide.

Take 2 - For my part, I like to think the name MongoDB hails from the good old days of Flash Gordon fighting Ming the Merciless, emperor of the Planet Mongo. My imagination for technology was honed on the Flash serials way back in the Sputnik days, when I'd watch Community Space Theatre Sunday mornings. These spectrally illumined moments are lost to the ages, like all the other signifying TV signals now in the far beyond. But I found a public domain radio days Flash Gordon transmission, and had some fun mixing it into a podcast report on my visit in June to MongoDB World in New York for SearchDataManagement.com


Monday, July 14, 2014

Me on Facebook Per Five Labs

Sunday, June 15, 2014

Data stalking, data talking, data tracking

Microsoft Research Visiting Prof Kate Crawford tells the data gatherers that 'everything is personal.' What do industry models of privacy assume? she asks. That individuals act like businesses trading info in a fictional frictionless market. This is a convenient extension, but what works for the powers that be on one level works differently for consumers. The convenient extension, thus, is an inconvenient untruth: the fix is in in the big data revolution. As Crawford writes: "Those who wield tools of data tracking and analytics have far more power than those who do not." In a way, the gap between the side arrayed to capitalize on big data and the side that is the data source could not be wider, or more disjointed.

Sunday, June 1, 2014

Will insurance get Googlized?

There is something ineffable about advertising, burdened as it is with psychology. Could that be true for insurance too? The latter has become an industry built on an edifice that is a lattice of perception of risk, but you could say it is built on psychology just as easily (actually, it is a little easier to say that). The industry's untapped opportunity is also a potential threat. It is standing there, waiting to be Googlized.

Tuesday, May 13, 2014

Pew data points on NSA surveillance and the public

Just caught up with a Pew Research Center/USA TODAY poll conducted in January that estimated overall approval of NSA surveillance had declined since last summer, when stories first broke based on Edward Snowden’s leaked information.... Democrats remain more supportive of the NSA surveillance program than Republicans, though support is down across party lines....While most of the public wants the government to pursue a criminal case against Snowden, young people offer the least support for his prosecution....

http://www.people-press.org/2014/01/20/obamas-nsa-speech-has-little-impact-on-skeptical-public/