Sunday, November 9, 2014

Facebook imbroglio

Recently I wrote a story for SearchDataManagement that largely centered on one of this year's big data imbroglios: the Facebook-Cornell Emotional Contagion study. This was the topic of a Friday night symposium (Oct 8) capping the first day of the Conference on Digital Experimentation at MIT. You could sum up the germ of the story as: it is accepted practice to personalize web pages for financial purposes, and to fine-tune your methods for doing so, but could that overstep an ethical boundary?

On the MIT panel, but not covered in my brief story, was Jonathan Zittrain of Harvard Law School. For me, he put the contagion study into context, comparing and contrasting it with the Stanford prison experiment and the Tuskegee syphilis study, which came under scrutiny and helped lead the way to standards of conduct for psychological and medical experiments. "There ought to be a baseline protection," he said. There is a fiduciary responsibility, a need for a custodial and trusting relationship with subjects, that is at least an objective in scientific studies of humans.

Now, this responsibility, put forward by Zittrain and others, remains unstated and vague. Clearly, the Internet powers that be are ready to move on to other topics and let the Facebook experiment fade into the recesses, as even Snowden's NSA revelations have. I think a professional organization is needed – that sounds old school, I know, and I don't care. As with civil engineering, there is no need for each new generation to figure out for itself what counts as overstepping – no need to wait until the bridge collapses. – Jack Vaughan

Related

http://cyber.law.harvard.edu/people/jzittrain
http://searchdatamanagement.techtarget.com/opinion/Facebook-experiment-points-to-data-ethics-hurdles-in-digital-research
http://codecon.net/

Saturday, October 11, 2014

Calling Dr Data – Dr Null – Dr Data for Evidence-Based Medicine.

"Dr. Data" – likely for SEO reasons it has yet another name in its online version – asks if statistical analysis of databases of historical medical data can be more useful than clinical trial data for diagnosing patients. It recounts the story of Dr. Jennifer Frankovich, a Stanford Children's Hospital rheumatologist, who encountered the young girl symptoms of kidney failure, with possible lupus. Frankovich suspected blood clotting issues, but had to research the matter in order to convince colleagues.  Scientific literature comprising clinical trial data did not offer clues. Instead, Frankovich found evidence of connection of clotting and lupus, given certain circumstances, by searching a database of lupus patients that had been to this hospital over the last five years.

The story, by Veronique Greenwood, tells us Frankovich wrote of her experience in a letter to the New England Journal of Medicine, and was subsequently warned by her bosses not to run that kind of query again. Presumably HIPAA privacy concerns were involved.

It stands to reason that data on all the medical cases in any given hospital could be put to good use. It leads me to wonder: shouldn't the anonymization, or masking, of individuals' identities within such databanks be a priority? Is the hegemony of the clinical trial due to ebb, especially given the momentum of the World Wide Web?
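The masking itself is not exotic. As a rough illustration – my own sketch, not any hospital's actual practice – the Python snippet below replaces direct identifiers with a keyed hash and coarsens age into bands before a record lands in a research databank. The field names, the record and the secret key are all invented for the example.

    import hashlib
    import hmac

    # Secret key held by the data custodian, never stored with the research copy.
    # (Invented value, for illustration only.)
    PEPPER = b"replace-with-a-key-from-a-secrets-manager"

    def pseudonymize(value: str) -> str:
        """Replace a direct identifier with a stable but non-reversible token."""
        return hmac.new(PEPPER, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

    def mask_record(record: dict) -> dict:
        """Return a research-safe copy: identifiers tokenized, quasi-identifiers coarsened."""
        masked = dict(record)
        masked["patient_id"] = pseudonymize(record["patient_id"])
        masked.pop("name", None)                       # drop direct identifiers outright
        low = (record["age"] // 5) * 5                 # coarsen age into five-year bands
        masked["age_band"] = f"{low}-{low + 4}"
        masked.pop("age", None)
        return masked

    # Hypothetical record of the kind a clinician might query in aggregate.
    raw = {"patient_id": "MRN-004217", "name": "Jane Doe", "age": 13,
           "diagnosis": "lupus", "clotting_event": True}
    print(mask_record(raw))

The point is less this particular recipe than that the masking can happen before anyone asking Frankovich-style questions ever touches the data.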

Frankovich's work could come under the aegis of Evidence-Based Medicine. The expanded, Web-borne approach a-brewing here is sometimes called Digital Experimentation. – Jack Vaughan

Saturday, October 4, 2014

Bayesian Blues

Bayesian methods continue to gain attention as a better means to solve problems and predict outcomes. Google used such algorithms to tune its brilliant search engine, and for much more. Nate Silver carried the Bayesian chorus along with his depiction of the method in "The Signal and the Noise." Today, in fact, Bayesian thinking is broadly enough adopted for the New York Times to devote a feature, "The Odds, Continually Updated," to it in this week's Science Times section.

Thomas Bayes, writer F.D. Flam [I am not making this up] tells us, set out to calculate the probability of God's existence. This was back in the 18th century, in jolly old England. The math was difficult and, in practice, beyond the reach of calculation until the recent profusion of clustered computing power. Almost a MacGuffin in the narrative is John Aldridge, the Long Island fisherman who went overboard and whom the Coast Guard found in the Atlantic Ocean thanks to that service's use of Bayesian search methods.

"The Odds, Continually Updated'' places more import on the possibility that Bayesian statistics have narrowed down the possible correct answers for the age of the earth (from existing estimations that it was 8 B to 15 B years old, to conjectures that it is 13.8 B years old. - Jack Vaugahn


An extended version* of this piece would consider:
Who was Bayes and what is Bayesian math?
How does it compare to frequentist statistics (the main point of the story)? If frequentist methods were aptly applied, would they work in most cases? (A toy sketch of the contrast follows below.)
How does this all relate to the looming questions of how much we can trust data science (the smaller one) and how much we can trust science (the bigger one)?

*for now this site, like most blogs, comprises free ruminations – and sometimes you get what you pay for.
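In the spirit of those free ruminations, here is a toy Python sketch of the core "continually updated" idea: a beta-binomial model whose estimate sharpens with each new observation, shown next to the plain frequentist proportion. The prior and the data are invented for illustration.

    # Minimal Bayesian updating: beta-binomial model of a success rate,
    # shown against the frequentist point estimate. Data are invented.

    prior_a, prior_b = 1.0, 1.0                 # uniform Beta(1, 1) prior over the rate
    observations = [1, 0, 1, 1, 0, 1, 1, 1]     # 1 = success, 0 = failure

    a, b = prior_a, prior_b
    for i, x in enumerate(observations, start=1):
        a += x                                   # successes update the first parameter
        b += 1 - x                               # failures update the second
        posterior_mean = a / (a + b)             # Bayesian estimate after i observations
        frequentist = sum(observations[:i]) / i  # plain observed proportion so far
        print(f"after {i} observations: posterior mean {posterior_mean:.3f}, "
              f"frequentist estimate {frequentist:.3f}")

The Bayesian estimate starts at the prior's 0.5 and moves with the evidence; the frequentist proportion swings harder on small samples, which is much of what the comparison in the Times piece turns on.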

Saturday, September 27, 2014

Blue takes on bioinformatics' data errors

Data error correction is something of a guessing game. Cyclic redundancy checks and other advanced algorithms seemingly tamed the erroneous tiger at the chip and disk level long ago. Do these and other approaches have to be rethought, revisited and refreshed if incremental success is to be found in applications such as gene research, which pile up data but can be prone to garbage in?
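The low-level piece really is settled territory – a checksum test is a one-liner, as the small Python illustration below shows (the byte strings are invented); the hard part in genomics sits upstream, in the reads themselves.

    import zlib

    # CRC-32 detects corruption in a block of bytes; it is detection, not correction.
    payload = b"GATTACAGATTACAGATTACA"
    checksum = zlib.crc32(payload)

    corrupted = b"GATTACAGATTTCAGATTACA"        # one base flipped in transit
    print(zlib.crc32(payload) == checksum)      # True  - data intact
    print(zlib.crc32(corrupted) == checksum)    # False - corruption detected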

As the researchers say, gene sequencing costs have gone down drastically, but the accuracy of the data has improved only slowly. This looks like a job for better error correction.

Some Microsoft-backed researchers have come up with a somewhat oddly named software library ("Blue") for just such a purpose. It has "proved to be generally more accurate than other published algorithms, resulting in more accurately aligned reads and the assembly of longer contigs containing fewer errors." They write:

One significant feature of Blue is that its k-mer consensus table does not have to be derived from the set of reads being corrected. This decoupling makes it possible to correct one dataset, such as small set of 454 mate-pair reads, with the consensus derived from another dataset, such as Illumina reads derived from the same DNA sample. Such cross-correction can greatly improve the quality of small (and expensive) sets of long reads, leading to even better assemblies and higher quality finished genomes.

In a paper published in Bioinformatics, Paul Greenfield, research group leader in CSIRO's Computational Informatics division, said test results show that Blue is significantly faster than other available tools – especially on Windows – and is also "more accurate as it recursively evaluates possible alternative corrections in the context of the read being corrected."
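Blue's actual method is more involved than anything that fits in a blog post, but the underlying k-mer idea is easy to sketch. The toy Python below – my illustration, not Blue's code or CSIRO's – builds a k-mer count table from a pile of reads and flags reads containing rare k-mers, the usual first signal of a sequencing error. The reads, k and the threshold are invented.

    from collections import Counter

    # Toy k-mer error screening: rare k-mers are likely to contain sequencing errors.
    reads = ["ACGTACGTGA", "ACGTACGTGA", "ACGTACGTGA", "ACGTACCTGA"]  # last read has an error
    K = 5
    MIN_COUNT = 2   # k-mers seen fewer times than this are treated as suspect

    def kmers(read, k):
        return (read[i:i + k] for i in range(len(read) - k + 1))

    # Build the consensus table: k-mer -> number of times seen across all reads.
    table = Counter(km for read in reads for km in kmers(read, K))

    # Flag reads that contain any low-frequency k-mer.
    for read in reads:
        suspect = [km for km in kmers(read, K) if table[km] < MIN_COUNT]
        status = "suspect" if suspect else "ok"
        print(f"{read}: {status} {suspect if suspect else ''}")

Real correctors like Blue go further, proposing and scoring alternative corrections against the consensus table rather than just flagging trouble, but the counting step is the common starting point.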

The mass of data related to the genome (not just the genome, by any means) continues to call for cutting-edge thinking. That thinking may come into play far beyond the old gene pool. - Jack Vaughan

Related
http://blogs.msdn.com/b/msr_er/archive/2014/09/02/a-new-tool-to-correct-dna-sequencing-errors-using-consensus-and-context.aspx

http://www.csiro.au/Outcomes/ICT-and-Services/Software/Blue.aspx

http://bioinformatics.oxfordjournals.org/content/early/2014/06/11/bioinformatics.btu368

Sunday, July 27, 2014

Two takes on Mongo

Take 1 - MongoDB arose out of the general 2000s movement that focused on agility in Web application development (with never-ending-changes-to-schemas replacing etched-in-stone, do-not-touch schemas). The proliferation of data formats calls for something like MongoDB, more than a few adventurous developers decided. A document database, it also rides the success of JSON (over XML) and deals with the bare necessities of state management. And it scales quite easily to humongous size – hence the fanciful name MongoDB. It is a style of data management that is behind the big data tide.
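A minimal pymongo sketch of that schema flexibility – assuming a MongoDB server on localhost and field names invented for the example – shows two documents of different shapes landing in the same collection with no migration ceremony:

    from pymongo import MongoClient

    # Assumes a MongoDB server on localhost:27017 and the pymongo driver installed.
    client = MongoClient("mongodb://localhost:27017")
    events = client["blog_demo"]["events"]

    # Two JSON-style documents with different shapes, stored side by side --
    # no ALTER TABLE, no migration, just insert.
    events.insert_one({"user": "flash", "action": "login", "ts": "2014-07-27T10:00:00Z"})
    events.insert_one({"user": "ming", "action": "purchase", "items": ["ray gun"], "total": 99.95})

    print(events.find_one({"user": "ming"}))

That never-quite-settled schema is exactly the agility the Web crowd was after.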


Take 2 - For my part, I like to think the name MongoDB hails from the good old days of Flash Gordon fighting Ming the Merciless, emperor of the planet Mongo. My imagination for technology was honed on the Flash serials way back in the Sputnik days, when I'd watch Community Space Theatre on Sunday mornings. Those spectrally illumined moments are lost to the ages, like all the other signifying TV signals now in the far beyond. But I found a public-domain, radio-days Flash Gordon transmission and had some fun mixing it into a podcast report on my June visit to MongoDB World in New York for SearchDataManagement.com.

Left-click to play or right-click to download the Mongo podcast

Sunday, June 15, 2014

Data stalking, data talking, data tracking

Microsoft Research visiting professor Kate Crawford tells the data gatherers that "everything is personal." What do industry models of privacy assume? she asks. That individuals act like businesses, trading info in a fictional, frictionless market. This is a convenient extension, but what works for the powers that be on one level works differently for consumers. The convenient extension, thus, is an inconvenient untruth: in the big data revolution, the fix is in. As Crawford writes: "Those who wield tools of data tracking and analytics have far more power than those who do not." In a way, the gap between the side arrayed to capitalize on big data and the side that is the data source could not be wider, or more disjointed.

Sunday, June 1, 2014

Will insurance get Googlized?

There is something ineffable about advertising, burdened as it is with psychology. Could that be true of insurance too? The latter has become an industry built on an edifice that is a lattice of perceptions of risk, but you could just as easily say it is built on psychology (actually, a little more easily). The industry's untapped opportunity is also a potential threat. It is standing there, waiting to be Googlized.