Wednesday, December 24, 2014

Talking Data: Protecting online data privacy was the big 2014 trend

The last Talking Data Podcast of 2014 is a bit of a walk down Twitter Lane with the data privacy issues of 2014. Ed Burns and I discuss a Pew poll that looks at the attitudes of Americans toward online privacy. They aren't comfortable with their data sharing, but they sure do like those social infrastructure services gratis. We also take up Uber and its loping missteps on the way to killing off the hackney, as the cab was once known, and what that means for data professionals. There is more, including some discussion of the Hortonworks IPO. On Christmas Eve my version was truncated. Might wait until the Christmas smoke clears to sort through that. - Jack Vaughan

Tuesday, December 23, 2014

That was the year that was: big data a la Hadoop, NoSQL


The year 2014 saw progress in big data architecture development and deployment, as users gained more experience with NoSQL alternatives to relational databases, and Hadoop 2 gained traction for operational analytics uses beyond the distributed processing framework's original batch processing role. Those trends were detailed in a variety of SearchDataManagement pieces. Big data in review: 2014



The roots of machine learning

Neural networks and artificial intelligence have been on my mind over many years, even while I spent most days studying middleware, something different. At the heart of neural networks were backward propagation and feedback loops – all somewhat related to cybernetics, which flowered from the 1940s into the early 1970s. One of the first implementations of cybernetics was the thermostat.
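The cybernetic feedback loop the thermostat embodies is simple enough to sketch in a few lines. This is a toy simulation I've put together for illustration – the names (thermostat_step, hysteresis) and the crude room dynamics are my own invention, not a model of any real Honeywell device.

```python
# A minimal sketch of the negative-feedback loop at the heart of a
# thermostat: measure, compare against a setpoint, act, repeat.

def thermostat_step(temp, setpoint, hysteresis=0.5):
    """One control cycle: True = furnace on, False = off, None = no change."""
    if temp < setpoint - hysteresis:
        return True          # too cold: turn the furnace on
    if temp > setpoint + hysteresis:
        return False         # too warm: turn it off
    return None              # inside the dead band: leave state alone

# Crude room dynamics: the furnace adds heat, the room leaks it.
temp, heating = 15.0, False
for _ in range(50):
    decision = thermostat_step(temp, setpoint=20.0)
    if decision is not None:
        heating = decision
    temp += 0.4 if heating else -0.2

print(round(temp, 1))  # oscillates in a narrow band around the setpoint
```

The hysteresis band is the part the old mercury-and-coil hardware got for free: without it, the furnace would chatter on and off at every reading.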

In early 2013 I started working at SearchDataManagement, writing about big data. At the end of this year I have devoted some time to book learning about machine learning. A lot happened while Rip Van Vaughan was catching some z's. So something told me to go back to one of my old blogs and see where I left off with feedback. If you pick through it you will find Honeywell and the thermostat, the automatic pilot, etc. My research told me the first autopilot arose from a combination of the thermostat (Honeywell) and advanced gyroscopes (Sperry).

I spent hours looking at the thermostat, its mercury, its coil. It had an alchemical effect. I remember wondering if the thermostat could be a surveillance bug. Now we have Nest, which uses the thermostat as a starting point for collecting data for machine learning processes. - Jack Vaughan

[It is funny how the old arguments about the autopilot appeared as memes in this year of machine learning. This link, which includes Tom Wolfe's mirthful take on the autopilot in The Right Stuff, is here rather as a place-marker for background: The Secret Museum of Cybernetics - JackVaughan's Radio Weblog, March 2004 (also reposted on Moon Traveller with a slew of feedback errata). Probably valuable to cite Nicholas Carr's The Glass Cage, published this year, which takes as its premise society's growing inabilities, many brought on by automation. Several serious airplane crashes in which pilots' skills seemed overly lulled by automation form a showcase in The Glass Cage.]

From Wolfe’s The Right Stuff:

“Engineers were ... devising systems for guiding rockets into space, through the use of computers built into the engines and connected to accelerometers for monitoring the temperature, pressure, oxygen supply, and other vital conditions of the Mercury capsule and for triggering safety procedures automatically -- meaning they were creating with computers, systems in which machines could communicate with one another, make decisions, take action, all with tremendous speed and accuracy ... Oh, genius engineers!”

Wednesday, December 17, 2014

AI re-emergence : Study to Examine Effects of Artificial Intelligence


Able New York Times technology writer John Markoff (he has been by far the star of my RJ-11 blog) had two of the three (count 'em, three) AI articles in the Dec. 16 Times. One discusses Paul Allen's AI2 institute work; the other discusses a study being launched at Stanford with the goal of looking at how technology reshapes the roles of humans. Dr. Eric Horvitz of Microsoft Research will lead a committee with Russ Altman, a Stanford professor of bioengineering and computer science. The committee will include Barbara J. Grosz, a Harvard University computer scientist; Yoav Shoham, a professor of computer science at Stanford; Tom Mitchell, the chairman of the machine learning department at Carnegie Mellon University; Alan Mackworth, a professor of computer science at the University of British Columbia; and Deirdre K. Mulligan, a lawyer and a professor in the School of Information at the University of California, Berkeley. The last, Mulligan, is the only one who, on some cursory Googling, immediately appears ready to accept that there are some potential downsides to the AI re-emergence. It looks like Horvitz has an initial thesis formed ahead of the committee work. Based on a TED presentation ("Making friends with AI"), it is that, while he understands some people's issues with AI, the methods of AI will come to support people's decisions in a nurturing way. The theme is borne out further if we look at the conclusion of an earlier Horvitz-organized study on AI's ramifications (that advances were largely positive and progress relatively graceful). Let's hope the filters the group implements tone down the rose-colored learning machine that enforces academics' best hopes. – Jack Vaughan



Sunday, November 9, 2014

Facebook imbroglio

Recently I wrote a story for SearchDataManagement that largely centered on one of this year's big data imbroglios: the Facebook-Cornell Emotional Contagion study. This was the topic at a Friday night symposium (Oct 8) capping the first day of the Conference on Digital Experimentation at MIT. You could sum up the germ of the story as: it is okay to personalize web pages with financial purpose, and to fine-tune your methods thereof, but could that overstep an ethical boundary?

Among the MIT panelists, but not covered in my brief story, was Jonathan Zittrain of Harvard University Law. For me, he put the contagion study into context, comparing and contrasting it to the Stanford prison experiment and the Tuskegee studies, which came under scrutiny and helped lead the way to some standards of decorum for psychological and medical experiments. "There ought to be a baseline protection," he said. There is a fiduciary responsibility, a need for a custodial and trusting relationship with subjects, that is at least an objective in scientific studies of humans.

Now, this responsibility, forwarded by Zittrain and others, remains unstated and vague. Clearly, the Internet powers that be are ready to move on to other topics, and to let the Facebook experiment fade into the recesses, as even Snowden's NSA revelations have. I think a professional organization is needed – that sounds old school, I know, and I don't care. As with civil engineering, there is no need for a new generation to figure out what is overstepping by waiting until the bridge collapses. – Jack Vaughan

Related

http://cyber.law.harvard.edu/people/jzittrain
http://searchdatamanagement.techtarget.com/opinion/Facebook-experiment-points-to-data-ethics-hurdles-in-digital-research
http://codecon.net/

Saturday, October 11, 2014

Calling Dr Data-Dr Null-Dr Data for Evidence-Based Medicine.

"Dr. Data" – likely for SEO reasons it has yet another name in its online version – asks if statistical analysis of databases of historical medical data can be more useful than clinical trial data for diagnosing patients. It recounts the story of Dr. Jennifer Frankovich, a Stanford Children's Hospital rheumatologist, who encountered a young girl with symptoms of kidney failure and possible lupus. Frankovich suspected blood clotting issues, but had to research the matter in order to convince colleagues. Scientific literature comprising clinical trial data did not offer clues. Instead, Frankovich found evidence of a connection between clotting and lupus, given certain circumstances, by searching a database of lupus patients who had been to the hospital over the last five years.

The story by Veronique Greenwood tells us Frankovich wrote of her experience in a letter to the New England Journal of Medicine, and was subsequently warned by her bosses not to do that kind of query again. Presumably HIPAA privacy concerns are involved.

It stands to reason that data on all the medical cases in any given hospital could have some good use. It leads me to wonder: Shouldn't the anonymization or masking of individuals' identities within such databanks be a priority? Is the hegemony of the clinical trial era due to ebb, especially when taking into account the momentum of the World Wide Web?
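One common building block for that kind of masking is keyed pseudonymization: replace the direct identifier with a salted, keyed hash so records for the same patient can still be linked, without revealing who the patient is. The sketch below is mine, with hypothetical field names – and note that real HIPAA de-identification demands much more than this (dates, zip codes and other quasi-identifiers also leak identity).

```python
# A sketch of keyed pseudonymization: same patient -> same token,
# but not reversible without the secret key.
import hashlib
import hmac

SECRET_KEY = b"keep-this-out-of-the-database"  # held by a data custodian

def pseudonymize(patient_id: str) -> str:
    """Deterministic, keyed token standing in for a medical record number."""
    return hmac.new(SECRET_KEY, patient_id.encode(),
                    hashlib.sha256).hexdigest()[:16]

record = {"patient_id": "MRN-884213", "diagnosis": "lupus", "clotting": True}
masked = {**record, "patient_id": pseudonymize(record["patient_id"])}

# The clinical fields survive for research queries; the identity does not.
assert masked["patient_id"] != record["patient_id"]
assert pseudonymize("MRN-884213") == masked["patient_id"]  # still linkable
```

The design choice worth noting is the HMAC rather than a plain hash: an unkeyed hash of a medical record number can be reversed by simply hashing every plausible number and comparing.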

Frankovich's work could come under the aegis of Evidence-Based Medicine. The expanded Web-borne approach a-brewing here is sometimes called Digital Experimentation. –Jack Vaughan

Saturday, October 4, 2014

Bayesian Blues

Bayesian methods continue to gain attention as a better means to solve problems and predict outcomes. It was Google that used such algorithms to tune its brilliant search engine, and much more. Nate Silver carried the Bayesian chorus along with his depiction of the method in "The Signal and the Noise." Today, in fact, Bayesian thinking is broadly enough adopted for the New York Times to devote a feature entitled "The Odds, Continually Updated" to it in this week's Science Times section.

Thomas Bayes, writer F.T. Flam [I am not making this up] says, set out to calculate the probability of God's existence. This was back in the 18th Century in jolly old England. The math was difficult and really beyond the ken of calculation – until the recent profusion of clustered computing power. Almost a MacGuffin in the narrative is the overboard Long Island fisherman John Aldridge, whom the Coast Guard found in the Atlantic Ocean thanks to that service's use of Bayesian methods.

"The Odds, Continually Updated" places more import on the possibility that Bayesian statistics have narrowed down the possible correct answers for the age of the universe (from existing estimates of 8 billion to 15 billion years old, to the conjecture that it is 13.8 billion years old). - Jack Vaughan
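The machinery behind both the cosmology and the rescue is the same one-line theorem: start with a prior, fold in evidence, get a posterior, repeat. Here is a toy update in the search-and-rescue flavor; the numbers are invented, not from the Coast Guard's actual models.

```python
# Bayes' rule: P(H|E) = P(E|H) P(H) / P(E), with
# P(E) = P(E|H) P(H) + P(E|~H) P(~H).
def bayes_update(prior, likelihood, likelihood_given_not):
    """Posterior probability of hypothesis H after observing evidence E."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothesis: the fisherman is in grid cell A (prior belief 50%).
# Evidence: a sweep of cell A that would spot him 80% of the time saw nothing,
# so P(no sighting | in A) = 0.2 and P(no sighting | not in A) = 1.0.
p = 0.5
p = bayes_update(p, likelihood=0.2, likelihood_given_not=1.0)
print(round(p, 3))  # 0.167 -- belief in cell A drops; search effort shifts
```

Each fruitless sweep lowers the odds for the searched cell and, by renormalization, raises them everywhere else, which is exactly the "continually updated" character the Times headline points at.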


An extended version* of this piece would consider:
Who was Bayes and what is Bayesian math?
How does it compare to frequentist statistics (the main point of the story)? If frequentist methods were aptly applied, would they work in most cases?
How does this all relate to the looming questions of how much we can trust data science (smaller) and how much we can trust science (bigger)?

*for now this site like most blogs comprises free ruminations - and sometimes you get what you pay for.

Saturday, September 27, 2014

Blue takes on bioinformatics' data errors

Data error correction is something of a guessing game. Cyclic redundancy checks and other advanced algorithms seemingly tamed the erroneous tiger at the chip and disk level long ago. Does one have to rethink, revisit and refresh these and other approaches if incremental success is to be found in such applications as gene research, which piles up the data but can be prone to garbage in?
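For readers who have never met one, the CRC mentioned above fits in a few lines of Python's standard library: it is a checksum that changes if any bit of the payload flips, so corruption can be detected – though not, by itself, corrected, which is the harder problem the gene researchers face.

```python
# Detecting a single flipped bit with a cyclic redundancy check.
import binascii

payload = b"GATTACA" * 1000            # stand-in for a block of stored data
crc = binascii.crc32(payload)          # 32-bit checksum stored alongside it

corrupted = bytearray(payload)
corrupted[123] ^= 0x04                 # flip one bit in transit or at rest
assert binascii.crc32(bytes(corrupted)) != crc   # the check catches it
```

A CRC only says "something is wrong here"; turning that into "and here is the right value" takes redundancy of a different kind, which is where consensus schemes like the one below come in.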

As the researchers say, gene sequencing costs have gone down drastically, but accuracy of data has only improved slowly. This looks like a job for better error correction.
Some Microsoft-backed researchers have come up with a somewhat oddly named ("Blue") software library for such a purpose. It has "proved to be generally more accurate than other published algorithms, resulting in more accurately aligned reads and the assembly of longer contigs containing fewer errors." They write:

One significant feature of Blue is that its k-mer consensus table does not have to be derived from the set of reads being corrected. This decoupling makes it possible to correct one dataset, such as small set of 454 mate-pair reads, with the consensus derived from another dataset, such as Illumina reads derived from the same DNA sample. Such cross-correction can greatly improve the quality of small (and expensive) sets of long reads, leading to even better assemblies and higher quality finished genomes.

In a paper published in Bioinformatics, Paul Greenfield, research group leader in CSIRO's Division of Computational Informatics, said test results show that Blue is significantly faster than other available tools—especially on Windows—and is also "more accurate as it recursively evaluates possible alternative corrections in the context of the read being corrected."
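The core intuition behind a k-mer consensus table is easy to sketch, and the sketch below is just that – my own toy illustration of the general idea, not Blue's actual algorithm. Count every length-k substring across all reads; a sequencing error creates k-mers seen almost nowhere else in the dataset, so positions covered by rare k-mers are the suspects.

```python
# Toy k-mer consensus: flag positions in a read covered by rare k-mers.
from collections import Counter

K = 4  # k-mer length; real tools use larger k (e.g. 25)

def kmer_table(reads):
    """Consensus table: how often each k-mer occurs across the dataset."""
    table = Counter()
    for read in reads:
        for i in range(len(read) - K + 1):
            table[read[i:i + K]] += 1
    return table

def suspect_positions(read, table, min_count=2):
    """Start indices of k-mers that fall below the trust threshold."""
    return [i for i in range(len(read) - K + 1)
            if table[read[i:i + K]] < min_count]

reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]  # last has an error
table = kmer_table(reads)
print(suspect_positions(reads[3], table))  # [1, 2, 3, 4] -- span the bad base
```

The decoupling the authors describe falls out naturally here: nothing forces `table` to be built from the same reads being checked, which is what lets a deep Illumina dataset correct a shallow 454 one.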

The mass of data related to the genome (not just the genome, by any means) continues to call for cutting edge thinking. That thinking may come into play far beyond the old gene pool. - Jack Vaughan

Related
http://blogs.msdn.com/b/msr_er/archive/2014/09/02/a-new-tool-to-correct-dna-sequencing-errors-using-consensus-and-context.aspx

http://www.csiro.au/Outcomes/ICT-and-Services/Software/Blue.aspx

http://bioinformatics.oxfordjournals.org/content/early/2014/06/11/bioinformatics.btu368