Lies and big data

The other week someone brought to my attention an article titled “Lies Data Tell Us” by Steven J. Thompson, CEO at Johns Hopkins Medicine International. The title took me aback, but as I read it I realized the article was really about the better practices required for data to be more useful. The provocative and somewhat misleading title resulted in nearly 12K views, dozens of comments and hundreds of shares on social media. When I started looking for this article again, the search brought up a number of links that associate data, big data, etc. with “lies”. Most of the authors blame data, or unscrupulous mining and analysis technology vendors, for all sorts of business problems resulting from “data lies”. It seems some of these authors use the following definition:

 Data Scientist (n): A machine for turning data you don’t have into infographics you don’t care about.

I would like to examine a process people often follow when they deal with data.

Since the term “big data” is thrown around a lot, I would like to define it in the context of this article. Mere volume and velocity of data do not constitute “big data”, but multiplicity of data sources and data formats does. From that perspective, the term “big data” describes enterprise data aggregated from multiple departments and multiple databases (i.e. the data warehouse model), linked with data from sources external to the company, in structured and/or unstructured formats. Mining such a set of “right data” may produce very valuable intelligence. However, it can also result in wasted money, effort and opportunities if

  • The mining process does not produce relevant new intelligence, or
  • The intelligence is not used for action.

We act when we believe the action will result in a desirable outcome. We never know for sure, but we estimate the probability based on our experiences in similar circumstances. These dynamics influence how we select, search and interpret the data into intelligence, or the lack thereof. Subconsciously we select data that would likely confirm our existing beliefs. This usually means that we rely heavily on internally generated (controlled) data and heavily discount externally generated data.

We like to use such terms as unbiased and objective, but the very process of selecting a data set introduces bias and subjectivity. It is unavoidable. It is a much better practice to embrace and understand a bias that is pragmatic, and to define the purpose of an inquiry. You don’t see people mining a mountain to find “whatever” is there. They carefully select and test an area for indications of a high concentration of the desired mineral before exploration and mining start.

If the purpose of your inquiry is to improve customer experience, assemble a data set from the most relevant internal and external data sources available. If you limit your data set to company-controlled data, you introduce a company bias. In such a case the likelihood of discovering any new intelligence for improving your customers’ experience is quite low. Forget about data mining and just continue your archaic surveying exercises of “guess and validate”. If you include data generated by customers without solicitation and control, you will introduce a customer bias. Adding channel-generated return data and customer service data will help balance these biases. Correlating trends in controlled and external data sources will help to discover potential gaps between your beliefs and emerging evidence. Even the best evidence, however, cannot automatically make people abandon their beliefs and start acting differently, but that is a subject for another article.
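As a rough illustration of correlating a controlled trend with an external one, here is a minimal Python sketch. Everything in it is an assumption made for the example: the series names, the monthly figures (invented, not real data), the `trend_gap` helper and its 0.5 threshold. It simply computes a Pearson correlation between a company-run survey score and a sentiment score from unsolicited reviews, and flags the pair when the two trends do not move together.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


def trend_gap(internal, external, threshold=0.5):
    """Return (r, flagged); flagged is True when r falls below the threshold.

    The threshold is an arbitrary illustrative cut-off, not a standard value.
    """
    r = pearson(internal, external)
    return r, r < threshold


# Hypothetical monthly series, invented for illustration only.
survey_satisfaction = [4.1, 4.2, 4.2, 4.3, 4.3, 4.4]      # company-controlled survey
review_sentiment = [0.62, 0.58, 0.55, 0.49, 0.47, 0.41]   # unsolicited customer reviews

r, flagged = trend_gap(survey_satisfaction, review_sentiment)
print(f"survey vs. reviews: r = {r:.2f}, possible gap = {flagged}")
```

A flagged pair proves nothing by itself. It only marks a place where the controlled data and the external data tell different stories, which is where the inquiry should look next.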

The point is – data cannot lie to us; we have to do it ourselves by not mining it honestly and competently.

Comments & Thoughts

  1. William Matthies, CEO Coyote Insight, LLC says:

    Not a lie but important nonetheless.

    Advocates of Big Data (including me) need to address the concerns of many regarding the perceived abuse of Big Data. We spend a lot of time talking about what we can do with data, while the public in general, increasingly including legislators, sees it as ranging from a benign invasion of privacy all the way to identity theft and worse.

    Truth be told it sometimes is that, but nowhere near always. And it sometimes is and could more often be an incredibly beneficial economic growth engine. But not until misconceptions are corrected. Until that happens false perceptions of reality will continue to trump reality.

    Big Data needs a PR makeover.

  2. Holger Kaufmann says:

    Data doesn’t lie, but it might not tell the truth if you don’t know how to ask the right question, or how to ask a question in the right way. Proper data analysis requires some understanding of statistics, causation and correlation. This is why the scientific method was invented. Interpreting data is difficult and many businesses struggle with it.
