From the course: Ethics and Law in Data Analytics

Subjective to objective

- There's an old saying that there are lies, damned lies, and statistics. The idea is that lying with statistics is actually even easier than lying with regular words, because when many people see numbers, charts, and graphs, their brains just tend to shut down and they accept what they see. This view was popularized by the classic 1954 book "How to Lie with Statistics," which has been a best seller ever since. Many of us assume that numbers "speak for themselves" and give us the unfiltered, objective truth, but that is a seriously mistaken assumption. The fact is, it is really easy to misunderstand data even if you're doing your best to be honest, and of course if your goal is to manipulate someone, you can do a lot of mischief.

This is true of data of any kind, small or big, especially when it comes to collection, visualization, and interpretation of its meaning. These mistakes happen, respectively, when data is derived from a sample that is not properly random (a collection problem), when a graph is presented with misleading proportions (a visualization problem; see the first sketch after this transcript), or when the unit of analysis is overlooked (an interpretation problem). We must emphasize that with big data these are still important worries, and most of them are actually more complicated now. We will talk about some of them in module three of this course.

However, a new kind of subjectivity in data is now possible, because data can now be processed by algorithms. There are actually two problems here, because before data can be processed it must be preprocessed, or "cleaned." We will talk about the issues with processing data in module three, since processing can lead to systematic bias in society. Data scientists often refer to cleaning data as ETL: extraction, transformation, and loading. This is because any data set you come across will almost certainly not be ready to be uploaded to a program and processed by an algorithm just the way it is. Many data scientists will tell you that this is the most time-consuming part of their job. The problem is that it is extremely common for a large data set to have inconsistencies, duplicates, corrupted data, missing data, outliers, and entry errors (see the second sketch after this transcript). The data scientist must make many decisions about how to deal with these problems; each decision changes the data set, and it is an open question whether those changes meaningfully alter the results. In "The Promise and Peril of Big Data," David Bollier defines data cleaning as "the process of deciding which attributes and variables matter and which can be ignored," and quotes tech CEO Jesper Anderson that cleaning "removes the objectivity from the data itself because it is a very opinionated process of deciding what variables matter."

The tendency to turn your brain off when confronted with big data conclusions is exactly the opposite of what you should do. When big data offers you answers, it is time to think critically. Two researchers at the University of Washington have focused on the problem of lying with big data, and they advise asking as many questions as possible before accepting conclusions from big data: Who is telling you this? How does it advance their interests? What methods were used to arrive at the result? We will link you to some of their work in the further reading section. Also, in his book "Naked Statistics," Charles Wheelan gives the same sort of advice, but directed toward the statisticians who present the findings. He cautions that statistical analysis is not like math, which yields a correct answer; it is more like detective work. That requires constant, honest, and humble communication among the detectives, and even the best may still end up disagreeing about what the results mean.

None of this is to say that big data is a bad way to go. In fact, it's often the best way we have to gain important insights, but both producers and consumers of big data must use caution.
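
The visualization problem mentioned in the transcript is easy to demonstrate. Below is a minimal sketch in Python, assuming matplotlib is available; the quarterly revenue figures are invented purely for illustration. The same nearly flat numbers look dramatic on a truncated axis and unremarkable on a zero-based one.

```python
import matplotlib.pyplot as plt

# Hypothetical quarterly revenue figures, invented for illustration:
# roughly 3% total growth across the year.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [100.0, 101.5, 102.0, 103.0]

fig, (ax_truncated, ax_honest) = plt.subplots(1, 2, figsize=(10, 4))

# Truncated y-axis: starting the axis near the minimum value makes
# the ~3% change look like a dramatic surge.
ax_truncated.bar(quarters, revenue)
ax_truncated.set_ylim(99, 104)
ax_truncated.set_title("Truncated axis (misleading)")

# Zero-based y-axis: the same numbers look as flat as they really are.
ax_honest.bar(quarters, revenue)
ax_honest.set_ylim(0, 110)
ax_honest.set_title("Zero-based axis (honest)")

plt.tight_layout()
plt.show()
```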
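
The cleaning decisions described in the transcript can be sketched the same way. The following is a minimal, hypothetical example assuming pandas; the column names, values, and the three decisions shown are all assumptions for illustration, not a prescribed procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical raw survey data with typical defects: a duplicate row,
# missing values, and an implausible age that is likely an entry error.
raw = pd.DataFrame({
    "respondent": [1, 2, 2, 3, 4, 5],
    "age": [34, 29, 29, np.nan, 41, 340],
    "income": [52000, 61000, 61000, 48000, np.nan, 75000],
})

# Decision 1: treat repeated respondent IDs as duplicates and keep the first.
cleaned = raw.drop_duplicates(subset="respondent")

# Decision 2: treat ages outside a plausible human range as entry errors
# and drop those rows (while keeping rows where age is simply missing).
cleaned = cleaned[cleaned["age"].between(0, 120) | cleaned["age"].isna()].copy()

# Decision 3: impute missing values with the column median rather than
# dropping the rows entirely.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["income"] = cleaned["income"].fillna(cleaned["income"].median())

print(cleaned)
```

A different analyst could defensibly choose differently at each step, and each choice changes what the downstream algorithm sees, which is exactly the opinionated, subjective character of cleaning that Bollier and Anderson describe.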
