How to Lie With Data

I don’t trust data.

Whenever I see a pretty chart, an average, a line going up and to the right, my first instinct is to question what I’m seeing. How was the data collected? What’s the time period? What are the units? Why are the axes like that? What is the story they are trying to tell?

Any piece of data (1) must be examined in this way; otherwise you get fooled.

Inputs

How?

Well, to start with, the data-generating mechanism can be faulty. Guess what – this means that any conclusions you draw from the data may also be faulty. Garbage in, garbage out.

This is a particularly acute problem when we humans are involved. People don’t always tell the truth, you see, consciously or not. We are bad at estimating: research has found that we can’t accurately estimate the calories we consume or the calories we burn exercising, for example. And we are over-confident about these bad estimates. When people are asked to give an interval that they are, say, 99% sure contains the true value, they get it wrong. Badly (2). This is especially true when we have lots of information. A study of horse-race handicappers (which I now can’t find, sorry) showed that as the amount of available information increased, the accuracy of the predictions did not increase but the confidence the handicappers had in them did. More information makes us more confident but in many instances does not improve performance (3).

We are also vulnerable to psychological traps. As Richard Thaler and Cass Sunstein explained in Nudge: Improving Decisions About Health, Wealth, and Happiness, we can be nudged into making certain choices and acting in certain ways. Daniel Kahneman and Amos Tversky spent decades studying these traps (4). Anchoring, availability, substitution, loss aversion, framing, sunk costs, prospect theory, etc. are all psychological quirks that make us subject to manipulation.

This all means that data based on human input is incredibly suspect, unless collected under strict conditions (which it never is) (5).

Data from experiments, too, deserves similar scrutiny. No, “trust in science” doesn’t always work if this means believing the results of every “statistically significant” experiment. I won’t go too deeply into why – that’s more to do with statistics, which we are thankfully avoiding today. Just know that data from experiments isn’t all that it’s cracked up to be.

Luckily for us, finance isn’t typically plagued with the problems associated with these two data-generating mechanisms. Financial data tends to arrive problem-free. The closing price of Tesco is the closing price of Tesco; it’s pretty hard to fuck that up.

Nevertheless, not all data is created equal. Numbers for the prices of shares in the FTSE 100 are probably more reliable and more accurate than numbers for the MSE (Mongolian Stock Exchange). The more niche the data is, the more likely it is to have come from a ropey source and the more likely it is that the numbers have been fudged.

Output

Ok, so the inputs can be shoddy. Unfortunately, neither you nor I can do an awful lot about it.

What we can do, though, is be honest about how we represent the data.

Apples to oranges

One common problem is the mistake of inappropriate comparison.

Don’t compare the growth rate of one thing to the amount of another. Don’t compare high-quality data to low-quality data. Don’t compare data sets with dramatically different sample sizes (again, we find ourselves being pulled into statistics).

As a general rule, try and ensure you’re comparing apples to apples. False comparisons can be misleading.

Numbers and stuff

Numbers on their own are a much less effective way of representing data. We would much rather see a pretty picture – it’s easier to process and looks faaaaar more convincing. Number abuse is also typically the realm of statistics, not data.

But there are still some tricks (which apply to visualisations, too). Firstly, we must be very aware of what definitions are used. “50% of healthy people don’t eat breakfast!” – what the fuck constitutes a healthy person? Secondly, one must be aware of what data was included from the sample and, most importantly, what wasn’t. “Outliers” are often discarded even though keeping them would have changed the conclusions.
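To see how much a single discarded point can matter, here is a minimal sketch in Python. The numbers are made up purely for illustration, and the cut-off of 40 used to decide what counts as an “outlier” is an arbitrary assumption.

```python
from statistics import mean, median

# Made-up sample: nine ordinary values and one extreme one.
values = [2, 3, 1, 2, 4, 3, 2, 1, 3, 49]


def summarise(data):
    """Return (mean, median) for a list of numbers."""
    return mean(data), median(data)


# Summary of everything that was actually observed.
with_outlier = summarise(values)

# Summary after quietly dropping anything above an arbitrary cut-off.
without_outlier = summarise([v for v in values if v < 40])

print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

The mean moves a lot while the median barely shifts, so whoever reports “the average” gets to pick the story simply by deciding what to leave out.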

Percentages can also cause problems. Which growth rate is used? Geometric or arithmetic? If there is a percentage – what exactly is that a percentage of? Percentages can easily mislead innocent readers such as you and me. “Red meat increases your chance of getting kidney cancer by 10%!” Ignoring the fact that this is probably based on a statistically-faulty epidemiological study, this number probably looks a lot scarier than it actually is. If your chance of getting kidney cancer is 0.01%, eating red meat will (according to the research) mean that your chance of contracting kidney cancer will increase to a whopping…0.011% (6).
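The arithmetic behind both points can be sketched in a few lines of Python. The 0.01% baseline and the 10% increase are the figures from the example above; the pair of yearly growth rates (+50% then -50%) and the function names are my own, chosen only to make the gap between arithmetic and geometric averages obvious.

```python
import math


def apply_relative_increase(baseline_risk, relative_increase):
    """Risk after a relative increase, e.g. 0.10 for 'increases by 10%'."""
    return baseline_risk * (1 + relative_increase)


def arithmetic_mean_growth(rates):
    """Plain average of the per-period growth rates."""
    return sum(rates) / len(rates)


def geometric_mean_growth(rates):
    """Constant per-period rate that gives the same compounded result."""
    compounded = math.prod(1 + r for r in rates)
    return compounded ** (1 / len(rates)) - 1


baseline = 0.0001  # 0.01% expressed as a probability
new_risk = apply_relative_increase(baseline, 0.10)
print(f"relative increase: 10%, new risk: {new_risk:.4%}")
print(f"absolute increase: {new_risk - baseline:.4%}")

rates = [0.50, -0.50]  # hypothetical: up 50% one year, down 50% the next
print(f"arithmetic mean growth: {arithmetic_mean_growth(rates):.2%}")
print(f"geometric mean growth:  {geometric_mean_growth(rates):.2%}")
```

The arithmetic average says the investment went nowhere; the geometric one shows you lost money each year on a compounded basis. Both are “the growth rate”, which is exactly the problem.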

Visualisation

Let’s start on the x-axis. When considering different time periods, the picture can look very different. For example:

FTSE 100 Closing Price, shown over two different time periods. Source: Yahoo.

People simply choose the timeline that best bolsters their argument.

It’s also important to remember that trends don’t last forever. A line seemingly behaving in a certain way might not always be as well-behaved and “predictable”. This can become clear when the timeline is changed.
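A rough sketch of the timeline trick in Python. The index levels below are invented for illustration and are not real FTSE 100 data; the point is only that the same series gives very different headline numbers depending on where you start counting.

```python
# Made-up index levels, one per year. NOT real market data.
levels = {2015: 100, 2016: 80, 2017: 90, 2018: 110, 2019: 95, 2020: 105}


def pct_change(series, start, end):
    """Fractional change in the series between two dates."""
    return (series[end] - series[start]) / series[start]


# Same series, same end date, three different starting points.
for start in (2015, 2016, 2018):
    change = pct_change(levels, start, 2020)
    print(f"{start} to 2020: {change:+.1%}")
```

Start in 2016 and the line looks like a boom; start in 2018 and it looks like a decline. Nothing about the data changed.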

The y-axis can cause just as much distortion. Simply changing the axis can change what the reader sees and understands.

These charts show the same data yet paint a very different picture.
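You can put a number on the y-axis effect without drawing anything. The sketch below is a simplification that assumes a bar chart, where the reader judges values by bar height; the function name and the two values (100 and 105) are hypothetical.

```python
def apparent_ratio(a, b, axis_min=0.0):
    """Ratio of the drawn bar heights for b and a when the y-axis starts at axis_min."""
    if axis_min >= min(a, b):
        raise ValueError("axis_min must be below both values")
    return (b - axis_min) / (a - axis_min)


a, b = 100, 105  # hypothetical values: b is 5% larger than a

honest = apparent_ratio(a, b, axis_min=0)      # axis starts at zero
truncated = apparent_ratio(a, b, axis_min=95)  # axis starts just below the data

print(f"axis from 0:  b looks {honest:.2f}x the height of a")
print(f"axis from 95: b looks {truncated:.2f}x the height of a")
```

A 5% difference drawn on an axis starting at 95 shows up as one bar twice the height of the other.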

Chart titles, too, can be used to hammer home a point (usually not very subtly). They can lead the witness to a conclusion that isn’t really there.

…was there? Source: https://www.ck12.org/c/statistics/misleading-graphs-identify-misleading-statistics/lesson/Identification-of-Misleading-Statistics/.

Even the actual data – what is contained within the axes – can be contorted. Making data 2D when it should be 1D makes your argument more convincing. This is because scaling a 2D object so that its height doubles (with its width growing in proportion) increases its area by a factor of 4. This makes differences appear larger than they actually are.

Even though B is only 3 times bigger than A, it appears to be 9 times bigger. Source: https://en.wikipedia.org/wiki/Misleading_graph.
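The scaling is simple enough to check in Python. This is a minimal sketch that assumes the picture is scaled uniformly (height and width together); the function names are mine.

```python
def apparent_size_ratio(value_ratio, dimensions):
    """How much bigger a shape looks when every side is scaled by value_ratio."""
    return value_ratio ** dimensions


def honest_side_scale(value_ratio, dimensions):
    """Side scaling needed so the shape's size grows by exactly value_ratio."""
    return value_ratio ** (1 / dimensions)


value_ratio = 3  # B is 3 times A

print("1D bar:   ", apparent_size_ratio(value_ratio, 1))
print("2D picture:", apparent_size_ratio(value_ratio, 2))
print("honest 2D side scale:", round(honest_side_scale(value_ratio, 2), 3))
```

If you insist on using pictures, scaling each side by the square root of the ratio keeps the area, which is what the eye actually reads, in proportion to the data.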

Finally, if your data is so unfavourable that even all these tricks don’t work, just make the chart complicated and 99% of readers won’t bother to try and understand it.

Errrr what? Source: https://betterfigures.org/2012/11/28/too-much-information/.

Agendas

Why does horrible chart etiquette persist? Partly due to incompetence, granted, but this isn’t the reason for the majority of the problems. Most people know how to make an accurate representation of the data; they simply choose to misrepresent it to better support their argument.

Climate Activists should target pirates, not oil conglomerates. This chart would be convincing if it wasn’t obviously bollocks. Source: https://www.indy100.com/offbeat/bizarre-correlations-that-will-leave-you-wishing-nicolas-cage-would-retire-7240456.

This works because the same set of data, even without the use of statistics, can be used to tell multiple different stories. Many of these will be wrong.

Think of politicians talking about favourable economic data by selecting only those variables that are exhibiting growth over a very specific timeline. Or cigarette companies illustrating how smoking does not cause lung cancer by showing cumulative lung cancer cases with apparently no real change after the introduction of smoking. Even me. Look at the charts and numbers used on this site – there’s a good chance some of them will be slightly distorted to fit my argument.

The uncomfortable truth is that nearly everyone has an agenda. They have a story in their mind first, then select data and representations of that data to support this story. Not many operate the right way round: looking at the data first and forming opinions after, rather than arriving at conclusions and then latching onto anything that supports them, wielding data to fit a preconceived world-view.


Notes

(1) And I’m not even talking about analysis, either. Analysing data – better known as statistics – is even more treacherous ground, something we will talk more about in another post. Here, I’m just talking about the numbers, before any type of statistical analysis has been attempted.

(2) Try this with your friends (if you have any). Ask them to provide an interval for some number that they are 99% sure of. I’ll bet that 25%-50% of the time the actual number will be outside the interval.

(3) It might actually hurt performance.

(4) Kahneman wrote an excellent book about it: Thinking, Fast and Slow.

(5) All this (and more) is discussed in How to Lie with Statistics by Darrell Huff. Some of the material is slightly outdated but the message still rings true.

(6) If you’re interested in misleading, and downright incorrect, uses of numbers like this, read The Tiger That Isn’t: Seeing Through a World of Numbers by Michael Blastland and Andrew Dilnot.

