Booknotes: Avoiding Data Pitfalls by Ben Jones

Photo by Raúl Nájera on Unsplash

I noticed Jones’ book (Avoiding Data Pitfalls) through the reviews on Amazon and Goodreads. A few weeks ago I had a chance to read it. Below are my reading notes and main takeaways.

Overall impression: a valuable book for new analysts and data professionals. It is approachable and lays out the main points of efficient data work. Experienced analysts will probably recognise most of the points Jones makes, but it is still a good read.

The book lays out 7 categories of pitfalls where data professionals can fail and explains how to avoid them. I will summarize them below:

Pitfall 2 – Technical Trespasses

This chapter held little interest for me. Jones talks about ensuring data quality and making correct joins. He explains well why these things are important but does not offer any tips or techniques for the more experienced.

Pitfall 3 – Mathematical Miscues

In this chapter Jones focuses on the ways you can mess up when making calculations on your data. He details 5 main categories in this space:

A) Aggregations

Be wary of blindly relying on descriptive statistics and aggregations. Always explore the row-level data and the overall shape of the data. What are the MIN / MAX values? Are there outliers? Is there missing data? Are there NULL values?
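A minimal sketch of this kind of pre-aggregation check in pandas, using a small hypothetical `revenue` column (the data and column name are my own, not from the book):

```python
import pandas as pd

# Hypothetical revenue figures: one extreme outlier and one missing value
df = pd.DataFrame({"revenue": [120, 95, 110, None, 8000]})

# The mean alone hides both problems
print(df["revenue"].mean())  # pulled far up by the outlier

# Explore the shape of the data before trusting any aggregate
print(df["revenue"].describe())    # count, mean, std, min, max, quartiles
print(df["revenue"].isna().sum())  # number of NULL / missing values
```

Here `describe()` immediately exposes the suspicious max value, and the count being one short of the row count flags the NULL.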

B) Missing values

Don’t ignore missing / NULL values when graphing data. For example, a line graph drawn over NULL values can be misleading: the viewer cannot tell whether a gap means missing data or a genuine 0 value.
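A small illustration of why the distinction matters, with a hypothetical daily series (my own example data): most pandas aggregations silently skip NULLs, so a chart or summary built on them quietly tells a different story than one that treats them as zeros.

```python
import pandas as pd

# Hypothetical daily counts; on day 3 the source was down (NULL), day 5 was a real 0
s = pd.Series([10, 12, None, 11, 0])

# NULLs are skipped by default -- the mean is computed over 4 values, not 5
print(s.mean())

# Treating NULL as 0 gives a different (and possibly wrong) answer
print(s.fillna(0).mean())
```

Whichever treatment is correct depends on what the NULL actually means; the pitfall is not making that choice consciously.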

C) Totals

When calculating totals, make sure the data is clean and that you know what rows are in it. For example, if the dataset contains rows with precalculated totals, or duplicate rows, your totals will be wrong.
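A sketch of the precalculated-total trap in pandas, with made-up regional sales data (the `TOTAL` marker row is my own convention for illustration):

```python
import pandas as pd

# Hypothetical export that mixes detail rows with a precalculated total row
df = pd.DataFrame({
    "region": ["North", "South", "West", "TOTAL"],
    "sales":  [100, 200, 150, 450],
})

naive = df["sales"].sum()                           # double counts the total row
clean = df[df["region"] != "TOTAL"]["sales"].sum()  # detail rows only
print(naive, clean)
```

The naive sum is exactly double the true total, which is a tell-tale signature of summing over an embedded total row.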

D) Percents

Don’t take the average of percentages if you don’t know the sample sizes. If the percentages are derived from samples of different sizes, a simple average will produce incorrect results (see this blog).
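A quick sketch of the difference, using hypothetical conversion-rate data (two groups of very different size, numbers invented for illustration):

```python
# Hypothetical conversion counts for two groups of very different size
converted = [5, 300]    # conversions per group
visitors  = [10, 1000]  # group sizes

rates = [c / n for c, n in zip(converted, visitors)]  # 50% and 30%

naive_avg = sum(rates) / len(rates)           # 40% -- ignores sample sizes
weighted_avg = sum(converted) / sum(visitors) # ~30.2% -- the true overall rate
print(naive_avg, weighted_avg)
```

The tiny 10-visitor group drags the naive average up by nearly ten percentage points; weighting by sample size (or just re-deriving the rate from the raw counts) avoids this.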

E) Units

Make sure to check and double-check the units of measurement when working with US / rest-of-world data. Assuming the wrong unit of measurement (Celsius vs Fahrenheit, miles vs km, etc.) is a very common type of error.
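A tiny sketch of how far off a misread unit can put you (the conversion formula is standard; the helper name is my own):

```python
def c_to_f(celsius: float) -> float:
    """Convert Celsius to Fahrenheit: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32

# 20 degrees Celsius is a mild day; misread as 20 Fahrenheit it is well below freezing
print(c_to_f(20))  # 68.0
```

The safest habit is to convert explicitly at the boundary where data enters your pipeline and to carry one unit internally.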

Pitfall 4 – Statistical Slipups

Here Jones goes over some fundamental and common errors he has encountered while working in the analytics space.

A) Descriptive statistics done wrong

Descriptive / summary statistics can be misleading when the underlying data does not follow a normal distribution. In such cases, displaying only the mean or the median is unlikely to provide quality information.

It is better to always explore the shape of the data first. Display the distribution of the data, or its min / max values, for the end user. Consider displaying the standard deviation and variance if the audience can understand these.
Consider removing outliers if possible.
In the case of multimodal distributions, consider breaking the distribution down along relevant dimensions to try to recover normal distributions.
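A minimal sketch of how skewed data misleads the mean, using a small hypothetical income list (values invented; stdlib `statistics` only):

```python
import statistics

# Hypothetical right-skewed incomes (in thousands): one very high earner
incomes = [30, 32, 35, 38, 40, 500]

print(statistics.mean(incomes))    # dragged far up by the outlier
print(statistics.median(incomes))  # much closer to the "typical" value
print(statistics.stdev(incomes))   # huge spread signals a non-normal shape
```

The mean lands nowhere near any actual observation, while a standard deviation several times the median is itself a warning that the distribution is worth plotting before summarizing.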

B) Inferential statistics

Don’t shy away from the p-value as a significance statistic. Use it, for example, when comparing means, but be careful:

– Keep in mind that a low or a high p-value might just be due to chance.
– A low p-value does not prove that the null hypothesis is wrong.
– Don’t cherry-pick results with low p-values and make decisions based on them. Always consider whether a significant result is practical: does the difference actually make any difference in the real world?
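A sketch of statistical vs practical significance with invented A/B measurements, computing the equal-variance two-sample t statistic by hand from the stdlib (the data and the 5% cutoff used in the comments are my own illustration, not from the book):

```python
import math
import statistics

# Hypothetical A/B measurements (e.g. page load times for two variants)
a = [10.0, 10.1, 9.9, 10.2, 10.0, 10.1, 9.9, 10.0]
b = [10.1, 10.2, 10.0, 10.3, 10.1, 10.2, 10.0, 10.1]

# Equal-variance two-sample t statistic: t = (mean_b - mean_a) / SE
mean_a, mean_b = statistics.mean(a), statistics.mean(b)
var_a, var_b = statistics.variance(a), statistics.variance(b)
n = len(a)
se = math.sqrt((var_a + var_b) / 2) * math.sqrt(2 / n)
t = (mean_b - mean_a) / se

print(t)               # ~1.93, below the ~2.14 two-tailed 5% cutoff for 14 df
print(mean_b - mean_a) # and even so, the raw difference is only ~0.1 units
```

Two separate questions hide here: whether the difference is statistically distinguishable from noise, and whether 0.1 units would matter to anyone in practice.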

C) Sampling

Make sure the sample uses stratified sampling, so that you don’t compare apples with oranges. Analyzing data with a low number of observations for one event against a high number of observations for another event will produce incorrect results.
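A sketch of stratified sampling in pandas on a hypothetical, heavily unbalanced frame (segment names and sizes are my own), using `GroupBy.sample` to draw the same fraction from each stratum:

```python
import pandas as pd

# Hypothetical survey frame with very unbalanced groups: 90 A rows, 10 B rows
df = pd.DataFrame({
    "segment": ["A"] * 90 + ["B"] * 10,
    "score":   list(range(90)) + list(range(10)),
})

# Take the same fraction from each segment so the small group is neither
# drowned out nor missed entirely, as it might be in a simple random sample
sample = df.groupby("segment").sample(frac=0.2, random_state=0)
print(sample["segment"].value_counts())  # 18 A rows, 2 B rows
```

The sample preserves the 90/10 segment proportions by construction, which a plain `df.sample(frac=0.2)` only does on average.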

D) Sample size

Be wary of low sample sizes. A small sample lets extreme values have a disproportionate effect on descriptive statistics. One star student in a small school can raise the average score dramatically, whereas in a big school the same student’s results would hardly move the metric.
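The school example from the paragraph above can be sketched in a few lines (the scores are invented for illustration):

```python
import statistics

# One exceptional score (100) among otherwise identical scores of 70
small_school = [70] * 9 + [100]   # 10 students
big_school   = [70] * 99 + [100]  # 100 students

print(statistics.mean(small_school))  # 73.0 -- one student moves it 3 points
print(statistics.mean(big_school))    # 70.3 -- the same student barely registers
```

The identical outlier shifts the small-school average ten times as much, purely because of sample size.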

Pitfall 5 – Analytical Aberrations

More general observations on the techniques analysts deploy in everyday work.

A) Why intuition matters

Jones makes a good point that without human input we cannot derive value from data analytics. We can’t rely only on data and algorithms. In a world increasingly overflowing with data, human intuition is the important spark plug that extracts value from the algorithms that process this mess.

In a nutshell, human intuition is invaluable for:

– Seeing which patterns in the data really matter
– Interpreting what the data is telling us
– Deciding where to best look next
– Knowing when to stop and take action
– Knowing who to present the results to, and how

B) Extrapolation overconfidence

Extrapolation from incomplete data can be misleading: it never predicts the future, it only projects the trend from past data points. No one knows what the future brings.

C) Can you interpolate missing datapoints?

Hiding detailed data points to show big changes is a good way to drive a point home, but it can also be misleading: end users interpolate the missing values from an aggregated graph incorrectly. Detailed data points can reveal important information about the process and the causes of the big changes, especially when coupled with human intuition.

D) Forecasting

Forecasts are inherently biased. They can be honestly biased (because a good analyst knows they cannot predict the future) but also dishonestly biased (to promote an agenda). Keep this in mind.

E) Metrics

We as analysts have a tendency to cram our dashboards with as much data as possible. Maybe it’s to show off our hard work, or we are simply mesmerized by the complexity of the problem and would like to share it. However, this is not good. You should only present measures that really, really matter. People take action based on measures. More importantly, people take measurements of their performance and effort very personally. Showing unnecessary metrics can cause unexpected roadblocks that could have been easily avoided.

Pitfall 6 – Graphical Gaffes

Tips on making better charts.

A) Core purpose

The most important part of designing a good graph is knowing its intended use. You need to know the core purpose of the graph you are creating: who will use it, how they will use it, what their level of data literacy is, and how much time they have to use it. Once you know as much as you can, design the graph accordingly. Just also remember to validate that the result meets these conditions and is usable.

B) Keep an eye out for opportunity

Don’t rule out unpopular chart types. Keep an open mind and a wide perspective on visualization types in order to spot opportunities. Some might not believe it, but word clouds and pie charts have their perfect-fit uses. 🥧

C) Optimize vs. Satisfice

Use your intuition and your knowledge of the graph’s purpose to know when to stop designing and present the results. Consider whether you need to show the utmost detail or just get across a general idea.

Pitfall 7 – Design Dangers

More tips on making better charts and dashboards.

A) Colors

Be consistent in your use of color. Use one encoding. Be frugal with color: not all charts or dimensions need to be color coded. Highlight the important ones and leave the rest in a muted color.

B) Add some art

Consider whether adding some artistic design elements would improve the graph. This can help get the message across and make the graph more memorable or approachable.

C) Usability

Test whether the users of your graphs can actually use them well. Observe your end users. Interview them. Add usability elements such as easy-to-use buttons, with layouts that match the layout of the dashboard graph.

Conclusion

Overall I liked the book. It was an easy read. Although I had come across the points Jones discussed before, I still enjoyed recognizing the pitfalls and acknowledging them once more. This is a great book for new analysts. More experienced analysts can still get value out of it, if only for talking points.
