I noticed Jones’ book (Avoiding Data Pitfalls) through the reviews on Amazon and Goodreads. A few weeks ago I had a chance to read it. Below are my reading notes and main takeaways.
Overall impression: a valuable book for new analysts and data professionals. It is approachable and lays out the main points of efficient data work. Experienced analysts will probably recognise most of the points Jones makes, but it is still a good read.
The book lays out 7 categories of pitfalls where data professionals can fail and explains how to avoid these pitfalls. I will summarize them below:
Pitfall category 2 – Technical Trespasses
This chapter held little interest for me. Jones talks about ensuring data quality and making correct joins. He explains well why these things are important but does not deliver any tips or techniques for the more experienced.
Pitfall category 3 – Mathematical Miscues
In this chapter Jones focuses on the ways you can mess up when making calculations on your data. He details 5 main categories in this space:
A) Aggregations
Be wary of blindly relying on descriptive statistics and aggregations. Always explore the row-level data and the overall shape of the data. What are the MIN / MAX values? Are there outliers? Is there missing data? Are there NULL values?
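These checks take only a few lines. A minimal sketch in Python, using made-up sales figures (with None marking missing rows and one deliberate outlier):

```python
import statistics

# Hypothetical daily sales figures; None marks missing rows.
sales = [120, 135, 128, None, 131, 9500, 127, None, 133]

present = [v for v in sales if v is not None]

print("rows:", len(sales), "missing:", sales.count(None))
print("min/max:", min(present), max(present))  # 120 / 9500 — the max flags an outlier
print("mean:", statistics.mean(present))       # pulled far up by the outlier
print("median:", statistics.median(present))   # a more robust 131
```

The mean lands near 1468 while the median stays at 131, which is exactly the kind of gap that tells you to look at the row-level data before trusting any aggregate.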
B) Missing values
Don’t ignore missing or NULL values when graphing data. For example, a line graph with NULL values can be misleading and make the viewer think that the data does not have any 0 values.
C) Totals
When calculating totals, make sure the data is clean and that you know what rows are in it. For example, if the dataset contains rows with precalculated totals or duplicate rows, these will mess up your totals.
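A quick sketch of both failure modes in Python; the order rows and the "subtotal" marker are made up for illustration:

```python
# Hypothetical order rows: one is a precalculated subtotal, one is a duplicate.
rows = [
    {"id": 1, "kind": "order", "amount": 100},
    {"id": 2, "kind": "order", "amount": 250},
    {"id": 2, "kind": "order", "amount": 250},      # duplicate row
    {"id": 99, "kind": "subtotal", "amount": 350},  # precalculated total row
]

naive_total = sum(r["amount"] for r in rows)  # 950: counts the subtotal and the duplicate

# Keep only real order rows and drop duplicate ids before summing.
seen, clean_total = set(), 0
for r in rows:
    if r["kind"] == "order" and r["id"] not in seen:
        seen.add(r["id"])
        clean_total += r["amount"]

print(naive_total, clean_total)  # 950 vs 350
```

The naive sum is nearly triple the real total, which shows why knowing what each row represents comes before any arithmetic.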
D) Percents
Don’t take the average of percentages if you don’t know the sample sizes. If the percentages are derived from samples of different sizes, a simple average will produce incorrect results (see this blog).
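The fix is to weight by sample size. A sketch with made-up conversion numbers:

```python
# Hypothetical conversion rates from two samples of very different sizes.
samples = [
    {"visitors": 10,   "rate": 0.50},  # 5 conversions out of 10
    {"visitors": 1000, "rate": 0.02},  # 20 conversions out of 1000
]

# Naive average of the two percentages: (0.50 + 0.02) / 2 = 0.26 — misleading.
naive_avg = sum(s["rate"] for s in samples) / len(samples)

# Weighting by sample size recovers the true overall rate: 25 / 1010 ≈ 0.025.
conversions = sum(s["visitors"] * s["rate"] for s in samples)
visitors = sum(s["visitors"] for s in samples)
weighted_avg = conversions / visitors

print(naive_avg, weighted_avg)
```

The naive average suggests a 26% conversion rate when the true overall rate is about 2.5%, because the tiny 10-visitor sample gets the same weight as the 1000-visitor one.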
E) Units
Make sure to check and double-check the units of measurement when working with mixed US / rest-of-world data. It is a very common error to assume the wrong unit of measurement (Celsius vs. Fahrenheit, miles vs. km, etc.).
Pitfall category 4 – Statistical Slipups
Here Jones goes over some fundamental and common errors he has encountered when working in the analytics space.
A) Descriptive statistics done wrong
Descriptive / summary statistics can be misleading when the underlying data does not follow the normal distribution. In such cases displaying only the mean or the median is unlikely to provide quality information.
It is better to always explore the shape of the data first. Display the distribution of the data, or the min / max values, for the end user. Consider displaying the standard deviation and variance if the audience can understand them.
Consider removing the outliers if possible.
In case of multimodal distributions, consider breaking the distribution down by relevant dimensions to try to recover normally distributed subgroups.
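A small sketch of why the mean misleads on bimodal data, using made-up response times with two modes:

```python
import statistics

# Hypothetical response times with two modes: cache hits (~10 ms) and misses (~200 ms).
latency_ms = [9, 10, 11, 10, 9, 198, 201, 200, 199, 202]

print(statistics.mean(latency_ms))  # 104.9 ms — describes no actual request
print(sorted(latency_ms))           # the sorted values reveal the two clusters

# Splitting by a relevant dimension (hit vs. miss) gives meaningful summaries.
hits = [v for v in latency_ms if v < 100]
misses = [v for v in latency_ms if v >= 100]
print(statistics.mean(hits), statistics.mean(misses))  # 9.8 and 200.0
```

The overall mean of 104.9 ms sits in a gap where no observation exists; splitting by the hypothetical hit/miss dimension yields two summaries that each describe their subgroup well.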
B) Inferential statistics
Don’t shy away from the p-value significance statistic. Use it, for example, when comparing means, but be careful.
Keep in mind that a low or a high p-value might just be due to chance.
A low p-value is not proof that the null hypothesis is wrong.
Also, don’t cherry-pick results with low p-values and make decisions based on them. Always consider whether a significant result is practical: does the difference actually make any difference in the real world?
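One assumption-light way to compare means is a permutation test. This sketch is mine, not from the book, and the sample values are made up:

```python
import random
import statistics

random.seed(42)  # fixed seed so this sketch is reproducible

def permutation_p_value(a, b, n_perm=2000):
    """Two-sided permutation test on the difference of group means."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Clearly different groups: very few random shuffles reproduce the observed gap.
p_diff = permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])

# Identical groups: every shuffle is at least as extreme, so p is exactly 1.0.
p_same = permutation_p_value([5, 5, 5, 5, 5], [5, 5, 5, 5, 5])

print(p_diff, p_same)
```

Even with a test like this, roughly 5% of truly null comparisons will land below 0.05 by pure chance, which is exactly why cherry-picking low p-values is dangerous.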
C) Sampling
Make sure the sample uses stratified sampling, so that you don’t compare apples with oranges. Comparing an event with a low number of observations against an event with a high number of observations will produce incorrect results.
D) Sample size
Be wary of low sample sizes. A low sample size lets extreme values have a disproportionate effect on descriptive statistics. One star student in a small school can raise the average score dramatically, whereas in a big school that student’s results would hardly move the metric.
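The school example is easy to check with a quick sketch (the scores are made up):

```python
import statistics

star = 100
small_school = [70, 72, 68, 71]       # 4 students
large_school = [70, 72, 68, 71] * 25  # 100 students with the same spread

print(statistics.mean(small_school))           # 70.25
print(statistics.mean(small_school + [star]))  # 76.2: one student moves it a lot
print(statistics.mean(large_school))           # 70.25
print(statistics.mean(large_school + [star]))  # ~70.54: barely moves
```

The same star student shifts the small school’s average by almost six points but the large school’s by less than a third of a point.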
Pitfall category 5 – Analytical Aberrations
More general observations on techniques analysts deploy in everyday work.
A) Why intuition matters
Jones makes a good point that without human input we cannot derive value from data analytics; we can’t rely only on data and algorithms. In a world increasingly overflowing with data, human intuition is the spark plug that extracts value from the algorithms that process this mess.
In a nutshell, human intuition is invaluable for:
– Seeing which patterns really matter in the data
– Interpreting what the data is telling us
– Deciding where best to look next
– Knowing when to stop and take action
– Knowing whom to present the results to, and how
B) Extrapolation overconfidence
Extrapolation from incomplete data can be misleading: it never predicts the future, it only projects the trend from past data points. No one knows what the future brings.
C) Can you interpolate missing datapoints?
Hiding detailed data points to show big changes is a good way to drive a point home, but it can also be misleading. End users often interpolate the missing values from an aggregated graph incorrectly. Detailed data points can reveal important information about the process and the causes of the big changes, especially when coupled with human intuition.
D) Forecasting
Forecasts are inherently biased. They can be honestly biased (because a good analyst knows they cannot predict the future) but also dishonestly (to promote an agenda). Keep this in mind.
E) Metrics
We as analysts have a tendency to cram our dashboards with as much data as possible. Maybe it’s to show off our hard work, or we are simply mesmerized by the complexity of the problem and would like to share it. However, this is not good. You should only present measures that really, really matter. People take action based on measures. More importantly, people take measurements of their performance and effort very personally. Showing unnecessary metrics can cause unexpected roadblocks that could have been easily avoided.
Pitfall category 6 – Graphical Gaffes
Tips on making better charts.
A) Core purpose
The most important part of designing a good graph is knowing its intended use. You need to know the core purpose of the graph you are creating: who will use it, how they will use it, what their level of data literacy is, and how much time they have to use it. Once you know as much as you can, go and design the graph accordingly. Also remember to validate that the result meets these conditions and is usable.
B) Keep an eye out for opportunity
Don’t rule out unpopular chart types. Keep an open mind and a wide perspective on visualization types to spot opportunities. Some might not believe it, but word clouds and pie charts have their perfect-fit uses. 🥧
C) Optimize vs. Satisfice
Use your intuition and knowledge of the graph’s purpose to know when to stop designing and present the results. Consider whether you need to show the utmost detail or just get across a general idea.
Pitfall category 7 – Design Dangers
More tips on making better charts and dashboards.
A) Colors
Be consistent in your use of color: use one encoding. Be frugal with color; not all charts or dimensions need to be color coded. Highlight the important ones and leave the rest in a muted color.
B) Add some art
Consider whether adding some artistic design elements would improve the graph. This can help get the message across and make the graph more memorable or approachable.
C) Usability
Test whether the users of your graphs can actually use them well. Observe your end users. Interview them. Add usability elements such as easy-to-use buttons, with layouts that match the layout of the dashboard graph.
Conclusion
Overall I liked the book. It was an easy read. Although I had come across the points Jones discusses before, I still enjoyed recognizing the pitfalls and acknowledging them once more. This is a great book for new analysts. More experienced analysts can still get value out of it, if only for talking points.