Booknotes: Avoiding Data Pitfalls by Ben Jones

I noticed Jones' book (Avoiding Data Pitfalls) from the reviews on Amazon and Goodreads. A few weeks ago I had a chance to read it. Below are my reading notes and main takeaways.

Overall impression: a valuable book for new analysts and data professionals. It is approachable and lays out the main points of efficient data work. Experienced analysts will probably recognise most of the points Jones makes, but it is still a good read.

The book lays out 7 categories of pitfalls where data professionals can fail and explains how to avoid them. I will summarize them below:

Technical Trespasses

This chapter was of little interest to me. Jones talks about ensuring data quality and making correct joins. He explains well why these things are important but does not deliver any tips or techniques for the more experienced.

Mathematical Miscues

In this chapter Jones focuses on the ways you can mess up when making calculations on your data. He details 5 main categories in this space:

A) Aggregations

Be wary of blindly relying on descriptive statistics and aggregations. Always explore the row-level data and the overall shape of the data. What are the MIN / MAX values? Are there outliers? Is there missing data? Are there NULL values?

B) Missing values

Don't ignore missing values / NULL values when graphing data. For example, a line graph with NULL values can be misleading and make the viewer think that the data does not have any 0 values.

C) Totals

When calculating totals, make sure the data is clean and that you know what rows are in the data. For example, if the dataset contains rows with precalculated totals or duplicate rows, these will mess up your totals.

D) Percents

Don't take the average of percentages if you don't know the sample sizes. If the percentages are derived from samples of different sizes, a simple average will produce incorrect results (see this blog).
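A quick worked example with made-up numbers: if sample A converts 1 out of 10 visitors (10%) and sample B converts 50 out of 100 (50%), the naive average is (10% + 50%) / 2 = 30%, but the true combined rate is (1 + 50) / (10 + 100) ≈ 46%. Weight by sample size, or recompute from the raw counts, instead.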

E) Units

Make sure to check and double check the units of measurement when working with US / rest-of-world data. It is a very common error to assume the wrong unit of measurement (Celsius vs Fahrenheit, miles vs km, etc.).

Pitfall category 4 – Statistical slipups

Here Jones goes over some fundamental and common errors he has encountered when working in the analytics space.

A) Descriptive statistics done wrong

Descriptive / summary statistics can be misleading when the underlying data does not follow a normal distribution. In such cases displaying only the mean or the median is unlikely to provide quality information.

It is better to always explore the shape of the data first. Display the distribution of the data or the min / max values for the end user. Consider displaying the standard deviation and variance if the audience can understand these.
Consider removing the outliers if possible.
In case of multimodal distributions, consider breaking the data down by relevant dimensions to try to recover distributions closer to normal.

B) Inferential statistics

Don't shy away from the p-value significance statistic. Use it, for example, when comparing means, but be careful.

Keep in mind that a low or a high p-value might just be due to chance.
A low p-value does not equal proof that the null hypothesis is wrong.
Also, don't cherry-pick results with low p-values and make decisions based on them. Always consider whether the significant result is practical. Does the difference actually make any difference in the real world?

C) Sampling

Make sure the sample uses stratified sampling so that you don't compare apples with oranges. Analyzing data with a low number of observations for one event versus a high number of observations for another event will produce incorrect results.

D) Sample size

Be wary of low sample sizes. A low sample size can give extreme values a disproportionate effect on descriptive statistics. One star student in a small school can raise the average score dramatically, whereas in a big school the same student's results would hardly move the metric.
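To illustrate with made-up numbers: in a school of 10 students averaging 70 points, a single 100-point score lifts the average by roughly 3 points; in a school of 1,000 students the same score moves the average by only about 0.03 points.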

Pitfall category 5 – Analytical aberrations

More general observations on techniques analysts deploy in everyday work.

A) Why intuition matters

Jones makes a good point that without human input we cannot derive value from data analytics. We can't rely only on data and algorithms. In a world increasingly overflowing with data, human intuition is the spark plug that extracts value from the algorithms that process this mess.

In a nutshell, human intuition is invaluable for:

– Seeing which patterns really matter in the data
– Interpreting what the data is telling us
– Deciding where to look next
– Knowing when to stop and take action
– Knowing whom to present the results to, and how

B) Extrapolation overconfidence

Extrapolation from incomplete data can be misleading, as it never predicts the future; it only projects the trend in past data points forward. No one knows what the future brings.

C) Can you interpolate missing datapoints?

Hiding detailed data points to show big changes is a good way to drive a point home, but it can also be misleading. End users often interpolate the missing values from an aggregated graph incorrectly. Detailed data points can reveal important information about the process and the causes behind the big changes, especially when coupled with human intuition.

D) Forecasting

Forecasts are inherently biased. They can be honestly biased (because a good analyst knows they cannot predict the future) but also dishonestly biased (to promote an agenda). Keep this in mind.

E) Metrics

We as analysts have a tendency to cram our dashboards with as much data as possible. Maybe it's to show off our hard work, or maybe we are simply mesmerized by the complexity of the problem and want to share it. However, this is not good. You should only present measures that really, really matter. People take actions based on measures. More importantly, people take measurements of their performance and effort very personally. Showing unnecessary metrics can cause unexpected roadblocks that could have been easily avoided.

Pitfall 6 – Graphical Gaffes

Tips on making better charts.

A) Core purpose

The most important part of designing a good graph is knowing its intended use. You need to know the core purpose of the graph you are creating: who will use it, how they will use it, what their level of data literacy is, and how much time they have to use it. Once you know as much as you can, design the graph accordingly. Just remember to also validate that the result meets these conditions and is usable.

B) Keep an eye out for opportunity

Don't rule out unpopular chart types. Keep an open mind and a wide repertoire of visualization types so you can spot opportunities. Some might not believe it, but word clouds and pie charts have their perfect-fit uses. 🥧

C) Optimize vs. Satisfice

Use your intuition and knowledge of the purpose of the graph to know when to stop designing and present the results. Consider whether you need to show the utmost detail or just get a general idea across.

Pitfall 7 – Design Dangers

More tips on making better charts and dashboards.

A) Colors

Be consistent in your use of color. Use one encoding. Be frugal with color: not all charts or dimensions need to be color coded. Highlight the important ones and leave the rest in a muted color.

B) Add some art

Consider whether adding some artistic design elements would improve the graph. This can help get the message across and make the graph more memorable or approachable.

C) Usability

Test whether the users of your graphs can actually use them well. Observe your end users. Interview them. Add usability elements such as easy-to-use buttons with layouts that match the layout of the dashboard.

Conclusion

Overall I liked the book. It was an easy read. Although I had come across the points Jones discussed before, I still enjoyed recognizing the pitfalls and acknowledging them once more. This is a great book for new analysts. More experienced analysts can still get value out of it, if only for talking points.

(re)Learning SQL – Leading and Lagging

LEAD, LAG and others are SQL analytic functions: just another set of tools to help with querying reporting tables. The syntax is similar to other window functions:

<analytic function> OVER (...)

The difference is that while aggregate window functions aggregate over a window of rows relative to the current row, analytic functions look up a specific row relative to the current row.

LEAD & LAG

These two return the value of the column you specify from the preceding (LAG) or following (LEAD) row.

SELECT
  date, 
  metric,
  LEAD(metric) OVER(ORDER BY date)
FROM table_1;

The line LEAD(metric) OVER(ORDER BY date) returns, for each row, the metric value from the next row when the rows are ordered by date:

| date   | metric | lead |
| ------ | ------ | ---- |
| date_1 | 1      | 2    |
| date_2 | 2      | 3    |
| date_3 | 3      | null |

For example, LAG can be used for calculating the difference to the previous month:

SELECT
  date,
  metric,
  metric - LAG(metric) OVER(ORDER BY date) AS difference_to_prev
FROM table_1;

| date   | metric | difference_to_prev |
| ------ | ------ | ------------------ |
| date_1 | 1      | NULL               |
| date_2 | 2      | 1                  |
| date_3 | 3      | 1                  |

LEAD(x,y,z) and LAG(x,y,z)

Both LEAD() and LAG() can also take additional arguments. These work like this:

LEAD(x,y,z)
x = the column whose value you want to return
y = the offset, i.e. how many rows forward or backward to go to fetch the value
z = the default value to return if nothing is found. For example, when just using LEAD(x), the query returns NULL for the last row in the table, because there are no rows after the last row.

Note that, for value z, the replacement value must be of the same data type as the column. So, assuming the revenue column is an integer type, if we want to fix the last row for LEAD(revenue) we must write LEAD(revenue, 1, 0) and NOT LEAD(revenue, 1, 'nothing to compare').

SELECT
  date,
  metric,
  LAG(metric, 3, 0) OVER(ORDER BY date) AS lag_3_metric
FROM table_1;

| date   | metric | lag_3_metric |
| ------ | ------ | ------------ |
| date_1 | 100    | 0            |
| date_2 | 110    | 0            |
| date_3 | 90     | 0            |
| date_4 | 140    | 100          |
| date_5 | 150    | 110          |

Note that we can actually use either LEAD or LAG to look up in either direction. For this we set the ORDER BY direction to ASC or DESC, depending on which way we want to look.

LAG(…) OVER(ORDER BY … ASC) is the same as LEAD(…) OVER(ORDER BY … DESC).
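For example, assuming the dates in table_1 are unique (so the ordering is deterministic), the two columns below return the same thing: the metric from the previous date.

SELECT
  date,
  metric,
  LAG(metric) OVER(ORDER BY date ASC) AS prev_via_lag,
  LEAD(metric) OVER(ORDER BY date DESC) AS prev_via_lead
FROM table_1;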

FIRST_VALUE, LAST_VALUE and NTH_VALUE

What if we want to look up a specific row from the table? This is where these guys come in:

FIRST_VALUE – returns the first value in the window frame
LAST_VALUE – returns the last value in the window frame
NTH_VALUE – returns the nth value in the window frame

Seems pretty self-explanatory. So how would we use them?

Let's say we want to compare each row to the overall maximum and minimum values of a column:

SELECT
  date,
  metric,
  FIRST_VALUE(metric) OVER(
    ORDER BY metric ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS min_metric,
  LAST_VALUE(metric) OVER(
    ORDER BY metric ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS max_metric
FROM table_1;

| date   | metric | min_metric | max_metric |
| ------ | ------ | ---------- | ---------- |
| date_1 | 100    | 90         | 150        |
| date_2 | 110    | 90         | 150        |
| date_3 | 90     | 90         | 150        |
| date_4 | 140    | 90         | 150        |
| date_5 | 150    | 90         | 150        |

Let's see what's happening here:

FIRST_VALUE(metric)
   OVER(ORDER BY metric ASC
   ROWS BETWEEN UNBOUNDED PRECEDING
   AND UNBOUNDED FOLLOWING)

We are telling the database to get the first value of metric when the rows are ordered by metric in ascending order. This returns the smallest value of metric. Note that we could also order by another column (date). That would return the metric value linked to the smallest value of date (the metric value on the earliest date).

We are also specifying the window of the calculation: 
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

This tells the database to use the whole table as the window. If we wanted, we could fine-tune the window to use only N rows in each direction.
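For example, here is a small sketch of a bounded frame against the same table_1: instead of the whole table, the window covers only the current row and the two rows before it (ordered by date), so FIRST_VALUE returns the metric from two rows back (or the earliest available row near the start of the table).

SELECT
  date,
  metric,
  FIRST_VALUE(metric) OVER(
    ORDER BY date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS metric_two_rows_back
FROM table_1;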

NTH Value

Maybe you want to show the highest value in a partition, or the 3rd highest, or the 5th lowest.

SELECT
   date,
   metric,
LAST_VALUE(metric) 
   OVER(ORDER BY metric ASC
   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
   ) as top_1_metric,
NTH_VALUE(metric,2) 
   OVER(ORDER BY metric DESC
   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
   ) as top_2_metric,
NTH_VALUE(metric,3) 
   OVER(ORDER BY metric DESC
   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
   ) as top_3_metric
FROM
table_1;
| date   | metric | top_1_metric | top_2_metric | top_3_metric |
| ------ | ------ | ------------ | ------------ | ------------ |
| date_1 | 100    | 150          | 140          | 110          |
| date_2 | 110    | 150          | 140          | 110          |
| date_3 | 90     | 150          | 140          | 110          |
| date_4 | 140    | 150          | 140          | 110          |
| date_5 | 150    | 150          | 140          | 110          |

There is much more to analytic functions in SQL, and each database has its own variations and special functions. However, the ones above are the base analytic functions that should help you grasp the more specialized ones as well.

Water quality in Estonian swimming pools?

Last year I decided to start swimming regularly. At the time I thought it was a good opportunity to look at what public data is available about public pool water quality in Estonia. I found that, with some exceptions, the vast majority of swimming pools in Estonia had passed the health board inspection. What's more, they had been doing so consistently since the 1990s. Good news: the pools were safe to swim in.

But I still felt kinda funky after each time I went swimming. Maybe I'm just paranoid…

This got me thinking. Even if the pools are safe, they can still be dirty. So I decided to dig up the pool data again and take a closer look at the specific tests that the health board does.

!!! Keep in mind that I am not in any way qualified to assess which water quality tests are good or bad, or to criticize the methodology of the health board officials. This is just a curious look into the data sets that are publicly available.

Get the data

Estonia and the City of Tallinn have pretty good public data infrastructure. For example, there is the national open data platform that collects public domain data sets from government institutions:

https://opendata.riik.ee/en/

It’s a great resource and contains droves of interesting data waiting for analysis or visualization.

I was more interested in water quality data and sure enough the data was listed in the portal:
https://opendata.riik.ee/en/andmehulgad/basseinid/
https://opendata.riik.ee/en/andmehulgad/basseinide-veeproovid/
https://opendata.riik.ee/en/andmehulgad/ujulad/
No scraping necessary, that’s a great start 😀

It’s in XML…parse the data

Not so fast though. The data is offered in XML format. This presented some problems. I couldn’t read the XML data straight into an analysis tool. I had to extract and transform it to a more normalized shape first.

Admittedly I have never really felt comfortable parsing XML. I manage, but I have the sinking feeling I am doing something completely backwards. No point in moaning about it though. I fired up Python and got started.

I used the ElementTree library to digest the XML in Python. First I created a dictionary of lists shaped like what I wanted my output data to look like. This also served as temporary storage for the data I got out of the XML. I know there are better ways to parse the file, but I wanted to get the data out quickly, so I just started looping over the different levels of the XML tree structure, appending each record's most granular level of detail into my dictionary. Afterwards I transformed the dictionary into a pandas dataframe and saved the output as a csv.

import urllib.request

import pandas as pd
import xml.etree.ElementTree as et

url = 'http://vtiav.sm.ee/index.php/opendata/basseini_veeproovid.xml'

# Download and parse the XML feed
parsed_data = et.parse(urllib.request.urlopen(url))

# Temporary storage: one list per output column
proc_data_dict = {
    'pool_id': [],
    'pool_nm': [],
    'test_dt': [],
    'proc_id': [],
    'proc_rs': [],
    'proc_test_nm': [],
    'proc_test_rs': [],
    'proc_test_val': [],
    'proc_test_unit': []
}

root = parsed_data.getroot()

# Loop over the nested levels of the tree and collect one output row
# per individual test result, pulling pool and procedure details from
# the enclosing elements
for i in root:
    for i2 in i:
        for i3 in i2:
            if i3.find('hinnang') is not None:
                for i4 in i3:
                    for i5 in i4:
                        if i5.find('id') is not None:
                            proc_data_dict['proc_id'].append(i3.find('katseprotokolli_number').text)
                            proc_data_dict['proc_rs'].append(i3.find('hinnang').text)
                            proc_data_dict['proc_test_nm'].append(i5.find('nimetus').text)
                            proc_data_dict['proc_test_rs'].append(i5.find('hinnang').text)
                            proc_data_dict['proc_test_val'].append(i5.find('sisaldus').text)
                            proc_data_dict['proc_test_unit'].append(i5.find('yhik').text)
                            proc_data_dict['pool_id'].append(i.find('bassein_id').text)
                            proc_data_dict['pool_nm'].append(i.find('bassein').text)
                            proc_data_dict['test_dt'].append(i.find('proovivotu_aeg').text)

pd.DataFrame(proc_data_dict).head()

Not too bad. I got the data out, but that nested for loop hurts my brain. Here is the output:

| pool_id | pool_nm           | test_dt          | proc_id        | proc_rs           | proc_test_nm          | proc_test_rs      | proc_test_val | proc_test_unit |
| ------- | ----------------- | ---------------- | -------------- | ----------------- | --------------------- | ----------------- | ------------- | -------------- |
| 344     | Laine SPA bassein | 01.10.2020 09:00 | KL2015/V48283K | ei vasta nõuetele | Seotud kloor (<31 °C) | ei vasta nõuetele | 1.4           |                |
| 344     | Laine SPA bassein | 01.10.2020 09:00 | KL2015/V48283K | ei vasta nõuetele | Hägusus               | vastab nõuetele   | 1.0           |                |

Column descriptions:

pool_id: unique id for each pool in a recreation center
pool_nm: pool name (in Estonian)
test_dt: date of the procedure
proc_id: unique id of the procedure
proc_rs: procedure result (vastab nõuetele – test passed; ei vasta nõuetele – test not passed)
proc_test_nm: name of a test in the procedure
proc_test_rs: result of a test in the procedure
proc_test_val: value of a test in the procedure
proc_test_unit: measuring unit of the test value

Analyze and visualize

At first I wanted to do the whole process in Python.

But I was getting a bit tired, and when I'm tired or lazy I prefer to explore data in Tableau.
I feel like it gives me a more responsive experience for following my thoughts and quickly chasing any rabbit holes I find.

I'm not going to look at the ok / not ok verdicts the health board gives out, which indicate when a pool is unsafe to swim in and needs to be closed. Instead I want to look at the details. There are dozens of test procedures that go into measuring whether a pool is clean or not. These include, but are not limited to:

  • Levels of ammonia
  • Levels of Coli type bacteria
  • Levels of all kinds of other bacteria
  • Levels of chlorination
  • Transparency of water

Considering this, I generated a few new fields in the data (a rough SQL sketch of these calculations follows the list):

  • total number of test procedures
  • number of failed test procedures
  • ratio of failed to total test procedures (failure rate)
  • median monthly failure rate for each pool
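To be clear, my actual calculations were done in Tableau; the snippet below is just an illustrative sketch, assuming the parsed csv has been loaded into a database table I'm calling pool_tests, with the columns described earlier, and using the per-test result column as a simplified failure flag.

-- Illustrative only: pool_tests is a hypothetical table holding the parsed csv
SELECT
  pool_nm,
  COUNT(*) AS total_tests,
  SUM(CASE WHEN proc_test_rs = 'ei vasta nõuetele' THEN 1 ELSE 0 END) AS failed_tests,
  SUM(CASE WHEN proc_test_rs = 'ei vasta nõuetele' THEN 1 ELSE 0 END) * 1.0
    / COUNT(*) AS failure_rate
FROM pool_tests
GROUP BY pool_nm;
-- The median monthly failure rate would be a second step: compute the
-- failure rate per pool per month first, then take the median of those
-- monthly rates for each pool.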

Using these new fields I focused on the top (or rather worst) performers: the 11 pools that have a median test procedure failure rate of 50% or more.

This dubious honor goes to the following establishments:

| Pool name                                    | Median test procedure failure rate |
| -------------------------------------------- | ---------------------------------- |
| Kreutzwald Hotel Tallinn Zen-SPA             | 50% |
| Kristiine spordihalli ujula                  | 50% |
| MTÜ Spordiklubi Garant ujula                 | 50% |
| Sanatoorium Tervis Vabaajakeskus             | 50% |
| Tervise Paradiisi veekeskus                  | 50% |
| Viiking Saaga vee- ja saunakeskus (E-korpus) | 56% |
| Estonia Termid                               | 57% |
| PANNJÄRVE TERVISESPORDIKESKUS SA ujula       | 58% |
| Pärnu Spordikooli ujula                      | 62% |
| Tootsi                                       | 62% |
| TRK Viiking saunakeskus (C-korpus)           | 73% |

I also created an interactive dashboard and loaded it onto Tableau Public. Click below to take a look.

  • You can see the median failure rate on the left.
  • The historical failure rate of the top 11 is displayed in the top right.
  • The scatter plot on the bottom right plots the historical failure rate for all pools that were still active in 2019.

When you hover on the marks on the scatter plot you can also see a detailed breakdown of total vs failed tests and the specific tests each pool failed at each year.

There we go. Yes, some pools might still be dirty, I am not paranoid 🙂

I would love to get back to this data in the future. I would like to learn more about the tests the health board runs, what each test's implications for health are, and what the official limits for each test are.

The data set is also quite rich and I only extracted a slice of it. If you are curious you could look at the performance of specific tests or search for correlations between seasonality and test results. Maybe the capacity of the pools and the locations of the pools also affect the water quality measurements?

I also uploaded the code and the csv's to my GitHub.

All the best

Rainer

(re)Learning SQL – Window functions

Some time last year I felt that I needed to brush up on SQL. I took a few evenings to skim some courses and online resources (Stack Overflow, Vertabelo Academy, db-fiddle.com).

Below are some of my thoughts and notes from going over the material.

* Window functions
* Rows and range
* Partition by

Window Functions 

I have been using window functions for calculating simple cross table totals and rolling totals.
This is pretty simple and can be achieved just with
SUM(column) OVER()
for getting the total for the whole table.

We can use something like:
SUM(column) OVER(ORDER BY date_column ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) for the row-by-row rolling total. Let's get into the last one a bit more.

CREATE TABLE orders (
placed_dt DATE,
total_price NUMERIC
);
INSERT INTO orders(placed_dt, total_price)
VALUES
('2020-01-01',1.23), 
('2020-01-02',2.50), 
('2020-01-03',5.23), 
('2020-01-04',1.50), 
('2020-01-04',9.53), 
('2020-01-06',8.90), 
('2020-01-07',0.23), 
('2020-01-08',5.50), 
('2020-01-09',1.23), 
('2020-01-10',2.60);

SELECT
  placed_dt,
  SUM(total_price) OVER(
    ORDER BY placed_dt
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rolling_total
FROM orders;


We are calculating a rolling sum of total_price: for each row, the sum of the current row and all preceding rows. Here OVER defines the window calculation, that is, the sub-selection (window) of rows we want the SUM calculation to work on. ORDER BY defines the order of the rows in the window calculation. ROWS can use the following arguments to define the size of the window for the calculation:

UNBOUNDED PRECEDING
CURRENT ROW
UNBOUNDED FOLLOWING
BETWEEN [number] PRECEDING AND [number] FOLLOWING

These can be abbreviated by specifying only the starting bound and omitting the rest; the frame then ends at the current row, so all we calculate is in one direction relative to the current row (see the short sketch after the list):

  • ROWS UNBOUNDED PRECEDING means BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  • ROWS [number] PRECEDING means BETWEEN [number] PRECEDING AND CURRENT ROW
  • ROWS CURRENT ROW means BETWEEN CURRENT ROW AND CURRENT ROW
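For example, these two queries are equivalent; ROWS 2 PRECEDING is just shorthand for the full BETWEEN form (a rolling sum over the current row and the two rows before it):

SELECT
  placed_dt,
  SUM(total_price) OVER(
    ORDER BY placed_dt
    ROWS 2 PRECEDING) AS rolling_sum_short
FROM orders;

SELECT
  placed_dt,
  SUM(total_price) OVER(
    ORDER BY placed_dt
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_sum_full
FROM orders;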

Window functions nested in calculated fields

Window functions can also be used inside calculated fields. Below is an example of calculating the ratio of a purchase to the average purchase price in a 3-row window (the current row plus one row on each side):

SELECT
      placed_dt,
      total_price,
      total_price / (AVG(total_price) OVER (
           ORDER BY placed_dt 
           ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)) as ratio
FROM orders

*Remember that the frame bounds (what you write after ROWS or RANGE) must be in logical order, from the preceding bound to the following bound. So we need to write:

ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW

NOT

ORDER BY year ROWS BETWEEN CURRENT ROW AND 2 PRECEDING

Rows and Range

We could also use RANGE instead of ROWS in the window function.

The difference between ROWS and RANGE is that ROWS treats every row individually, while RANGE also pulls in the rows that share the same value in the column we order by, so all rows with the same ORDER BY value get the same result.

ROWS applies the window function row by row, across exactly the rows specified:

SELECT
placed_dt,
total_price,
SUM(total_price) OVER (
  ORDER BY placed_dt
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as window_rolling_sum
FROM orders


Output:

| placed_dt  | total_price | window_rolling_sum |
| ---------- | ----------- | ------------------ |
| 01-01-2020 | 1.23        | 1.23               |
| 02-01-2020 | 2.50        | 3.73               |
| 03-01-2020 | 5.23        | 8.96               |
| 04-01-2020 | 1.50        | 10.46              |<- grouped row by row
| 04-01-2020 | 9.53        | 19.99              |<- grouped row by row


RANGE groups the window function by the ORDER BY columns

SELECT
placed_dt,
total_price,
SUM(total_price) OVER (
  ORDER BY placed_dt
  RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as window_rolling_sum
FROM orders


Output:

| placed_dt  | total_price | window_rolling_sum |
| ---------- | ----------- | ------------------ |
| 01-01-2020 | 1.23        | 1.23               |
| 02-01-2020 | 2.50        | 3.73               |
| 03-01-2020 | 5.23        | 8.96               |
| 04-01-2020 | 1.50        | 19.99              |<- grouped by placed_dt
| 04-01-2020 | 9.53        | 19.99              |<- grouped by placed_dt


Using Partition By

We can do similar grouping by dropping ORDER BY and using PARTITION BY instead:

SELECT
placed_dt,
total_price,
SUM(total_price) OVER(
  PARTITION BY placed_dt) AS window_sum
FROM orders


Output:

| placed_dt  | total_price | window_sum |
| ---------- | ----------- | ---------- |
| 01-01-2020 | 1.23        | 1.23       |
| 02-01-2020 | 2.50        | 2.50       |
| 03-01-2020 | 5.23        | 5.23       |
| 04-01-2020 | 1.50        | 11.03      |<- grouped by placed_dt
| 04-01-2020 | 9.53        | 11.03      |<- grouped by placed_dt


Order of operations

This is a good spot to refresh the general rules for SQL order of operations, that is, the order in which the database processes your query, starting from the first step:

  1. FROM – the database gets the data from tables in FROM clause and if necessary performs the JOINs,
  2. WHERE – the data are filtered with conditions specified in WHERE clause,
  3. GROUP BY – the data are grouped by the columns specified in the GROUP BY clause,
  4. aggregate functions — the aggregate functions are applied to the groups created in the GROUP BY phase,
  5. HAVING – the groups are filtered with the given condition,
  6. window functions,
  7. SELECT – the database selects the given columns,
  8. DISTINCT – repeated values are removed,
  9. UNION/INTERSECT/EXCEPT – the database applies set operations,
  10. ORDER BY – the results are sorted,
  11. OFFSET – the first rows are skipped,
  12. LIMIT/FETCH/TOP – only the first rows are selected

As we can see, window functions are executed after WHERE, GROUP BY and HAVING. Practically, this order means that you can't put window functions in the FROM, WHERE, GROUP BY or HAVING clauses.
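For example, if we wanted to keep only the rows where the rolling sum from the earlier example exceeds some threshold, a sketch of the subquery workaround could look like this (the threshold is arbitrary):

SELECT placed_dt, total_price, rolling_sum
FROM (
  SELECT
    placed_dt,
    total_price,
    SUM(total_price) OVER(
      ORDER BY placed_dt
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rolling_sum
  FROM orders
) AS t
WHERE rolling_sum > 10;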

To recap

  • You can define a window frame within OVER(...). The syntax is: [ROWS|RANGE] <window frame definition>.
  • <window frame definition> is defined with BETWEEN <lower bound> AND <upper bound>, where the bounds may be defined with:
  • UNBOUNDED PRECEDING,
  • n PRECEDING (ROWS only),
  • CURRENT ROW,
  • n FOLLOWING (ROWS only),
  • UNBOUNDED FOLLOWING
  • ROWS always treats rows individually (like the ROW_NUMBER() function), RANGE also considers rows which share the same value in the column we order by.
  • Window functions can only appear in the SELECT and ORDER BY clauses.
  • If you need window functions in other parts of the query, use a subquery.
