Water quality in Estonian swimming pools?

Last year I decided to start swimming regularly. At the time, it seemed like a good opportunity to look at what public data is available about public pool water quality in Estonia. I found that, with some exceptions, the vast majority of swimming pools in Estonia had passed the health board inspection. What's more, they had been doing so consistently since the 1990s. Good news: the pools were safe to swim in.

But I still felt kinda funky after each swim. Maybe I'm just paranoid…

This got me thinking. Even if the pools are safe, they can still be dirty. So I decided to dig up the pool data again and take a closer look at the specific tests that the health board does.

!!! Keep in mind that I am not in any way qualified to assess which water quality tests are good or bad, or to criticize the methodology of the health board officials. This is just a curious look into the data sets that are publicly available.

Get the data

Estonia and the City of Tallinn have pretty good public data infrastructure. For example, there is the national open data platform that collects public domain data sets from government institutions:

https://opendata.riik.ee/en/

It’s a great resource and contains droves of interesting data waiting for analysis or visualization.

I was more interested in the water quality data, and sure enough, it was listed in the portal:
https://opendata.riik.ee/en/andmehulgad/basseinid/
https://opendata.riik.ee/en/andmehulgad/basseinide-veeproovid/
https://opendata.riik.ee/en/andmehulgad/ujulad/
No scraping necessary, that’s a great start 😀

It’s in XML… parse the data

Not so fast though. The data is offered in XML format, which presented some problems. I couldn’t read the XML straight into an analysis tool; I had to extract and transform it into a more normalized shape first.

Admittedly I have never really felt comfortable parsing XML. I manage, but I have the sinking feeling I am doing something completely backwards. No point in moaning about it though. I fired up Python and got started.

I used the ElementTree library to digest the XML in Python. First I created a dictionary of lists describing what I wanted my output data to look like; this also served as temporary storage for the data I pulled out of the XML. I know there are better ways to parse the file, but I wanted to get the data out quickly, so I just started looping over the different levels of the XML tree structure, appending each record’s most granular level of detail into my dictionary. Afterwards I transformed the dictionary into a pandas dataframe and saved the output in a CSV.

import urllib.request
import pandas as pd
import xml.etree.ElementTree as et

url = 'http://vtiav.sm.ee/index.php/opendata/basseini_veeproovid.xml'

parsed_data = et.parse(urllib.request.urlopen(url))

# One list per output column; every test row appends to each list once
proc_data_dict = {
    'pool_id': [],
    'pool_nm': [],
    'test_dt': [],
    'proc_id': [],
    'proc_rs': [],
    'proc_test_nm': [],
    'proc_test_rs': [],
    'proc_test_val': [],
    'proc_test_unit': []
}

root = parsed_data.getroot()

for i in root:                                        # one sampling event at one pool
    for i2 in i:
        for i3 in i2:
            if i3.find('hinnang') is not None:        # i3 is a test procedure
                for i4 in i3:
                    for i5 in i4:
                        if i5.find('id') is not None: # i5 is an individual test
                            proc_data_dict['proc_id'].append(i3.find('katseprotokolli_number').text)
                            proc_data_dict['proc_rs'].append(i3.find('hinnang').text)
                            proc_data_dict['proc_test_nm'].append(i5.find('nimetus').text)
                            proc_data_dict['proc_test_rs'].append(i5.find('hinnang').text)
                            proc_data_dict['proc_test_val'].append(i5.find('sisaldus').text)
                            proc_data_dict['proc_test_unit'].append(i5.find('yhik').text)
                            proc_data_dict['pool_id'].append(i.find('bassein_id').text)
                            proc_data_dict['pool_nm'].append(i.find('bassein').text)
                            proc_data_dict['test_dt'].append(i.find('proovivotu_aeg').text)

df = pd.DataFrame(proc_data_dict)
df.to_csv('basseini_veeproovid.csv', index=False)  # file name is my own choice
df.head()

Not too bad. I got the data out, but that nested for loop hurts my brain. Here is the output:

   pool_id  pool_nm            test_dt           proc_id         proc_rs            proc_test_nm           proc_test_rs       proc_test_val  proc_test_unit
0  344      Laine SPA bassein  01.10.2020 09:00  KL2015/V48283K  ei vasta nõuetele  Seotud kloor (<31 °C)  ei vasta nõuetele  1.4
1  344      Laine SPA bassein  01.10.2020 09:00  KL2015/V48283K  ei vasta nõuetele  Hägusus                vastab nõuetele    1.0
Column name – description:

pool_id: unique id for each pool in a recreation center
pool_nm: pool name (in Estonian)
test_dt: date and time of the water sampling
proc_id: unique id for the test procedure
proc_rs: procedure result (vastab nõuetele – test passed; ei vasta nõuetele – test not passed)
proc_test_nm: name of a test in the procedure
proc_test_rs: result of a test in the procedure
proc_test_val: measured value of a test in the procedure
proc_test_unit: measuring unit of the test value
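As an aside: I said there are better ways to parse this file, and the nested loop above really does hurt. Here is a flatter sketch of the same extraction. It leans only on the child tag names used in my loop ('katseprotokolli_number' marking a procedure, 'id' marking an individual test) and avoids hard-coding the nesting depth. Treat it as a sketch; I have not verified that those marker tags only appear at those levels.

import urllib.request
import pandas as pd
import xml.etree.ElementTree as et

url = 'http://vtiav.sm.ee/index.php/opendata/basseini_veeproovid.xml'
root = et.parse(urllib.request.urlopen(url)).getroot()

rows = []
for sample in root:  # one sampling event at one pool
    pool = {
        'pool_id': sample.findtext('bassein_id'),
        'pool_nm': sample.findtext('bassein'),
        'test_dt': sample.findtext('proovivotu_aeg'),
    }
    # Treat any descendant with a 'katseprotokolli_number' child as a
    # procedure, instead of counting levels down the tree
    for proc in sample.iter():
        if proc.find('katseprotokolli_number') is None:
            continue
        # Same trick one level down: an 'id' child marks an individual test
        for test in proc.iter():
            if test.find('id') is None:
                continue
            rows.append({
                **pool,
                'proc_id': proc.findtext('katseprotokolli_number'),
                'proc_rs': proc.findtext('hinnang'),
                'proc_test_nm': test.findtext('nimetus'),
                'proc_test_rs': test.findtext('hinnang'),
                'proc_test_val': test.findtext('sisaldus'),
                'proc_test_unit': test.findtext('yhik'),
            })

df = pd.DataFrame(rows)
df.head()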

Analyze and visualize

At first I wanted to do the whole process in Python.

But I was getting a bit tired, and when I’m tired or lazy I prefer to explore data in Tableau.
It gives me a more responsive experience for following my thoughts and quickly chasing any rabbit holes I find.

I’m not going to look at the ok / not-ok verdicts the health board gives out, which indicate when a pool is unsafe to swim in and needs to be closed. Instead I want to look at the details. There are dozens of test procedures that go into measuring whether a pool is clean or not. These include, but are not limited to (a small helper for translating the Estonian test names follows the list):

  • Levels of ammonia
  • Levels of coliform bacteria
  • Levels of all kinds of other bacteria
  • Levels of chlorination
  • Transparency of water
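The test names in the data are all in Estonian. The translations below are my own; the first two names appear in the sample output above, the others are guesses at names likely to occur in the full data set, so this lookup is nowhere near complete.

# English labels for Estonian test names. Only the first two are taken
# from the sample output above; the rest are assumed, for illustration.
test_nm_en = {
    'Seotud kloor (<31 °C)': 'combined chlorine (<31 °C)',
    'Hägusus': 'turbidity',
    'Vaba kloor': 'free chlorine',   # assumed name, for illustration
    'Ammoonium': 'ammonium',         # assumed name, for illustration
}

# Keep the original Estonian name where no translation is defined
df['proc_test_nm_en'] = df['proc_test_nm'].map(test_nm_en).fillna(df['proc_test_nm'])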

Considering this, I generated new fields in the data (a rough sketch of the calculation follows this list):

  • total number of test procedures
  • number of failed test procedures
  • ratio of failed to total test procedures (failure rate)
  • median monthly failure rate for each pool
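Here is roughly how those fields come together in pandas. A minimal sketch, assuming the CSV written in the parsing step above and that a procedure’s overall verdict lives in proc_rs; I built the actual fields interactively, so treat this as illustrative rather than the exact calculation behind the dashboard.

import pandas as pd

# Assumes the CSV written in the parsing step above
df = pd.read_csv('basseini_veeproovid.csv')

# One row per test procedure (assuming protocol numbers are unique);
# a procedure fails when its verdict is 'ei vasta nõuetele'
procs = df.drop_duplicates(subset='proc_id').copy()
procs['failed'] = procs['proc_rs'] == 'ei vasta nõuetele'
procs['month'] = pd.to_datetime(procs['test_dt'],
                                format='%d.%m.%Y %H:%M').dt.to_period('M')

# Total, failed, and failure rate of procedures per pool per month
monthly = (procs.groupby(['pool_nm', 'month'])
                .agg(total_procs=('failed', 'size'),
                     failed_procs=('failed', 'sum')))
monthly['failure_rate'] = monthly['failed_procs'] / monthly['total_procs']

# Median monthly failure rate per pool, worst performers first
median_rate = (monthly.groupby('pool_nm')['failure_rate']
                      .median()
                      .sort_values(ascending=False))
print(median_rate.head(11))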

Using these new fields, I focused on the top (or rather worst) performers: the 11 pools with a median test procedure failure rate of 50% or more.

This dubious honor goes to the following establishments:

Pool name                                       Median test procedure failure rate
Kreutzwald Hotel Tallinn Zen-SPA                50%
Kristiine spordihalli ujula                     50%
MTÜ Spordiklubi Garant ujula                    50%
Sanatoorium Tervis Vabaajakeskus                50%
Tervise Paradiisi veekeskus                     50%
Viiking Saaga vee- ja saunakeskus (E-korpus)    56%
Estonia Termid                                  57%
PANNJÄRVE TERVISESPORDIKESKUS SA ujula          58%
Pärnu Spordikooli ujula                         62%
Tootsi                                          62%
TRK Viiking saunakeskus (C-korpus)              73%

I also created an interactive dashboard and loaded it onto Tableau Public. Click below to take a look.

  • You can see the median failure rate on the left.
  • The historical failure rate of the top 11 is displayed in the top right.
  • The scatter plot on the bottom right shows the historical failure rate for all pools that were still active in 2019.

When you hover over the marks on the scatter plot, you can also see a detailed breakdown of total vs. failed tests and the specific tests each pool failed each year.

There we go. Yes, some pools might still be dirty. I am not paranoid 🙂

I would love to get back to this data in the future. I would like to learn more about the tests the health board runs, what each test’s implications for health are, and what the official limits for each test are.

The data set is also quite rich, and I only extracted a slice of it. If you are curious, you could look at the performance of specific tests or search for correlations between seasonality and test results. Maybe the capacity and location of the pools also affect the water quality measurements?

I also uploaded the code and the CSVs to my GitHub.

All the best

Rainer

