A post about data denormalization and uncovering dirty data.

That’s Not Normal.

Let’s chat denormalization! Here’s a list of the tables in my database. I’ve collected stats from the 2013 TdF for the Yellow (GC), Green (Points) and Polka Dot (Mountains) jersey competitions, which are the dependent variables. The other tables have stats on rider performance in specific areas throughout the season, which I’m hoping will give some indication of TdF performance.

Totally tabular!

Now that I’ve got a big ole database full of stats, it’s time to denormalize the data for analysis. You can find the SQL script I used here. At the time of this writing, the script had recently been updated to replace null points with 0s; thanks to this StackOverflow article for help on how to provide default values for nulls.

I chose to use left outer joins from the tdf_gc table based on rider name. This means I will have a table with one row for every rider in the tdf_gc table, plus columns from each of the other stats tables I join. If a rider doesn’t have stats in a given table (e.g., I wouldn’t expect sprinter Mark Cavendish to have an entry in the individual_mountains table), then a null is placed in that column.
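To make the join behaviour concrete, here is a minimal sketch rather than the actual script. It assumes Python's built-in sqlite3 module instead of Postgres, placeholder riders, and hypothetical column names (`rider`, `gc_rank`, `points`); only the table names tdf_gc and individual_mountains come from the post. `COALESCE` is one common way to turn the nulls from a left outer join into 0s.

```python
import sqlite3

# In-memory stand-in for the real database. Column names and rows are
# assumptions for illustration; only the table names come from the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tdf_gc (rider TEXT, gc_rank INTEGER);
    CREATE TABLE individual_mountains (rider TEXT, points INTEGER);
    INSERT INTO tdf_gc VALUES ('Rider A', 1), ('Rider B', 2);
    INSERT INTO individual_mountains VALUES ('Rider A', 50);
""")

# Every tdf_gc row survives the left outer join. Rider B has no
# mountains entry, so m.points is NULL and COALESCE swaps in 0.
rows = conn.execute("""
    SELECT g.rider,
           g.gc_rank,
           COALESCE(m.points, 0) AS mountain_points
    FROM tdf_gc AS g
    LEFT OUTER JOIN individual_mountains AS m ON m.rider = g.rider
    ORDER BY g.gc_rank
""").fetchall()

print(rows)  # [('Rider A', 1, 50), ('Rider B', 2, 0)]
```

Note that the join key is the rider name as a plain string, so any spelling difference between tables silently produces a row of defaults instead of an error.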

Kind of an eye chart, but if you click on that image you will notice that Joaquim Rodriguez is missing quite a lot of information. It doesn’t make sense that someone ranked so high in the TdF GC has no entry in the season GC or PCS Rank.

That’s Not Normal.

Poking around I discovered the culprit!


RODRíGUEZ, RODRÍGUEZ! Or, international case study

Just use UPPER() or LOWER()! Not so fast, podnah: that “Í” is a character especial. Fortunately I’m using Postgres 9.3, so I can specify the locale on a per-query basis.

Using Collate to specify locale

That works for this case, but what about the rest of the riders from other countries? Apart from inspecting each rider manually I am presently unaware of another method for doing case conversion using a dynamic locale. Perhaps I would have benefited from screening the data for such characters especial before inserting them into my database.
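One way to do that screening before the insert, sketched in Python rather than SQL: flag any name containing non-ASCII characters, and build a comparison key with Unicode-aware case folding. The helper names `has_non_ascii` and `name_key` are hypothetical, and this is a pre-load check under those assumptions, not a replacement for the COLLATE query.

```python
import unicodedata


def has_non_ascii(name: str) -> bool:
    """Return True if any character falls outside the ASCII range."""
    return any(ord(ch) > 127 for ch in name)


def name_key(name: str) -> str:
    """Build a case-insensitive comparison key for a name.

    NFC first, so composed and decomposed accents compare equal,
    then casefold() for locale-independent Unicode case folding.
    """
    return unicodedata.normalize("NFC", name).casefold()


# The two spellings from the post, written with escapes so the
# accented characters are unambiguous: \u00ed is í, \u00cd is Í.
mixed = "RODR\u00edGUEZ"
upper = "RODR\u00cdGUEZ"

flagged = [n for n in (mixed, upper, "CAVENDISH") if has_non_ascii(n)]
print(flagged)                            # both RODRÍGUEZ spellings
print(name_key(mixed) == name_key(upper))  # True
```

Python's case folding follows the Unicode default mappings and ignores locale, so it would not apply language-specific rules such as the Turkish dotted and dotless I.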

I notice that another rider, Daniel Garcia Navarro, is missing the same kind of data as Rodriguez. Wondering if the same internationalization issue is to blame, I check it out.

Le Sigh.

An internationalization issue indeed! Some of my data includes both “last names” for Navarro while other data does not. Sadly, there is no built-in SQL function for screening out irregularities in Latin last name conventions vs middle names vs multiple first names.
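Short of a built-in, a rough heuristic can at least surface candidate matches for manual review. The sketch below is an assumption-laden illustration, not something from my script: it strips accents, ignores case and word order, and treats two names as a possible match when one set of name parts is contained in the other. The function names and the example spellings are hypothetical.

```python
import unicodedata


def name_tokens(name: str) -> frozenset:
    """Split a name into accent-free, case-folded parts, ignoring order."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks left behind by decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return frozenset(stripped.casefold().replace(",", " ").split())


def probably_same_rider(a: str, b: str) -> bool:
    """Heuristic: one name's parts are a subset of the other's.

    Requires at least two parts on each side so that a lone shared
    surname does not count as a match. Can still give false positives.
    """
    ta, tb = name_tokens(a), name_tokens(b)
    if len(ta) < 2 or len(tb) < 2:
        return False
    return ta <= tb or tb <= ta


# Example spellings are assumed for illustration.
print(probably_same_rider("Daniel Garcia Navarro", "NAVARRO, Daniel"))  # True
print(probably_same_rider("Daniel Garcia Navarro", "Mark Cavendish"))   # False
```

Anything this flags still needs a human look, since two different riders can share a first name and one surname.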

Remember in my last post when I said I probably have more data cleaning to do?

I’ve heard it said that data science is primarily about getting the data clean, so I’ll step away from the mic to work some more on the data set. I’m disappointed that I won’t have a model to help me pick my Fantasy Tour de France team, but at least I have potato salad and apple pie to console me today. And unicorns, fireworks, etc.

Happy Independence Day US!

Happy Birfday #merica!
