Adventures in Data Science: Le Tour

Le Tour de France. A three-week torture fest featuring svelte men in spandex rolling through the French countryside. Armed with (French) pastries, I enjoy tuning in ridiculously early in the morning to watch this soap opera on wheels.

Recently I stumbled across Pro Cycling Stats. It seemed to be a perfect intersection of my interests in cycling and data science, so I hatched a little project to see if I could predict the winner of Le Tour.

Ridiculous? Of course. This is Le Tour, after all.


Day 1. Data Gathering

Winning Grand Tours requires a great team, a strong GC candidate and a lot of unquantifiable luck (e.g. not crashing into a Labrador, not being run off the road into a barbed wire fence). What data, if any, would help in predicting the next TdF winner?

Pro Cycling Stats keeps track of a ton of information – General Classification (GC), special points (sprints, mountains, prologues), Tours in various parts of the world, Spring Classics performance, etc. To keep this project manageable, I limited myself to about 10 individual and team stats.

[Screenshot: ProCyclingStats GC stats]

Definitely going to include GC stats…

Using Beautiful Soup, I was able to scrape the stats of interest from the Pro Cycling Stats website. I created two generic Python scripts – one for scraping individual data, another for scraping team data. Each script takes a URL as an argument, so a shell script can call it once per stats page. I set it up this way so that adding a new stats page to the analysis only means adding another URL to the shell script.
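For readers who want a feel for what such a script might look like, here is a minimal sketch. It is not the code from my repo: the function names `parse_stats_table` and `fetch` are made up for illustration, and it assumes the stats page contains an ordinary HTML table whose first row holds the column headers.

```python
import sys
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def parse_stats_table(html):
    """Return the first HTML table as a list of {header: cell text} dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = table.find_all("tr")
    if not rows:
        return []
    # Assumption: the first row holds the column headers.
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        # Skip rows that do not line up with the header (spacers, footers).
        if len(cells) == len(headers):
            records.append(dict(zip(headers, cells)))
    return records


def fetch(url):
    """Download a page and return its HTML as text."""
    # Assumption: a browser-like User-Agent is set in case the site rejects bare requests.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scrape_stats.py <stats-page-url>
    for record in parse_stats_table(fetch(sys.argv[1])):
        print(record)
```

Keeping the parsing separate from the download means the parser can be tried on a saved HTML file before pointing it at the live site.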

As I was looking through the data, I noticed that some stats used “.” to separate thousands and others used “.” to indicate decimals. Interesting. As you probably guessed, besides the formatting differences, the stats are on very different scales. Team Distance is in tens of thousands of miles, whereas a metric called “Most Efficient” was measured as “Ranking of fraction of points scored on maximum points possible.” What is the maximum number of points possible? Oh good, an Explanation link!

[Screenshot: no info on “Most Efficient”]

An excellent explanation.

It would appear that I have a bona fide Real World ™ data set on my hands!
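Here is a rough sketch of how the two problems above (the ambiguous “.” and the mismatched scales) could be tamed. The helper names `parse_number` and `min_max_scale` are hypothetical, and min-max scaling is just one common option, not necessarily what this project ends up using. Since a string like “12.345” cannot tell you on its own which convention it follows, the sketch assumes you decide that per stats page and pass it in as a flag.

```python
def parse_number(text, dot_is_thousands):
    """Convert a scraped numeric string to a float.

    dot_is_thousands has to be chosen per stats page: the string alone
    cannot say whether "12.345" means 12345 or 12.345.
    """
    cleaned = text.strip()
    if dot_is_thousands:
        # Drop the thousands separators before converting.
        cleaned = cleaned.replace(".", "")
    return float(cleaned)


def min_max_scale(values):
    """Rescale values to the range [0, 1] so different stats are comparable."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]
```

For example, `parse_number("12.345", dot_is_thousands=True)` treats the dot as a separator, while the same string with `dot_is_thousands=False` is read as a decimal.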

I was delighted that I had these scripts up and running within an hour, with no prior experience using Python. I wasn’t bogged down trying to find the right HTTP API for my target framework just to connect to the damn page. Compared to getting something up and running in .NET, this was a breeze.

The code for the scraping scripts and the shell script is on my GitHub: CyclingStats.

One thought on “Adventures in Data Science: Le Tour”

  1. This is fabulous! I kind of want to talk to you about scraping PSC for a team website – I don’t think that they offer an API to plug into. Seems like this might be a cool way to get around it, but I’m not savvy enough to implement your scripts.

