So now that I have a mountain of data, what shall I do with it?
Recalling that I have data sets for individual riders and teams, each individual data set contains the top 100 riders in that category. Riders that are good at mountains are typically not good at sprints, so I have different riders in each data set. How do I create a list of riders to track for the TdF?
Historically, the winner of the TdF has been ranked highly in the General Classification (GC), so that would be a good place to start. As I mentioned, the supporting actors (domestiques in cycling language) play a key part as well. Teams announce their Tour de France rosters shortly before the event (as of this writing, teams have yet to announce their final rosters, though we know Sir Bradley Wiggins will most likely not be participating this year, much to his chagrin.) so we dont know yet who will be riding.
Pulling in the top riders (individual GC) for the top ranked teams (team PCS ranking) should give a reasonable approximation for an initial data set, and when team rosters are announced I can cull those from the herd that wont be making the cut.
I imported the .tsv files created by the scraping scripts into a PostgreSQL database. I was curious about the varchar vs text tradeoffs and found this article useful. I decided to leave the data normalized as it came to me from PCS, so each metric has its own table.
An important note – presently I’m looking at data from this year. If I’m going to build a predictive model I will need to have a training data set where I know the outcome for the dependent variable – the winner. Fortunately I don’t have to fabricate one; PCS has the metrics from last year and we already know who won Le Tour. It would be preferable to have several years of examples, or to create a few more training examples based on the data we have, but for now I’m just planning on using the data from 2013. Since the database will have information spanning multiple years I added a “year” field to all the metrics.