Can you awk what I sed?

Cleanup work continues on #letourpredictor data; in the meantime I’m doing some other hacking related to Le Tour.

Yesterday Le Tour lost yet another big contender, Chris Froome. Perhaps Team Sky made a premature call leaving Bradley Wiggins on the sidelines, as they now have no big GC hope for Le Tour. Froome crashed twice in yesterday’s stage over the cobbles, the second time winding up in the team car and officially abandoning the race.

Heal soon Mr. Froome

I have an interest in Twitter data. For the 2012 US general elections I wrote a program to monitor how major US news media outlets were calling the races. I’m also working on a D3 visualization of Portland’s snowpocalypse back in Feb of this year.

For both of these projects I wrote custom scripts to gather Twitter data. Since capturing real-time Twitter activity is an ongoing interest, I’ve wanted to create a generic script that I could fire off at will, so I figured this was a good opportunity to get started on that and continue learning Python.



Chris Froome abandoning is a Pretty Big Deal, so I decided to capture the Twitterverse around this event. Time was against me: I would have had to be awake at 5:30am PDT to even witness his Tour-ending crash, and I had zero code to capture the Twitter stream anyway.

So at about 7:30am PDT I sat down with some coffee, watching the rainy stage of reckoning come to an end, and started a mini coding marathon.

Shut up fingers!

Since I’d missed the stream, I set up a script to capture the tweets from the current time back to the time of the crash that ended Froome’s bid. I used the Twython package and it worked like a charm. Again, I’ve really been impressed with how pleasant coding can be outside of a behemoth framework like .NET. I’m not slamming .NET; it’s more that, as a relatively new coder, that’s all I’d been exposed to. While it’s great for many things, I’ve felt it can get in the way when I’m trying to do something quick and dirty.

Speaking of dirty

I made a new database to suck the tweets into and started querying Twitter. The code is here.

I sent a search query to Twitter for “froome” and captured things I thought might be interesting to visualize – the tweet text, geo location data if any, and language. There isn’t a way (that I’m aware of) to tell Twitter you only want tweets from a certain time window; you have to use the max_id field. max_id tells Twitter that you only want tweets with ids at or below the id specified, so I would take the id from the last tweet in the current data set to use for the max_id in my next query. Given rate limiting, and that I ultimately collected 40,000 tweets getting back to the crash, I had plenty of time to stretch my fingers, make lunch, knit an afghan, etc.
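
Just to make the paging idea concrete, here is a sketch of that loop. fetch_page is a stand-in for the real Twython search call, and the tweet shape (dicts with an id key) is simplified for illustration – this is not my actual code.

```python
def collect_backwards(fetch_page, max_id=None, limit=1000):
    """Page backwards through search results using max_id.

    fetch_page(max_id) returns a list of tweets (dicts with an 'id' key),
    newest first; in the real thing it would wrap a Twython search call.
    """
    collected = []
    while len(collected) < limit:
        page = fetch_page(max_id)
        if not page:
            break  # no older tweets left
        collected.extend(page)
        # Next request: only tweets strictly older than the last one seen
        # (subtracting 1 avoids re-fetching the boundary tweet).
        max_id = page[-1]["id"] - 1
    return collected

# A fake source standing in for the Twitter API, for demonstration only:
# it serves tweets with ids 100 down to 1, ten per "request".
def fake_fetch(max_id, newest=100, oldest=1, page_size=10):
    start = newest if max_id is None else min(max_id, newest)
    ids = range(start, max(oldest - 1, start - page_size), -1)
    return [{"id": i} for i in ids]
```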

That’s a lot of tweets

OK great, so where do sed and awk come in?

I’m getting there!

Initially I wanted to make one of those super cool world heat maps that showed, in timeline form, how people around the world reacted to Froome’s departure.

Such as this sexy beast

As I looked through the data however, I noticed that the geo data was mostly missing. The chart below shows the number of tweets when Froome abandoned, next to the number of tweets that had either “geo” or “coordinates” specified. Less than 5% of the tweets had geo data!

Tweets around when Froome abandoned - 4000 tweets, but less than 5% had location data!

While for many of these tweets the “location” field was specified, I didn’t want to try to convert free text to a geo locale. For starters I figured I’d just make a simple line graph of the tweet volume around the events. A screenshot is below; click here to view an interactive version.

Time in UTC

Having poked around with D3 a little I decided I wanted to investigate some prepackaged options to quickly generate this basic graph instead of coding it from scratch. I stumbled upon this site and ended up giving the Google Chart API a whirl.

(FYI, sounds like Google Charts support is ending, but the API will live on)

This is where sed and awk come in!

I have run into sed and awk occasionally in my *nix adventures. I like slick one-liners, so when I had to format my data for the Google Chart API I turned there.

Google has a very cool playground for testing out your chart. I basically just pasted my data into the Line Chart example, made some tweaks to the visuals, and I was all set.

The arrayToDataTable function takes rows in this format:

[ “x label”, “y label”],
[ “x data1”, “y data1”],
[ “x data2”, “y data2”]…

My data looked like

6, 12.14,
2, 12.15,…

So I needed to swap the columns (since ‘tweets’ was my Y data), convert the “.” to “:” for the time, surround the time in quotes, and add the brackets. sed & awk to the rescue!

No case too big, no case too small. When you need help just call…

I saved the initial data to a file, “text2.txt”. To swap the “.” to “:” –

sed 's/\./:/' text2.txt > text4.txt

And then, in one line with awk, I was able to do the rest of the transformations!

awk -F, '{print "[\"" $2 "\"," $1 "],"}' text4.txt > text5.txt

resulting in


Oops! I don’t want all those quotes around “time”, so I’ll tell awk to do something different with the first line

awk -F, '{if (NR!=1) {print "[\"" $2 "\"," $1 "],"} else {print "[" $2 "," $1 "],"}}' text4.txt > text5.txt


Ahh, that’s better!
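
For comparison, the same reshaping (swap the columns, turn the first “.” in the time into “:”, quote the time, add the brackets, and leave the header row unquoted) fits in a few lines of Python. This is a sketch, not the pipeline I actually ran, and it assumes a header line along the lines of “tweets, time”:

```python
def to_chart_rows(lines):
    """Turn 'count, hh.mm' lines into Google-Charts-style rows.

    The first line is treated as a header and left unquoted,
    mirroring the NR!=1 trick in the awk one-liner.
    """
    out = []
    for n, line in enumerate(lines, start=1):
        count, time = [f.strip().rstrip(",") for f in line.split(",")[:2]]
        time = time.replace(".", ":", 1)  # 12.14 -> 12:14
        if n == 1:
            out.append(f"[{time},{count}],")
        else:
            out.append(f'["{time}",{count}],')
    return out
```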

I’d like to do something more interesting than a line graph with my #froome data. I’m guessing sentiment analysis would be predominantly negative, so perhaps not that. 😉 I may make an attempt at mapping the data, or perhaps following the RT trail from BBC Sport’s initial tweet of Froome’s abandonment. Until next time!


A post about data denormalization and uncovering dirty data.

That’s Not Normal.

Let’s chat denormalization! Here’s a list of the tables in my database. I’ve collected stats from the 2013 TdF for the Yellow (GC), Green (Points) and Polka Dot (Mountains) jersey competitions – the dependent variables. The other tables have stats on rider performance in specific areas throughout the season, which I’m hoping will give some indication of TdF performance.

Totally tabular!

Now that I’ve got a big ole database full of stats, it’s time to denormalize the data for analysis. You can find the SQL script I used here. At the time of this writing the script was recently updated to substitute 0 for null points; thanks to this StackOverflow article for help on how to provide default values for nulls.

I chose to use left outer joins from the tdf_gc table based on rider name. This means I will have a table with a row for every rider from the tdf_gc table, with columns from the other stats tables I join. If a rider doesn’t have stats in a given table (e.g., I wouldn’t expect sprinter Mark Cavendish to have an entry in the individual_mountains table), then a null is placed in that column.
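
The shape of that query, using sqlite3 as a lightweight stand-in for Postgres. The table names echo mine, but the columns and numbers are made up for illustration, and COALESCE supplies the 0-for-null default mentioned above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE tdf_gc (rider TEXT, gc_rank INTEGER);
    CREATE TABLE individual_mountains (rider TEXT, points INTEGER);
    INSERT INTO tdf_gc VALUES ('Rider A', 1), ('Rider B', 2);
    INSERT INTO individual_mountains VALUES ('Rider A', 136);
""")
# Every tdf_gc rider gets a row; a rider with no mountains entry
# would get NULL, which COALESCE turns into 0.
rows = cur.execute("""
    SELECT g.rider, g.gc_rank, COALESCE(m.points, 0) AS mtn_points
    FROM tdf_gc g
    LEFT OUTER JOIN individual_mountains m ON m.rider = g.rider
    ORDER BY g.gc_rank
""").fetchall()
```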

Kind of an eye chart, but if you click on the image you will notice that Joaquim Rodriguez is missing quite a lot of information. It doesn’t make sense that someone ranked so high in the TdF GC doesn’t have any entry in the season GC or PCS Rank.

That’s Not Normal.

Poking around I discovered the culprit!


RODRíGUEZ, RODRÍGUEZ! Or, an international case study

Just use UPPER() or LOWER()! Not so fast, podnah – that “Í” is a character especial. Fortunately I’m using Postgres 9.3, so I can specify a locale on a per-query basis.

Using “collate” to specify locale

That works for this case, but what about the rest of the riders from other countries? Apart from inspecting each rider manually, I am presently unaware of another method for doing case conversion using a dynamic locale. Perhaps I would have benefited from screening the data for such characters especial before inserting them into my database.
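
Incidentally, outside the database Python 3 sidesteps the locale question: str.upper() and str.casefold() use Unicode’s full case mappings rather than a locale, so the stray lowercase í is handled without specifying a collation. A sketch of the screening check I could have run during cleaning:

```python
def same_rider(a, b):
    # casefold() is Unicode-aware, aggressive lowercasing, so the
    # lowercase í and uppercase Í compare equal after folding.
    return a.casefold() == b.casefold()
```

With this, comparing “RODRíGUEZ” against “RODRÍGUEZ” comes back True.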

Lance Armstrong

I notice that another rider, Daniel Garcia Navarro, is also missing data similar to Rodriguez’s. Wondering if the same internationalization issue is to blame, I check it out.

Le Sigh.

An internationalization issue indeed! Some of my data includes both “last names” for Navarro, while other data does not. Sadly, there is no SQL function for screening out irregularities in Latin last-name conventions vs. middle names vs. multiple first names.

Remember in my last post when I said I probably have more data cleaning to do?

sad Lance

I’ve heard it said that data science is primarily about getting the data clean, so I’ll step away from the mic to work some more on the data set. I’m disappointed that I won’t have a model to help me pick my Fantasy Tour de France team, but at least I have potato salad and apple pie to console me today. And unicorns, fireworks, etc.

Happy Independence Day US!

Happy Birfday #merica!

Much SQL! So excite!

So after much data scraping and cleaning I’m finally ready to start analyzing… by which I mean discover how much more cleaning I need to do 😉

Let’s have a good, clean race

I wanted to add a word on some of my data gathering techniques. I’ve been primarily scraping the data from ProCyclingStats, but sometimes I come across data that isn’t as scraper friendly. Case in point, 2013 TdF results:

Mmmm span

That’s a lot of span! I had been using next_element to parse the start list data (you can find the code for that here), but this particular piece seemed outside the 80/20 spectrum. I manually scraped and formatted the data which, while tedious, seemed reasonable to me since it is a limited, immutable set. That being said, the HTML, while span laden, has a defined structure, so if I feel inspired to write yet another scraping script I could probably do so.
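
If I do write that script, the defined structure makes it straightforward with Beautiful Soup. Here is a sketch against a made-up miniature of the markup – the real PCS HTML and class names differ; this just shows the flatten-the-nested-spans idea:

```python
from bs4 import BeautifulSoup

SAMPLE = """
<div class="result"><span>1</span><span><span>FROOME Christopher</span></span></div>
<div class="result"><span>2</span><span><span>QUINTANA Nairo</span></span></div>
"""

def parse_results(markup):
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for div in soup.find_all("div", class_="result"):
        # get_text flattens however many nested spans are in the way
        cells = [s.get_text(strip=True)
                 for s in div.find_all("span", recursive=False)]
        rows.append(tuple(cells))
    return rows
```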

Some other things I learned: loading SQL scripts in psql, which is useful if you want to forgo the whole connect-to-db thing in your scraping code:


\i is your friend

I used this method to insert the team roster data. I also used it to keep track of queries to the data set!


Which looks like this for the 2013 Tour results!

Is KOM an indicator of GC?

It’s been about 14 years since I did any SQL joins (not a lot of SQL in hardware design) and I found this oldie but goodie an excellent refresher. Also, who doesn’t love a good Venn diagram?

Hooray for Venn Diagrams!


Lions and Cyclists and Bears

Back from a hiatus (vacation), I wanted to post a quick update on my project.

Rosters for Le Tour have been announced and I have been hard at work collecting datas. I updated my GitHub with a scraping script for team roster data from ProCyclingStats. I finally have all the stats of interest in my database from 2013 and 2014.

If you want to play along you can create your own Fantasy Tour de France team. Besides trying to predict the winner, I may use the model to help pick my team, assuming it’s ready before registration closes.

Now that all the team and individual data is populated the next step is denormalizing for analysis. Stay tuned!

Le Tour: King of the Mountain

So now that I have a mountain of data, what shall I do with it?

KOM of Data

Recall that I have data sets for individual riders and teams; each individual data set contains the top 100 riders in that category. Riders that are good at mountains are typically not good at sprints, so I have different riders in each data set. How do I create a list of riders to track for the TdF?

Historically, the winner of the TdF has been ranked highly in the General Classification (GC), so that would be a good place to start. As I mentioned, the supporting actors (domestiques, in cycling language) play a key part as well. Teams announce their Tour de France rosters shortly before the event, so we don’t yet know who will be riding (as of this writing, teams have yet to announce their final rosters, though we know Sir Bradley Wiggins will most likely not be participating this year, much to his chagrin).

Pulling in the top riders (individual GC) for the top-ranked teams (team PCS ranking) should give a reasonable approximation for an initial data set, and when team rosters are announced I can cull from the herd those that won’t be making the cut.

Maybe next year, Wiggo

I imported the .tsv files created by the scraping scripts into a PostgreSQL database. I was curious about the varchar vs text tradeoffs and found this article useful. I decided to leave the data normalized as it came to me from PCS, so each metric has its own table.
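
The .tsv files are plain enough that Python’s csv module reads them directly; here’s a sketch of the parse step (the column names are invented, and my actual import went through SQL rather than this helper):

```python
import csv
import io

def read_tsv(text):
    """Parse scraper output: a header line, then one rider per tab-separated line."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]

# Invented sample row, mimicking one scraped metric
sample = "rider\tpoints\tyear\nRider A\t3624\t2013\n"
```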

An important note – presently I’m looking at data from this year. If I’m going to build a predictive model I will need to have a training data set where I know the outcome for the dependent variable – the winner. Fortunately I don’t have to fabricate one; PCS has the metrics from last year and we already know who won Le Tour. It would be preferable to have several years of examples, or to create a few more training examples based on the data we have, but for now I’m just planning on using the data from 2013. Since the database will have information spanning multiple years I added a “year” field to all the metrics.

What metric keeps track of contaminated meat consumption?

Adventures in Data Science: Le Tour

Le Tour de France. A three-week torture fest featuring svelte men in spandex rolling through the French countryside. Armed with (French) pastries, I enjoy tuning in ridiculously early in the morning to watch this soap opera on wheels.

Recently I stumbled across Pro Cycling Stats. It seemed to be a perfect intersection for my interests in cycling and data science, so I hatched a little project to see if I could predict the winner of Le Tour.

Ridiculous? Of course – this is Le Tour, after all.


Day 1. Data Gathering

Winning Grand Tours requires a great team, a strong GC candidate, and a lot of unquantifiable luck (e.g., not crashing into a Labrador, not being run off the road into a barbed-wire fence). What data, if any, would help in predicting the next TdF winner?

Pro Cycling Stats keeps track of a ton of information – General Classification (GC), special points (sprints, mountains, prologues), Tours in various parts of the world, Spring Classics performance, etc. In an effort to keep this project manageable I limited myself to about 10 individual and team stats.

ProCyclingStats GC stats

Definitely going to include GC stats…

Using Beautiful Soup I was able to scrape the stats of interest from the Pro Cycling Stats webpage. I created two generic Python scripts – one for scraping individual data, another for scraping team data. The scripts take a URL argument, so I was able to write a shell script to scrape the stats of interest. I chose to do this so I could easily add new stats pages to the analysis.
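
The generic shape of those scripts looks something like this; the argument names and defaults here are my illustration, not the actual interface of the scripts on GitHub:

```python
import argparse

def build_parser():
    # Each scraping script takes the stats page URL as an argument,
    # so a shell script can loop over whatever pages are of interest.
    parser = argparse.ArgumentParser(description="Scrape one stats page to .tsv")
    parser.add_argument("url", help="stats page to scrape")
    parser.add_argument("-o", "--out", default="stats.tsv", help="output .tsv file")
    return parser
```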

As I was looking through the data, I noticed that some stats used “.” to separate thousands and others used “.” to indicate decimals. Interesting. As you probably guessed, besides formatting differences, the scales of information are different. Team Distance is in tens of thousands of miles, whereas a metric called “Most Efficient” was measured as “Ranking of fraction of points scored on maximum points possible.” What is the maximum number of points possible? Oh good, an Explanation link!
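
Since the “.” convention is a per-column judgment call, a cleaning helper has to be told which interpretation to use. A sketch of the kind of normalizer this implies (a heuristic of mine, not anything PCS documents):

```python
def parse_stat(raw, thousands_dot=False):
    """Parse a PCS-style number.

    Some columns use '.' as a thousands separator ('12.345' meaning
    twelve thousand), others as a decimal point; the caller decides.
    """
    raw = raw.strip()
    if thousands_dot:
        return int(raw.replace(".", ""))
    return float(raw)
```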

no info on most efficient

An excellent explanation.

It would appear that I have a bona fide Real World™ data set on my hands!

I was delighted that I had these scripts up and running within an hour, with no prior experience using Python. I wasn’t mired trying to find the right HTTP API for my target framework just to connect to the damn page. Compared to getting something up and running in .NET, this was a breeze.

The code for the scraping scripts and the shell script is on my GitHub: CyclingStats.