Can you awk what I sed?

Cleanup work continues on the #letourpredictor data. In the meantime I’m doing some other hacking related to Le Tour.

Yesterday Le Tour lost yet another big contender, Chris Froome. Perhaps Team Sky made a premature call leaving Bradley Wiggins on the sidelines, as they now have no big GC hope for Le Tour. Froome crashed twice in yesterday’s stage over the cobbles, the second time winding up in the team car and officially abandoning the race.

Heal soon Mr. Froome

I have an interest in Twitter data. For the 2012 US general election I wrote a program to monitor how major US news media outlets were calling the races. I’m also working on a D3 visualization of Portland’s snowpocalypse back in February of this year.

For both of these projects I wrote custom scripts to gather Twitter data. Since capturing real-time Twitter activity is an ongoing interest, I’ve wanted to create a generic script that I could fire off at will, so I figured this was a good opportunity to get started on that and continue learning Python.



Chris Froome abandoning is a Pretty Big Deal, so I decided to capture the Twitterverse around this event. Time was against me: I would have had to be awake at 5:30am PDT to even witness his Tour-ending crash, and I had zero code to capture the Twitter stream anyway.

So at about 7:30am PDT I sat down with some coffee, watching the rainy stage of reckoning come to an end, and started a mini coding marathon.

Shut up fingers!

Since I’d missed the stream, I set up a script to capture the tweets from the current time back to the time of the crash that ended Froome’s bid. I used the Twython package and it worked like a charm. Again, I’ve really been impressed with how pleasant coding can be outside of a behemoth framework like .NET. I’m not slamming .NET; it’s more that, as a relatively new coder, that’s all I’d been exposed to. While it’s great for many things, I’ve felt it can get in the way when I’m trying to do something quick and dirty.

Speaking of dirty

I made a new database to suck the tweets into and started querying Twitter. The code is here.

I sent a search query to Twitter for “froome” and captured things I thought might be interesting to visualize: the tweet text, geo location data if any, and language. There isn’t a way (that I am aware of) to tell Twitter you only want tweets from a certain time; you have to use the max_id field. max_id tells Twitter that you only want tweets with an id at or below the one specified, that is, tweets that occurred at or BEFORE that tweet; so I would take the id from the last tweet in the current data set to use for the max_id in my next query. Given rate limiting, and that I ultimately collected 40,000 tweets getting back to the crash, I had plenty of time to stretch my fingers, make lunch, knit an afghan, etc.
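The max_id paging idea can be sketched roughly as follows. This is a minimal illustration, not the actual script: `collect_backwards`, `stop_id` and `fake_search` are hypothetical names, and the stopping rule here is an id threshold rather than a timestamp. The `search` argument is assumed to be any callable that takes `q`, `count` and `max_id` keyword arguments and returns a dict with a "statuses" list, newest first; with Twython one would presumably pass the client's `search` method, which is not exercised here.

```python
def collect_backwards(search, query, stop_id, page_size=100):
    """Page backwards through search results until stop_id is passed.

    `search` is any callable accepting q, count and (optionally) max_id
    keyword arguments and returning {"statuses": [...]}, newest first.
    """
    collected = []
    max_id = None
    while True:
        params = {"q": query, "count": page_size}
        if max_id is not None:
            params["max_id"] = max_id
        statuses = search(**params)["statuses"]
        if not statuses:
            break  # nothing older is available
        for tweet in statuses:
            if tweet["id"] < stop_id:
                return collected  # we have gone past the moment of interest
            collected.append(tweet)
        # Treat max_id as inclusive: subtract 1 so the oldest tweet
        # of this page is not fetched again on the next request.
        max_id = statuses[-1]["id"] - 1
    return collected


def fake_search(q, count, max_id=None):
    """Stand-in for a real search call: pretends tweets 20..1 exist."""
    ids = [i for i in range(20, 0, -1) if max_id is None or i <= max_id]
    return {"statuses": [{"id": i, "text": q} for i in ids[:count]]}


tweets = collect_backwards(fake_search, "froome", stop_id=5, page_size=4)
print(len(tweets), tweets[0]["id"], tweets[-1]["id"])
```

A real run would also need to sleep between requests to respect rate limits, which this sketch leaves out.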

Thats a lot of Tweets

OK great, so where do sed and awk come in?

I’m getting there!

Initially I wanted to make one of those super cool world heat maps that showed, in timeline form, how people around the world reacted to Froome’s departure.

Such as this sexy beast

As I looked through the data, however, I noticed that the geo data was mostly missing. The chart below shows the number of tweets when Froome abandoned, next to the number of tweets that had either “geo” or “coordinates” specified. Fewer than 5% of the tweets had geo data!

Tweets around when Froome abandoned - 4000 tweets, but less than 5% had location data!
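A check like this one could be sketched in a few lines of Python. The helper names `has_geo` and `geo_share` are hypothetical, the tweets are assumed to be dicts as returned by the search API, and the sample records below are made up purely for illustration (they are not from the collected data).

```python
def has_geo(tweet):
    """True if the tweet carries a non-empty "geo" or "coordinates" field."""
    return bool(tweet.get("geo")) or bool(tweet.get("coordinates"))


def geo_share(tweets):
    """Return the fraction of tweets with location data (0.0 for no tweets)."""
    if not tweets:
        return 0.0
    return sum(1 for t in tweets if has_geo(t)) / len(tweets)


# Made-up sample records, only to show the shape of the check.
sample = [
    {"id": 1, "geo": None, "coordinates": None},
    {"id": 2, "geo": {"type": "Point", "coordinates": [51.0, 3.0]}, "coordinates": None},
    {"id": 3},
    {"id": 4, "geo": None, "coordinates": None},
]
print(f"{geo_share(sample):.0%} of sample tweets have geo data")
```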

While the “location” field was specified for many of these tweets, I didn’t want to try to convert text to a geo locale. For starters I figured I’d just make a simple line graph of the tweet volume around the events. A screenshot is below; click here to view an interactive version.

Time in UTC
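One way per-minute tweet counts for a graph like this could be produced is sketched below. It assumes each tweet's `created_at` value uses the format "Wed Jul 09 12:14:05 +0000 2014"; the function name `tweets_per_minute` and the sample timestamps are made up for illustration, and the "HH.MM" label is chosen only to mirror the time format that appears in the data later in the post.

```python
from collections import Counter
from datetime import datetime, timezone

# Assumed layout of the created_at field, e.g. "Wed Jul 09 12:14:05 +0000 2014"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_minute(created_at_values):
    """Count tweets per UTC minute, returning sorted ("HH.MM", count) pairs."""
    counts = Counter()
    for raw in created_at_values:
        when = datetime.strptime(raw, CREATED_AT_FORMAT).astimezone(timezone.utc)
        counts[when.strftime("%H.%M")] += 1
    # Sorting the "HH.MM" strings is chronological within a single day.
    return sorted(counts.items())


# Made-up timestamps, only to show the shape of the result.
sample = [
    "Wed Jul 09 12:14:05 +0000 2014",
    "Wed Jul 09 12:14:40 +0000 2014",
    "Wed Jul 09 12:15:02 +0000 2014",
]
print(tweets_per_minute(sample))
```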

Having poked around with D3 a little I decided I wanted to investigate some prepackaged options to quickly generate this basic graph instead of coding it from scratch. I stumbled upon this site and ended up giving the Google Chart API a whirl.

(FYI, sounds like Google Charts support is ending, but the API will live on)

This is where sed and awk come in!

I have run into sed and awk occasionally in my *nix adventures. I like slick one-liners, so when I had to format my data for the Google Chart API I turned there.

Google has a very cool playground for testing out your chart. I basically just pasted my data into the Line Chart example and made some tweaks to the visuals and I was all set.

The arrayToDataTable function takes tuples in this format:

[ "x label", "y label"],
[ "x data1", "y data1"],
[ "x data2", "y data2"]….

My data looked like:

6, 12.14,
2, 12.15,…

So I needed to swap the columns since ‘tweets’ was my Y data, convert the “.” to “:” for the time, surround the time in quotes, and add the brackets. Sed & Awk to the rescue!

No case too big, no case too small. When you need help just call..

I saved the initial data to a file, “text2.txt”. To swap the first “.” on each line for a “:” I ran:

sed 's/\./:/' text2.txt > text4.txt

And then, in one line with awk, I was able to do the rest of the transformations!

awk -F, '{print "[\"" $2 "\"," $1 "],"}' text4.txt > text5.txt

resulting in


Ooops! I don’t want all those quotes around “‘time'”, so I’ll tell awk to do something different with the first line:

awk -F, '{if (NR!=1) {print "[\"" $2 "\"," $1 "],"} else {print "[" $2 "," $1 "],"}}' text4.txt > text5.txt


Ahh, that’s better!
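For comparison, the same reshaping could be sketched in Python. This is only an illustrative equivalent of the sed and awk steps, not something used for the post: `to_chart_rows` is a hypothetical name, and the header line in the sample is an assumption about what the first line of the file looked like (labels that already carry their own single quotes). Unlike the awk one-liner, it also trims the space after each comma.

```python
def to_chart_rows(lines):
    """Turn "count, HH.MM," lines into '["HH:MM",count],' chart rows."""
    rows = []
    for index, line in enumerate(lines):
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        count, time = fields[0], fields[1]
        if index == 0:
            # Header row: labels are assumed to be quoted already,
            # so only swap the columns and add the brackets.
            rows.append(f"[{time},{count}],")
        else:
            # Data row: "." becomes ":", the time is quoted, columns swap.
            rows.append(f'["{time.replace(".", ":", 1)}",{count}],')
    return rows


# Assumed header line plus the two data lines shown earlier.
sample = ["'tweets', 'time',", "6, 12.14,", "2, 12.15,"]
for row in to_chart_rows(sample):
    print(row)
```

For the sample above this prints ['time','tweets'], then ["12:14",6], then ["12:15",2], each on its own line.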

I’d like to do something more interesting than a line graph with my #froome data. I’m guessing sentiment analysis would come out predominantly negative, so perhaps not that. 😉 I may make an attempt at mapping the data, or perhaps follow the RT trail from BBC Sports’ initial tweet about Froome’s abandonment. Until next time!
