Much SQL! So excite!

So after much data scraping and cleaning I’m finally ready to start analyzing.. by which I mean discover how much more cleaning I need to do 😉

Lets have a good, clean race

Lets have a good, clean race

I wanted to add a word on some of my data gathering techniques. I’ve been primarily scraping the data from ProCyclingStats but sometimes I come across data that isnt as scraper friendly. Case in point, 2013 TdF results:

Mmmm span

Mmmm span

Thats a lot of span! I had been using next_element to parse the start list data (you can find the code for that here) but this particular piece seemed outside the 80/20 spectrum. I manually scraped and formatted the data which, while tedious, seemed reasonable to me since it is a limited, immutable set. That being said the html, while span laden, has a defined structure so if I feel inspired to write yet another scraping script I could probably do so.

Some other things I learned are loading sql scripts in psql, which is useful if you want to forgo the whole connect to db thing in your scraping code:

slashi

\i is your friend

I used this method to insert the team roster data.  I also used it to keep track of queries to the data set!!

statssql

Which looks like this for the 2013 Tour results!

Is KOM an indicator of GC ?

Is KOM an indicator of GC ?

Its been about 14 years since I did any SQL joins (not a lot of SQL in hardware design) and I found this oldie but goodie an excellent refresher. Also, who doesn’t love a good Venn diagram?

Hooray for Venn Diagrams!

Hooray for Venn Diagrams!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.