So after much data scraping and cleaning I’m finally ready to start analyzing.. by which I mean discover how much more cleaning I need to do 😉
I wanted to add a word on some of my data gathering techniques. I’ve been primarily scraping the data from ProCyclingStats but sometimes I come across data that isnt as scraper friendly. Case in point, 2013 TdF results:
Thats a lot of span! I had been using next_element to parse the start list data (you can find the code for that here) but this particular piece seemed outside the 80/20 spectrum. I manually scraped and formatted the data which, while tedious, seemed reasonable to me since it is a limited, immutable set. That being said the html, while span laden, has a defined structure so if I feel inspired to write yet another scraping script I could probably do so.
Some other things I learned are loading sql scripts in psql, which is useful if you want to forgo the whole connect to db thing in your scraping code:
I used this method to insert the team roster data. I also used it to keep track of queries to the data set!!
Which looks like this for the 2013 Tour results!
Its been about 14 years since I did any SQL joins (not a lot of SQL in hardware design) and I found this oldie but goodie an excellent refresher. Also, who doesn’t love a good Venn diagram?