Based on the name of my blog (DataStrides), and the contents of this post, you might be able to guess that I am an avid endurance athlete in my spare time. So I hope to have many crossover posts that bring these two worlds together, because if there is one thing I have learned being a part of the endurance community, it is that they LOVE data.
Note: This post assumes a basic understanding of Python, HTML, and CSS.
So when would you want to use data scraping? Simply put, in any case where you want a data set that does not already exist in a clean and downloadable form, or through an API.
Let’s get started. The first thing you will want to do is import your packages. For this script we will use html from lxml, requests, bs4 (BeautifulSoup), and Pandas.
from lxml import html
import requests
from bs4 import BeautifulSoup
import pandas as pd

# The scraper goes to the starting URL and gathers the URLs for each year's IM results.
# It then iterates through the pagination for each race and stores the rows in a CSV / pandas DataFrame.
We will use html and requests to interact with the URLs and the HTML we are scraping, BeautifulSoup as our scraping tool, and Pandas for data manipulation.
Next, we will pull in the starting link and transform the HTML from the site into a format that BeautifulSoup understands:
# Get starting URL in place.
url = 'http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?rd=20161008#axzz4rGjY7ruv'
response = requests.get(url)
html = response.content
# Parse the raw HTML so BeautifulSoup can search it.
soup = BeautifulSoup(html, 'lxml')
Commonly, you will not have all the data you want for your dataset on a single page, so you will need to loop through pages. We can do this pretty easily with lists, for loops, and a look at how the website builds its pagination URLs.
Let's take a look at our starting page. You can see on the right-hand side (highlighted) all the links to the different pages of race results we want. If we inspect the element (F12, or right click > ‘Inspect Element’), we can see that the links are stored in an Unordered List (UL) with an ID of “raceResults”. BeautifulSoup essentially works by using HTML and CSS elements to know where to look on the web page, and then grabbing whatever you want from that structure.
# Get all race links.
raceLinks = []
for ul in soup.find_all('ul', {'id': 'raceResults'}):
    for link in ul.find_all('a', href=True):
        raceLinks.append(link['href'])
So essentially we go into the unordered list with an ID of raceResults, find all of the a tags (which hold the links), and append each href value to our list.
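If you want to try this lookup without hitting the live site, here is a minimal sketch that runs the same find_all pattern against a small inline HTML string. The markup, the `SAMPLE_HTML` name, and the `extract_race_links` helper are all made up for illustration; the real page may be structured differently. It uses Python's built-in `html.parser` so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the results page; the real markup may differ.
SAMPLE_HTML = """
<ul id="raceResults">
  <li><a href="/results.aspx?rd=20161008">2016</a></li>
  <li><a href="/results.aspx?rd=20151010">2015</a></li>
  <li><a href="">blank</a></li>
  <li><a>no href</a></li>
</ul>
<ul id="other"><li><a href="/about">About</a></li></ul>
"""


def extract_race_links(page_html):
    """Return the non-empty href values found inside <ul id="raceResults">."""
    soup = BeautifulSoup(page_html, "html.parser")
    links = []
    for ul in soup.find_all("ul", {"id": "raceResults"}):
        # href=True only matches <a> tags that carry an href attribute.
        for a in ul.find_all("a", href=True):
            if a["href"]:  # skip empty href values
                links.append(a["href"])
    return links


race_links = extract_race_links(SAMPLE_HTML)
print(race_links)
```

Links outside the raceResults list (the "About" link here) are ignored because the search starts from the matching ul element.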
We grab some blank links along the way, and the 2002 results were giving us some issues, so let's remove them for now. We can always go get the 2002 data manually later:
# Get rid of blank links.
raceLinks = [x for x in raceLinks if x]
# Get rid of the 2002 link, as its results are in a bad format and break the code.
raceLinksFin = raceLinks[0:14]
Alright, so our URL list is clean. To add another layer of fun, each result set spans N pages that we will have to step through:
How will we get all of these links? You guessed it, more loops! So our outer loop will be to loop through the initial links we gathered, and our inner loop will be to loop through all of the pages. Note that all the code beyond this will be contained in this for loop.
# Loop through races to gather data
mDF = pd.DataFrame()
for race in raceLinksFin:
Then let's grab the date for the page we are looking at so we can tag the results in our pandas DataFrame later on. You can do this by examining the URL and seeing where the date sits in the string.
    # Get the part of the URL that corresponds to the date, to identify which year the results belong to
    date = race[92:]
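Slicing at a fixed index works as long as every link has exactly the same length before the date. As a hedged alternative, here is a small standalone sketch that reads the date from the query string instead. It assumes the date always lives in the `rd` parameter, as it does in the URLs shown in this post; the `race_date_from_url` helper is a hypothetical name, not part of the original script.

```python
from urllib.parse import urlparse, parse_qs


def race_date_from_url(url):
    """Return the value of the 'rd' query parameter, or None if it is absent."""
    # urlparse separates the '#...' fragment from the query for us.
    query = parse_qs(urlparse(url).query)
    values = query.get("rd")
    return values[0] if values else None


example = ("http://www.ironman.com/triathlon/events/americas/ironman/"
           "world-championship/results.aspx?rd=20161008#axzz4rGjY7ruv")
race_date = race_date_from_url(example)
print(race_date)
```

Unlike the slice, this drops the trailing `#axzz...` fragment, so the value can be used directly as a label.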
Next we will use our packages to grab the number of pages of results for that specific race.
    # Get number of pages
    response = requests.get(race)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    numberOfPages = []
    for div in soup.find_all('div', {'class': 'pagination'}):
        for span in div.find_all('span'):
            numberOfPages.append(span.get_text())
Again, we have to clean up our result set to get rid of some unwanted data. You will always want to be printing out what you are scraping during the debugging of the script so you can see what is going on.
    # Clean non-numerics from the list of gathered data
    cleaned = [x for x in numberOfPages if x.isdigit()]
    # Convert to int
    ints = [int(x) for x in cleaned]
    # Get the max page number
    maxPages = max(ints)
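To see what this cleanup does on its own, here is a small sketch with made-up pagination labels. The `sample_labels` values and the `max_page_number` helper are hypothetical; the one design difference is that it falls back to 1 when no numeric label is found, whereas calling `max()` on an empty list raises a ValueError.

```python
def max_page_number(labels):
    """Return the largest numeric label, or 1 when no label is numeric."""
    stripped = (label.strip() for label in labels)
    pages = [int(text) for text in stripped if text.isdigit()]
    # default=1 treats a race with no pagination as a single page.
    return max(pages, default=1)


# Hypothetical text pulled from the pagination spans.
sample_labels = ["1", "2", "3", "...", "116", "Next", " 4 "]
print(max_page_number(sample_labels))
```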
So now we have the number of pages associated with that given year of results. How can we loop through these to scrape the data? Let's examine the links for the first few pages:
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?rd=20161008#axzz4rGjY7ruv
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?p=2&rd=20161008&ps=20#axzz5VEXRJnbj
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?p=3&rd=20161008&ps=20#axzz5VEXRJnbj
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?p=4&rd=20161008&ps=20
We can see that the first URL is different, but for pages 2, 3, and 4 the only thing that changes is the page number after p=. So all we need to do is modify our base URL with each new page number, up to the max number of pages, and we have our URLs!
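The pattern can be sketched in a few lines. This is only an illustration of the idea, assuming the `p`, `rd`, and `ps` parameters behave as they appear in the example URLs; `BASE` and `page_urls` are hypothetical names, and this sketch builds the query string rather than slicing an existing link.

```python
from urllib.parse import urlencode

BASE = ("http://www.ironman.com/triathlon/events/americas/ironman/"
        "world-championship/results.aspx")


def page_urls(race_date, max_pages, page_size=20):
    """Build the URLs for pages 2..max_pages of one race's results."""
    urls = []
    for page in range(2, max_pages + 1):
        # Only the page number changes from one URL to the next.
        query = urlencode({"p": page, "rd": race_date, "ps": page_size})
        urls.append(f"{BASE}?{query}")
    return urls


urls = page_urls("20161008", 4)
for u in urls:
    print(u)
```

Page 1 is left out on purpose, since its URL has a different shape and is handled separately.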
As we saw, the first URL is different, so we need to handle it differently. Feel free to look at the last screenshot which shows the HTML we are scraping.
    # Starting URL, as the route is different for page 1
    for div in soup.find_all('div', {'class': 'pagination'}):
        for link in div.find_all('a', href=True):
            firstLink = link['href']
    # Build the link route to loop through the paginated pages
    part1 = firstLink[0:91]
    part2 = firstLink[92:]
Alright, so now we have the ‘base’ link that we can loop through by doing some string manipulation on the URL and cutting out the part that references the page number. Now we get to use a super useful Pandas function called read_html, which pulls HTML tables into a list of DataFrame objects. Simply put, it makes it very simple to pull tabular data from the internet! As with BeautifulSoup, we can use the HTML / CSS attributes to specify that we want the table with the id of ‘eventResults’.
    # Get data from page 1
    df = pd.DataFrame()
    df_intial = pd.read_html(race, attrs={'id': 'eventResults'})
Then we can append the data to the DataFrame we initialized:
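    # Append the data to the DataFrame, adding the date to identify the race year.
    # read_html returns a list of DataFrames; DataFrame.append was removed in pandas 2.0, so pd.concat is used here.
    df = pd.concat([df] + df_intial)
    df['Date'] = date
    mDF = pd.concat([mDF, df])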
Finally, we loop through all the remaining pages of results for this specific year and append them to the dataset. Once all of that year's data is gathered, we go back to our outer loop and move on to the next year.
    # Loop through the remaining pages
    df_2 = pd.DataFrame()
    i = 2
    # Use <= so the final page is fetched too.
    while i <= maxPages:  #116
        securl = part1 + str(i) + part2
        print(securl)
        df_temp = pd.read_html(securl, attrs={'id': 'eventResults'})
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used here.
        df_2 = pd.concat([df_2] + df_temp)
        i = i + 1
    df_2['Date'] = date
    mDF = pd.concat([mDF, df_2])
Once these loops finish, all of our data is in the mDF DataFrame. From here, we can start doing some analysis on the DataFrame, or we can use one quick line of code to write our DataFrame to a CSV.
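If the dataset gets large, one common variation is to collect the per-page frames in a plain list and combine them once at the end, instead of growing a DataFrame on every iteration. Here is a minimal sketch of that pattern with made-up frames standing in for what read_html returns; `pages_by_year`, `frames`, and `all_results` are hypothetical names, not part of the script above.

```python
import pandas as pd

# Hypothetical per-page frames, keyed by race date.
pages_by_year = {
    "20161008": [pd.DataFrame({"Name": ["A", "B"]}), pd.DataFrame({"Name": ["C"]})],
    "20151010": [pd.DataFrame({"Name": ["D"]})],
}

frames = []
for date, pages in pages_by_year.items():
    # Combine all pages for one year, then tag them with the race date.
    year_df = pd.concat(pages, ignore_index=True)
    year_df["Date"] = date
    frames.append(year_df)

# One final concat builds the full result set.
all_results = pd.concat(frames, ignore_index=True)
print(all_results)
```

Passing `ignore_index=True` gives the combined frame a clean 0..n-1 index rather than repeating each page's row numbers.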
# Write the combined results to a CSV
mDF.to_csv('results.csv')
So that’s that, you have your data! If you are trying to scrape some less dynamic web pages, the Pandas read_html can be a super quick way to scrape some data down.
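For a feel of read_html without any network access, here is a small self-contained sketch that parses an inline HTML string. The table contents and the `SAMPLE_PAGE` name are placeholders invented for illustration; only the `eventResults` id comes from the page scraped in this post. It assumes an HTML parser that pandas supports (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only one has the id we care about.
SAMPLE_PAGE = """
<table id="nav"><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table id="eventResults">
  <thead><tr><th>Name</th><th>Country</th><th>Finish</th></tr></thead>
  <tbody>
    <tr><td>Athlete One</td><td>GER</td><td>08:06:30</td></tr>
    <tr><td>Athlete Two</td><td>USA</td><td>08:10:02</td></tr>
  </tbody>
</table>
"""

# attrs narrows the match to tables whose id is 'eventResults'.
tables = pd.read_html(StringIO(SAMPLE_PAGE), attrs={"id": "eventResults"})
results = tables[0]  # read_html always returns a list of DataFrames
print(results)
```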
As always, feel free to reach out with any comments or questions!
Here is the Github link to the code: https://github.com/OnyxEndurance/ironScraper