Web Scraping Ironman Triathlon Results to a CSV using BeautifulSoup and Pandas

Based on the name of my blog (DataStrides), and the contents of this post, you might be able to guess that I am an avid endurance athlete in my spare time.  So I hope to have many crossover posts that bring these two worlds together, because if there is one thing I have learned being a part of the endurance community, it is that they LOVE data.

Note: This post assumes a basic understanding of Python, HTML, and CSS.

So when would you want to use data scraping? Simply put, in any case where you want a data set that does not already exist in a clean and downloadable form, or through an API.

Let’s get started.  The first thing you will want to do is import your packages.  For this script we will use html from lxml, requests, bs4 (BeautifulSoup), and Pandas.

from lxml import html
import requests
from bs4 import BeautifulSoup
import pandas as pd
#Scraper goes to the starting URL and gathers the URLs for the different years' IM results.
#Then, it iterates through the pagination for each race and stores the results in a CSV / Pandas DF.

We will use html and requests to interact with the URLs / HTML we are scraping, BeautifulSoup is our scraping tool, and Pandas is used for data manipulation.

Next, we will pull in the starting link and transform the HTML from the site into a format that BeautifulSoup understands:

#Get starting URL in place.
url = 'http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?rd=20161008#axzz4rGjY7ruv'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')

Commonly, you will not have all of the data you want for your dataset on a single page, so you will need to loop through pages. We can do this pretty easily with lists, for loops, and by studying how the URLs are built into the website's pagination.

Let's take a look at our starting page. On the right-hand side (highlighted in the screenshot) are all of the links to the different pages of race results we want. If we inspect the element (F12, or right click > 'Inspect Element'), we can see that the links are stored in an unordered list (UL) with an ID of "raceResults". BeautifulSoup essentially works by using HTML and CSS elements to know where to look on the web page, and then grabbing whatever you want from that structure.

[Screenshot: the starting results page, with the "raceResults" list of year links highlighted]

#Get all race links.
raceLinks = []
for ul in soup.find_all('ul', {'id': 'raceResults'}):
    for link in ul.find_all('a', href=True):
        raceLinks.append(link['href'])

So essentially we go into the unordered list with an ID of raceResults, find all of the a tags (which store the links), and append each href value to our list.

There are some blank links we grab, and 2002 was giving us some issues, so let's remove them for now. We can always go get that year manually later:

#get rid of blank links
raceLinks = [x for x in raceLinks if x]
#get rid of the 2002 link, as its results are in a bad format and break the code.
raceLinksFin = raceLinks[0:14]

Alright, so our URL list is clean. To add another layer of fun, each result set has N pages we will have to paginate through:

[Screenshot: the pagination controls at the bottom of a results page]

How will we get all of these links? You guessed it, more loops! Our outer loop will iterate through the race links we gathered, and our inner loop will iterate through all of the pages for each race. Note that all of the code beyond this point is contained in this for loop.

#Loop through races to gather data
mDF = pd.DataFrame()
for race in raceLinksFin:

Then let's grab the date for the race we are looking at so we can tag the results in our pandas DataFrame later on.  You can do this by examining the URL and seeing where the date sits in the string.

    #get the part of the URL that corresponds to the date, to identify which year the results belong to
    date = race[92:]

Next, we will use our packages to grab the number of pages of results for that specific race.

    #Get number of pages
    response = requests.get(race)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    numberOfPages = []
    for div in soup.find_all('div', {'class': 'pagination'}):
        for span in div.find_all('span'):
            numberOfPages.append(span.get_text())

Again, we have to clean up our result set to get rid of some unwanted data. While debugging the script, you will always want to print out what you are scraping so you can see what is going on.

    #clean non-numerics from the list of gathered data
    cleaned = [ x for x in numberOfPages if x.isdigit() ]
    #convert to ints
    ints = [ int(x) for x in cleaned ]
    #get the max page number
    maxPages = max(ints)

So now we have the number of pages associated with that given year of results. How can we loop through these to scrape the data? Let's examine the links for the first few paginated pages:
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?rd=20161008#axzz4rGjY7ruv
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?p=2&rd=20161008&ps=20#axzz5VEXRJnbj
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?p=3&rd=20161008&ps=20#axzz5VEXRJnbj
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?p=4&rd=20161008&ps=20

We can see that the first URL is different, but for pages 2, 3, and 4, the only thing that changes is the page number in the p= parameter. So all we need to do is substitute each page number into our base URL, up to the max number of pages, and we have our URLs!

As we saw, the first URL is different, so we need to handle it differently.  Feel free to look at the last screenshot which shows the HTML we are scraping.

#Starting URL, as route is different for page 1
    for div in soup.find_all('div', {'class': 'pagination'}):
        for link in div.find_all('a', href=True):
            firstLink = link['href']
            
            #build link route to loop through paginated pages
            part1 = firstLink[0:91]
            part2 = firstLink[92:]

Alright, so now we have the 'base' link; with some string manipulation we can cut out the part of the URL that references the page number and loop through the pages. Now we get to use a super useful Pandas function called read_html, which pulls HTML tables from a page into a list of DataFrame objects. Simply put, it makes it very simple to pull tabular data from the internet! As with BeautifulSoup, we can use the HTML / CSS attributes to specify that we want the table with an id of 'eventResults'.

#Get data from page 1
    df = pd.DataFrame()
    df_intial = pd.read_html(race, attrs = {'id': 'eventResults'})

[Screenshot: the eventResults table of finishers on a results page]

Then we can append the data to the DataFrame we initialized:

#append data to dataframe, adding date to identify race year
    df = df.append(df_intial)
    df['Date'] = date
    mDF = mDF.append(df)

Finally, we loop through all of the remaining pages we gathered for this specific year, append them to the dataset, and once all of the data is gathered, we return to our outer loop and move on to the next year.

#Loop through remaining pages
    df_2 = pd.DataFrame()
    i = 2
    while i <= maxPages:
        securl = part1 + str(i) + part2
        print(securl)
        df_temp = pd.read_html(securl, attrs = {'id': 'eventResults'})
        df_2 = df_2.append(df_temp)
        i = i + 1
    df_2['Date'] = date
    mDF = mDF.append(df_2)

Once these loops finish, all of our data is in the mDF DataFrame. From here, we can start doing some analysis on the DataFrame, or we can use one quick line of code to write our DataFrame to a CSV.

#write the DataFrame out to a csv
mDF.to_csv('results.csv')

So that's that, you have your data! If you are trying to scrape less dynamic web pages, Pandas' read_html can be a super quick way to pull data down.
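To give a feel for how little code that can take, here is a minimal, hypothetical sketch (the URL and table index below are placeholders for illustration, not part of the Ironman script): read_html fetches the page, parses every HTML table it finds, and hands each one back as a DataFrame.

import pandas as pd

#read_html returns a list of DataFrames, one per <table> element found on the page.
#The URL here is just a placeholder for any page containing a plain HTML table.
tables = pd.read_html('https://example.com/page-with-a-table')
df = tables[0]          #grab the first table on the page
df.to_csv('table.csv')  #and write it straight to a CSV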

As always, feel free to reach out with any comments or questions!

Here is the Github link to the code: https://github.com/OnyxEndurance/ironScraper

Azure Data Warehouse Workaround – VALUES constructor to grab MIN / MAX value across several columns.

After having worked with Azure Data Warehouse for over six months, I have come to learn a lot of tricks about how to make things more performant, how to work around limitations, and so on. Thus, I will try to document these workarounds here as frequently as I can. I will note, though, that with the recent announcement of SQL Server 2019, many of these limitations will no longer be an issue, as Azure SQL DW will support all of the standard SQL Server features.

This is a good read: https://www.microsoft.com/en-us/sql-server/sql-server-2019

Please note that due to the rapid Azure release cycle, certain things I mention may get fixed sometime in the near future.  I will always link the Microsoft Docs to whatever problem I am working around, and you can start by clicking there to see if Azure DW is now supported for whatever topic I am discussing.

For today's blog: the VALUES constructor.  As you can see in the documentation, it is currently NOT supported by Azure Data Warehouse:

https://docs.microsoft.com/en-us/sql/t-sql/queries/table-value-constructor-transact-sql?view=sql-server-2017

[Screenshot: Microsoft Docs compatibility table showing the table value constructor is not supported in Azure SQL Data Warehouse]

The most common use case for this constructor is inserting some records into a table.  Unfortunately, there is not really a great workaround for this other than writing a whole bunch of individual INSERT statements rather than comma-separating the values (see the sketch below):

[Screenshot: a multi-row VALUES insert rewritten as individual INSERT statements]
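As a rough sketch of that workaround (the table and values here are made up purely for illustration), where SQL Server would let you write a single multi-row insert, Azure SQL DW needs one INSERT per row:

--Single INSERT using the VALUES constructor (works on SQL Server, not on Azure SQL DW):
INSERT INTO dbo.Colors (ColorID, ColorName)
VALUES (1, 'Red'),
       (2, 'Blue'),
       (3, 'Green');

--Azure SQL DW workaround: one single-row INSERT statement per record.
INSERT INTO dbo.Colors (ColorID, ColorName) VALUES (1, 'Red');
INSERT INTO dbo.Colors (ColorID, ColorName) VALUES (2, 'Blue');
INSERT INTO dbo.Colors (ColorID, ColorName) VALUES (3, 'Green');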

However, this is not the use case I am going to go into – we can save that for another blog post.  Rather, I want to discuss point C on the Docs page, which is something I find myself frequently using in SQL Server:

[Screenshot: point C from the Docs page, using VALUES as a derived table]

Here is an example for getting the minimum date from several different columns using this logic:

,(SELECT MIN(f) FROM (VALUES (DateTime1), 
                             (DateTime2), 
                             (DateTime3), 
                             (DateTime4)) AS Datetimes(f)) AS MinField

This logic is neat, easy to read, and the most efficient way to grab the MIN value spread across multiple columns.  I constantly use it when I need to grab the MIN amongst n candidate columns.  However, this syntax is not currently supported in Azure Data Warehouse. Instead, simply use the following in your SELECT statement:

    ,(SELECT MIN(DateTime) FROM (
        SELECT DateTime1 as DateTime
        UNION
        SELECT DateTime2 as DateTime
        UNION
        SELECT DateTime3 as DateTime 
        UNION 
        SELECT DateTime4 as DateTime) subquery
     ) AS MinDateTime

You can add as many UNIONs as you need to cover additional columns, and the logic is just a hair more expensive than the VALUES constructor you are used to on SQL Server.
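As a small variation on the same snippet (sketched here for completeness, using the same column names as above), UNION ALL also works and avoids the implicit duplicate-elimination step, since MIN does not care about duplicate values; in practice the difference is usually negligible for a handful of columns.

    ,(SELECT MIN(DateTime) FROM (
        SELECT DateTime1 as DateTime
        UNION ALL
        SELECT DateTime2 as DateTime
        UNION ALL
        SELECT DateTime3 as DateTime
        UNION ALL
        SELECT DateTime4 as DateTime) subquery
     ) AS MinDateTime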

Note that there are other methods of achieving the same result, but I have found this one to be the most performant and the easiest to read.

Even though SQL Server 2019 will be resolving this limitation, I hope you still find these few SQL snippets useful!

Feel free to leave a comment or question!

First Post – Identifying Duplicates and Fixing Incremental Merge Errors

Figured I would start small for my first post and share a simple T-SQL query that I often find myself using to identify duplicates. If you have ever written an incremental merge script, you have likely gotten an error like the following:

[Screenshot: the MERGE statement error message]

What this essentially means is that either in your target dataset or in your source dataset / query, you have a duplicate record on the key field you are merging on.  Thus, the code is trying to update the same record multiple times, which causes an error.  Take this simple table:

[Screenshot: the #Orders table, which contains duplicate OrderIDs]
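For reference, here is a hypothetical version of that table, consistent with the queries that follow (the columns and values are assumptions made for illustration, not the original data):

--A hypothetical #Orders table: OrderID 1 appears three times,
--and two of those three rows are exact duplicates of each other.
CREATE TABLE #Orders (
    OrderID  INT,
    ItemID   INT,
    ItemName VARCHAR(50)
);

INSERT INTO #Orders VALUES (1, 123, 'Helmet');
INSERT INTO #Orders VALUES (1, 123, 'Helmet');       --exact duplicate of the row above
INSERT INTO #Orders VALUES (1, 456, 'Bike Pump');    --same OrderID, different ItemID
INSERT INTO #Orders VALUES (2, 789, 'Water Bottle');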

As you can see, there are duplicate OrderIDs in the above table. The code snippet below is one very quick way to identify duplicates; it will return any OrderIDs that are duplicated. You can run it on either the source or the target; just make sure whatever you use in the GROUP BY is the key you are merging on, or the field you are trying to check for duplicates.

SELECT OrderID
FROM #Orders
GROUP BY OrderID
HAVING COUNT(OrderID) > 1

[Screenshot: query results showing OrderID 1 is duplicated]

You now know that OrderID 1 is duplicated in the table. If you want to find duplicates across more than one column, just add the additional columns to the SELECT and GROUP BY clauses.

SELECT OrderID, ItemID
FROM #Orders
GROUP BY OrderID, ItemID
HAVING COUNT(OrderID) > 1

[Screenshot: query results showing the (OrderID, ItemID) pair 1, 123 is duplicated]

Now we know there are multiple records with an OrderID of 1 and an ItemID of 123.  Next, you can take one of the returned OrderIDs and try to analyze why the row was duplicated. Run a simple SELECT (below) against the location of the duplicates, using one of the keys you identified above, to check out the data.  This will return the duplicated rows.

SELECT *
FROM #Orders
where OrderID = 1 --This is an identified duplicate. 

[Screenshot: the three rows returned for OrderID 1]

You can tell if the rows are exact duplicates by adding DISTINCT to the above query.  If only one row is returned, you know the records are exactly the same, not just two rows with the same key.

SELECT DISTINCT *
FROM #Orders
WHERE OrderID = 1

[Screenshot: the two distinct rows returned for OrderID 1]

Since one row disappeared out of the 3, it means two of the rows were exact duplicates. Your next step is to figure out why you are getting them in your result set, and whether it is valid to have duplicates on this key.  If you are doing a simple select from a table, it means your raw table has duplicates.  This becomes a question of how the table was populated, and whether or not duplicates should be allowed in this table. The above example could either be that one person bought two helmets in a single order, or there could be some sort of bug in whatever code populates the table.  The resolution will all depend on the granularity, or ‘level’ of the data you are trying to get, which I will discuss more later.

If the rows are not exact duplicates, then it means that some field is changing over the result set.  Again, the first step is to try to figure out whether these duplicates are valid.  If your ETL source is a complex query, it could be an N:1 join that is causing the duplication. To find the culprit, your goal should be to identify which column is changing across the rows of your SELECT * query.  Once you find the field, it is likely the join used to get that field, or just the nature of the raw table, that is causing the duplication. Above, we can identify ItemID as the culprit.  It is likely that you have an N:1 relationship, where many items can be in one order. However, given the above data, we also see that the OrderID and ItemID combination can be duplicated, so simply adding the ItemID to the MERGE key wouldn't solve the problem.  You would have to find some other field.
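One quick way to spot the changing column (sticking with the hypothetical #Orders columns sketched above) is to count the distinct values per column for the duplicated key:

--For a duplicated key, any column returning more than one distinct value
--varies across that key's rows and is likely what is fanning out the join.
SELECT OrderID,
    COUNT(*) AS RowsForKey,
    COUNT(DISTINCT ItemID) AS DistinctItemIDs,
    COUNT(DISTINCT ItemName) AS DistinctItemNames
FROM #Orders
WHERE OrderID = 1
GROUP BY OrderID;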

Long story short – if you are doing incremental ETL, always make sure your MERGE key is unique!  While you are doing your analysis, you should not just add fields to the merge key to make it unique.  Everything should be done purposefully, and you should always be mindful of what 'level', or granularity, your data is at. Let's take a real-world healthcare scenario:

Say you are sourcing your data from an EMR (Electronic Medical Record) system.  You might find a many-to-one relationship between the procedures undertaken in a surgery and the surgery itself.  That is, there can be multiple procedures done in a single surgery.  When designing your ETL and your data model, you need to decide what 'level' you want the surgeries to exist at. Should your target table be at the 'Surgery' level or the 'Procedure' level? Take this data set:

[Screenshot: the source surgery data, with multiple procedures per surgery]

  • Do you want to roll them up into one record and expand your columns to have additional fields for each procedure? This would be the surgery level, but it likely won't work unless a surgery can only ever have a set number of procedures associated with it.  This is generally not a good idea.

[Screenshot: the surgery data pivoted to one row per surgery, with columns for each procedure]

 

  • Do you want to just insert the ‘primary’ procedure? Is there some bit flag on the source table so you can filter on just these procedures? This is again the surgery level.

[Screenshot: the surgery data filtered to only the primary procedure per surgery]

  • Do you want to add the ProcedureID to the merge key / primary key so you can avoid issues on the merge key? This would keep the data at the procedure level. Your code might look something like this:
MERGE [datawarehouse].[Surgeries] AS TARGET
USING (
	SELECT [SurgeryID]
		,[ProcedureID]
		,[SurgeryDate]
		,[IsPrimary]
	FROM [emr].[SurgeryActuals]
	) AS Source
	ON Target.[SurgeryID] = Source.[SurgeryID]
		AND Target.[ProcedureID] = Source.[ProcedureID]
WHEN MATCHED
	THEN
		UPDATE
		SET Target.[SurgeryID] = Source.[SurgeryID]
			,Target.[ProcedureID] = Source.[ProcedureID]
			,Target.[SurgeryDate] = Source.[SurgeryDate]
			,Target.[IsPrimary] = Source.[IsPrimary]
WHEN NOT MATCHED
	THEN
		INSERT
		VALUES (
			Source.[SurgeryID]
			,Source.[ProcedureID]
			,Source.[SurgeryDate]
			,Source.[IsPrimary]
			);

As always, it will likely be up to the data consumers to decide what information they want displayed.  But when designing a Data Warehouse or Data Mart, I always err on the side of caution and include more data rather than less.  It can always be filtered down later.

Feel free to comment with any other tips / tricks / experience you have had with the above!