Introduction
Are you a data analyst who has heard of Python and everything it is capable of, and would like to try it, but don’t know where to get started? Then this article is for you. This is the third in a series of articles to help data analysts get started with Python. In this article, we will cover how to connect to data found on websites. If you do not already have Python installed, check out the first post in this series: Python for Data Analysts: Getting Started.
Before Accessing Data
Before accessing data, let’s cover a few basic Python concepts that you need to be aware of. This series walks you through getting started with Python step by step, providing code and Jupyter Notebooks, so you don’t have to be a Python expert or understand everything to follow along. Take it slow and learn by example.
Libraries
Libraries are the bread and butter of Python: they are collections of prewritten code that you make available in your own work by importing them. For example:
import pandas as pd
If you followed the tutorial from the first post in this series, then you should already have Anaconda installed. By default, Anaconda comes with many useful libraries already installed; however, you can also install additional libraries yourself. There are a few ways to do this, but in this series we will focus on accomplishing everything possible from Jupyter Notebooks. In a Notebook, you use the same command as in the Command Prompt, except you put a ‘!’ in front of it:
! pip install pandas
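Before installing anything, it can be handy to check whether a library is already available. The sketch below uses only Python's standard library; the helper name is_installed is made up for this illustration and is not part of pandas or Anaconda.

```python
from importlib.util import find_spec


def is_installed(library_name):
    """Return True if Python can find an importable library with this name."""
    return find_spec(library_name) is not None


# 'json' ships with Python itself, so it should always be found.
print(is_installed("json"))

# Check for pandas before deciding whether to run the pip command above.
print(is_installed("pandas"))
```

If the second line prints False, run the pip command shown above and then try the check again.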
For more information on the pandas library, check out the official documentation.
Terminology
Below are some key terms used throughout this post and others in the series. They may or may not be familiar to you, depending on your previous experience with coding and data analysis, so each term is defined here as it is used in this post.
- Object: An object is something you assign an entity to.
- Entity: An entity is something like a list, table, string, number, etc.
- Assign: You assign entities to objects by using ‘=’.
- Index: In Python, indexes start at 0.
- Function: A call to Python to perform an action based on the arguments provided.
- Arguments: Information passed to a function in order for the function to work as intended.
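To make these terms concrete, here is a small sketch that uses each of them. The list of states and the object names are invented for this illustration.

```python
# Assign: the list (an entity) is assigned to the object `states` using '='.
states = ["Alabama", "Alaska", "Arizona"]

# Index: indexes start at 0, so index 0 is the first item in the list.
first_state = states[0]
second_state = states[1]

# Function and arguments: len is the function and `states` is the argument.
number_of_states = len(states)

print(first_state, second_state, number_of_states)
```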
Data on a Website
So, there is a web page with data you want to play with or report on? You can do that with Python! As with anything in Python, there are many ways to accomplish this. In this post, we will cover how to get this data using the pandas library.
Notes:
- This method only works with web pages written in HTML that contain tables.
- The website must be publicly accessible; otherwise, you will get an error saying you don’t have access.
- For this example, I am pulling data from a Wikipedia page that contains a table of US States, along with some other tables.
- You can use any publicly accessible HTML site you would like; you do not have to use the site I am using.
Step 1:
In order to get started, import the pandas library into your Jupyter Notebook:
import pandas as pd
Step 2:
Assign the address (URL) of the web page you would like to gather data from to an object. As an example, I am going to pull in a table of US States from Wikipedia.
url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States#:~:text=States%20of%20the%20United%20States%20of%20America%20,Feb%2014%2C%201912%20%2015%20more%20rows%20'
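The long part of this URL after the ‘#’ is a fragment that a browser uses to scroll to and highlight text; it is not sent to the website itself. If you prefer a shorter address, the sketch below strips the fragment using Python's standard library. The name clean_url is just an illustrative choice, and I am assuming the page returns the same tables either way.

```python
from urllib.parse import urldefrag

url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States#:~:text=States%20of%20the%20United%20States%20of%20America%20,Feb%2014%2C%201912%20%2015%20more%20rows%20'

# urldefrag splits the address into the part before '#' and the fragment after it.
clean_url = urldefrag(url).url

print(clean_url)
```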
Step 3:
Using the built-in pandas function read_html, pull in a list of the tables on the web page. This step will create a list and assign it to an object.
website_df = pd.read_html(url)
Step 4:
In order to validate that you have read in what you think you have, use the type function:
type(website_df)
In this case, you should get the result list. We can check how many items are in this list by using the len function:
len(website_df)
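If you would like to try Steps 3 and 4 without an internet connection, the sketch below runs the same function on a small, made-up piece of HTML instead of a real website. The HTML content and the name sample_tables are invented for this illustration. It assumes an HTML parser such as lxml is installed, which is normally the case with Anaconda.

```python
from io import StringIO

import pandas as pd

# A made-up page with two small tables, standing in for a real website.
html = """
<html><body>
  <table>
    <tr><th>State</th><th>Capital</th></tr>
    <tr><td>Alabama</td><td>Montgomery</td></tr>
    <tr><td>Alaska</td><td>Juneau</td></tr>
  </table>
  <table>
    <tr><th>Territory</th><th>Capital</th></tr>
    <tr><td>Puerto Rico</td><td>San Juan</td></tr>
  </table>
</body></html>
"""

# read_html returns one table per <table> it finds, collected in a list.
sample_tables = pd.read_html(StringIO(html))

print(type(sample_tables))  # the container holding the tables
print(len(sample_tables))   # how many tables were found
print(sample_tables[0])     # the first table (index 0)
```

Each item in the list is a pandas DataFrame, which is the table-like entity you will work with in the rest of this series.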
Step 5:
Since a website can have multiple tables and the pandas
function brings in the tables as a list, it is worth doing some discovery into each item in the list to see which table we want to access.
In the case of the Wikipedia page above, there are 19 tables in the list. We can see what each of these tables looks like by utilizing the indexing system.
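When a page has many tables, checking them one index at a time can get tedious. One possible shortcut, sketched below, is to loop over the list and collect the size and column names of each table. The helper summarize_tables and the two stand-in tables are hypothetical and exist only for this example; with the real page you would pass website_df instead.

```python
import pandas as pd


def summarize_tables(tables):
    """Return (index, rows, columns, column names) for each table in the list."""
    summary = []
    for index, table in enumerate(tables):
        rows, columns = table.shape
        summary.append((index, rows, columns, list(table.columns)))
    return summary


# Stand-in tables built by hand so the example runs without a website.
example_tables = [
    pd.DataFrame({"Title": ["Page header"]}),
    pd.DataFrame({"State": ["Alabama", "Alaska"], "Capital": ["Montgomery", "Juneau"]}),
]

for line in summarize_tables(example_tables):
    print(line)
```

Scanning the column names in this summary is usually enough to spot which index holds the table you are after.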
Checking the first item (index = 0), we can see it is the header at the top of the page, but if we pull the second table in the list (index = 1), we will get a table of US States:
website_df[1]
The output is a pandas DataFrame containing the table of US States.
Conclusion
In this post, we have covered how to access data on a website, but data can be housed in many different locations; for example, the previous post in this series covers how to access data housed in an Excel file. Now that we have accessed some data, I’m sure you’re asking, “What can we do with it?” Though we did not answer that question in this post, stick around for the next few posts, where I will provide more information on what you can do with the data that you now have.
Credits:
Photo by Christina Morillo