Introduction
Let me start out by saying, this is not the post I had planned but it is the post that I am writing because I think it is interesting and potentially useful. In this post I will regale you with the story behind why I decided to use Python to solve my particular problem and how it came about. Following this tantalizing story, I will actually walk you through what I did in Python using a dummy data set that represents what I was exporting from SQL.
The Background Story
Let me set the stage for you…
I am working on creating a new SQL query that will eventually be adopted into views that my team use on a daily basis. I’ve been working on this for a few weeks, and I finally get to the point where I want to output a pivoted summary table of the data so I can do some validation, because looking at hundreds of thousands of rows of data isn’t going to be useful to figure out what I’m looking at.
I painstakingly write the hundreds of lines of code that I need to write in SQL to “pivot” my data, only to run into a memory issue. I try a variety of optimization techniques, but nothing works! What can I do?
Well, let’s be honest. I have a lot of options. I can put the SQL into a query in PowerBI and use the power of PowerBI to create my table and many other visuals. I can export the data to Excel and play with it in there. I could have explored other SQL options. Needless to say, I had options; however, my mind thought… “Can Python do this? What are its capabilities? Should I use this opportunity as a chance to explore Python a bit?”
At first, I thought I should just user PowerBI or Excel because I know them, I am familiar with them, I could probably do what I needed to relatively quickly. After thinking about it though I decided that the potential delay by trying it in Python was worth it because it was a learning opportunity!
One more quick tidbit of information that may be useful, I am trying, with this code, to create a linear funnel of my data. So, there is a category and then a few categories where the numbers are subtracted from the category above to then create a more directed category. This is repeated a few times to get to the ultimate goal category but outlining along the way why data points are being removed by putting them in categories. This may sound very abstract; I hope this will make more sense once we look at the data and code.
The Use Case
The Data
First, let’s take a look at the shape of the data. So, my base SQL code was exporting hundreds of thousands of lines of data, but as stated above I was interested in the summary data. SQL couldn’t pivot this summary data; this is the issue I am trying to solve with Python. My summary table has 10 rows and 26 columns. The first three columns are categories that will be used to pivot the data into columns later. The remaining columns are the metrics that I want to view in a linear fashion down a column. For simplicity, and the fact that I could not think of any other potential metric options, I have paired the sample data down to 13 columns.
Note the following concerning the dataset:
E_correct_options = A_all_options - B_wrong_color - C_wrong_brand - D_wrong_dimensions
H_just_right_options = E_correct_options - F_too_old - G_too_new
J_still_want_options = H_just_right_options - I_already_owned
The Code
Now that we have an idea of what the data looks like, let’s get this data into Python. If you have any questions on this part, refer to my previous post on accessing data in Excel.
Step 1:
Import your libraries.
# Import Libraries
import pandas as pd
import numpy as np
Step 2:
Store your Excel file path in an object.
# Create File Path Object
path = r'C:\Users\britny.sarver\Documents\Python\SampleData\DummyData_RandomUseCase.xlsx'
Step 3:
Store the data from your Excel file in an object.
# Create Data Frame from Excel File
df = pd.read_excel(path)
Step 4:
Check the shape of your data.
# Check the Shape of the Data (rows, columns)
df.shape
Using the provided sample data, you should get an output of (10, 13)
.
Step 5:
Check that you are importing the right data by looking at it.
# View the Data in a Table
df
Using the provided sample data, you should get an output that looks like:
Step 6:
Unpivot the metrics in the table. In order to accomplish this, utilize the melt
function. This function requires a few inputs:
-
col_level
: Allows you to specify the level at which to melt your data. -
id_vars
: A list of columns used as identifiers / categories. -
value_vars
: A list of columns to unpivot the data for. -
var_name
: Give the variable column a specific name. -
value_name
: Give the value column a specific name.
# Unpivot the Metrics
df_unpivoted = df.melt(
col_level = 0
, id_vars = ['sizes', 'availability', 'price_range']
, value_vars = [
'A_all_options'
, 'B_wrong_color'
, 'C_wrong_brand'
, 'D_wrong_dimensions'
, 'E_correct_options'
, 'F_too_old'
, 'G_too_new'
, 'H_just_right_options'
, 'I_already_owned'
, 'J_still_want_options'
]
, var_name = 'options_category'
, value_name = 'number_of_options'
)
Step 7:
Check the shape of the new dataset.
# Check the Shape of the Data (rows, columns)
df_unpivoted.shape
Using the provided sample data, you should get an output of (100, 5)
.
Step 8:
Make sure that the Pandas settings will allow you to view all your data. You can do this by setting the max columns and rows to match the results from above. Then, view the new dataset to check it is what you are expecting.
# Set Pandas Settings to Display All Data in Table
pd.set_option('display.max_columns', 5)
pd.set_option('display.max_rows', 100)
# View the Data in a Table
df_unpivoted
Using the provided sample data, you should get an output that looks like:
Step 9:
Create a data frame that contains all of the rows where the size is S (small) and create da data frame that contains all of the rows where the size is L (large). Our goal is to have columns in our final table for each option in size, small and large, overall and for each of the other sub-categories. We will address this in the next step.
These data frames are created using the Pandas feature that allows us to filter rows by selecting a column and specifying the values in that column we want to keep.
# Create data frames for small and large options
df_s = df_unpivoted[df_unpivoted.sizes.isin(['S'])]
df_l = df_unpivoted[df_unpivoted.sizes.isin(['L'])]
Step 10:
Create a data frame for small options that are also available by utilizing the above created filtered data frame and the availability
column. Similarly, create a data frame for large options that are also available.
In order to create these data frames, we will utilize the Pandas function pivot_table
. For this to work we need a few inputs:
- The name of the data frame to pivot.
-
values
: The column we want to aggregate. -
index
: The column we want to group by and create rows. -
columns
: The column we want to group by and create columns. -
aggfunc
: The function in which we want to utilize to aggregate thevalues
. -
fill_value
: The value in which we want to use if the aggregated value is null. In order for this to make sense later, I also utilized the Pandas functionrename
to rename the column created fromavailability
.
# Create data frames for Small and Large Available options by option categories
df_s_available = pd.pivot_table(
df_s
, values = 'number_of_options'
, index = ['options_category']
, columns = ['availability']
, aggfunc = np.sum
, fill_value = 0
).rename(columns = {"Available" : "S Available"})
df_l_available = pd.pivot_table(
df_l
, values = 'number_of_options'
, index = ['options_category']
, columns = ['availability']
, aggfunc = np.sum
, fill_value = 0
).rename(columns = {"Available" : "L Available"})
Step 11:
Create a data frame for small options that are also average priced by utilizing the filtered data frames created in step 9 and the price_range
column. Similarly, create a data frame for large options that are also average priced.
Similar to step 10, we will utilize the Pandas functions pivot_table
and rename
.
# Create data frames for Small and Large Average Priced options by option categories
df_s_average = pd.pivot_table(
df_s
, values = 'number_of_options'
, index = ['options_category']
, columns = ['price_range']
, aggfunc = np.sum
, fill_value = 0
).rename(columns = {"Average" : "S Avg Price"})
df_l_average = pd.pivot_table(
df_l
, values = 'number_of_options'
, index = ['options_category']
, columns = ['price_range']
, aggfunc = np.sum
, fill_value = 0
).rename(columns = {"Average" : "L Avg Price"})
Step 12:
Create a data frame for small options by utilizing the filtered data frames created in step 9 and the sizes
column. Similarly, create a data frame for large options. Similar to step 10, we will utilize the Pandas function pivot_table
.
# Create data frame for small and large options by option categories
df_s_ = pd.pivot_table(
df_s
, values = 'number_of_options'
, index = ['options_category']
, columns = ['sizes']
, aggfunc = np.sum
, fill_value = 0
)
df_l_ = pd.pivot_table(
df_l
, values = 'number_of_options'
, index = ['options_category']
, columns = ['sizes']
, aggfunc = np.sum
, fill_value = 0
)
Step 13:
Finally, combine the data frames created in steps 10 through 12 into one data frame that will display the data as desired. In order to accomplish this we will utilize Pandas concat
function. This requires a few inputs:
- A list of data frames we wish to combine, in the order we want to combine them.
-
axis
: The column in which to join the datasets on. I also wanted to make the number output look nice, so I utilized the Pandas functionstyle.format
to add thousands separators to the numbers, where appropriate.
# Combine into the desired dataframe structure
df_final = pd.concat(
[
df_s_
, df_s_available
, df_s_average
, df_l_
, df_l_available
, df_l_average
]
, axis = 1
).style.format('{:,}')
# Display the summarized and pivoted data
df_final
Using the provided sample data, you should get an output that looks like:
Conclusion
I by no means am claiming this is the only, nor the best, way to accomplish the goal of formatting the data the way I wanted to see it; however, I found this exercise both interesting and educational. I thought it would be something interesting to share. If anything, maybe it can spark some creativity in all of you to go out and try something outside the box. At least, outside the box for you. I know this was a great experience for me! Hence why I am writing about this instead of my planned topic of cleaning data. Stick around for more posts in this series for some fun and educational posts about getting started with Python as a Data Analyst!
Associated Files:
Credits:
Photo by Christina Morillo