In this post, we will compare how data queries are written in Pandas and SQL. We'll explore how to use Pandas in a manner similar to SQL by translating SQL queries into Pandas operations.
It's important to note that there are various ways to achieve similar results, and the translation of SQL queries to Pandas will be done by employing some of its core methods.
We aim to explore a range of Python Pandas methods, focusing on their application to the nycflights13 datasets. These datasets offer comprehensive information about airlines, airports, weather conditions, and aircraft for all flights passing through New York airports in 2013.
Through this exercise, we'll not only explore Pandas functionality but also learn to apply fundamental SQL concepts in a Python data manipulation environment.
Entity-relationship diagram [ERD]
The nycflights13 library contains tables with flight data from New York airports in 2013. Below, you can find a high-level representation of an entity-relationship diagram with its five tables.
Installation: Setting Up nycflights13
To install the nycflights13 library, you can use the following command:
!pip install nycflights13
This library provides datasets containing comprehensive information about flights from New York airports in 2013. Once installed, you can easily access and analyze this flight data using various tools and functionalities provided by the nycflights13 package.
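Once the installation finishes, a quick sanity check can confirm that the data is available. Below is a minimal sketch, assuming the package exposes each of the five tables as a module-level pandas DataFrame:

# Minimal check: load the five nycflights13 tables and inspect their sizes
# (assumes each table is exposed as a module-level pandas DataFrame)
from nycflights13 import flights, airlines, airports, planes, weather

for name, table in [('flights', flights), ('airlines', airlines),
                    ('airports', airports), ('planes', planes), ('weather', weather)]:
    print(name, table.shape)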
🟢 Pandas, NumPy, and nycflights13 for Data Analysis in Python
In the next code snippet, we are importing essential Python libraries for data analysis.
🐍 Pandas is a library for data manipulation and analysis.
🐍 NumPy provides support for numerical operations.
🐍 nycflights13 is a specialized library containing datasets related to flights from New York airports in 2013.
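A minimal version of that import snippet is shown below (assuming, as above, that the nycflights13 package exposes the flights table as a DataFrame):

import pandas as pd                 # data manipulation and analysis
import numpy as np                  # numerical operations
from nycflights13 import flights    # 2013 NYC flights table

With the libraries loaded, the first example reproduces a SQL WHERE clause: it selects a handful of columns and keeps only flights departing from JFK, EWR, or LGA, not destined for MIA, and covering a distance of at most 1,000 miles.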
(flights
 .filter(['year','month','day','dep_time','flight','tailnum','origin','dest','time_hour','distance'])
 .query("(origin in ['JFK', 'EWR', 'LGA'])"
        " and (dest != 'MIA')"
        " and (distance <= 1000)")
 .head(10))
🐍 Utilizing 'WHERE' with the BETWEEN operator
To filter in Pandas on the following criteria:
✔️ Flights departing from JFK, LGA, or EWR.
✔️ Flights not destined for Miami (MIA).
✔️ Flights with a distance less than or equal to 1,000 miles.
✔️ Flights within the period from September 1, 2013, to September 30, 2013.
(flights
 .filter(['year','month','day','dep_time','flight','tailnum','origin','dest','time_hour','distance'])
 .query("(origin in ['JFK', 'EWR', 'LGA'])"
        " and (dest != 'MIA')"
        " and (distance <= 1000)"
        " and ('2013-09-01' <= time_hour <= '2013-09-30')")
 .head(10))
🐍 Utilizing 'WHERE' with the 'LIKE' Clause
To filter in Pandas on the following criteria:
✔️ Flights departing from JFK, LGA, or EWR.
✔️ Flights not destined for Miami (MIA).
✔️ Flights with a distance less than or equal to 1,000 miles.
✔️ Flights within the period from September 1, 2013, to September 30, 2013.
✔️ Flights whose tailnum contains 'N5' in the text.
(flights
 .filter(['year','month','day','dep_time','flight','tailnum','origin','dest','time_hour','distance'])
 .query("(origin in ['JFK', 'EWR', 'LGA'])"
        " and (dest != 'MIA')"
        " and (distance <= 1000)"
        " and ('2013-09-01' <= time_hour <= '2013-09-30')"
        " and (tailnum.str.find('N5') >= 0)")
 .head(10))
🐍 Utilizing 'WHERE' with Null or Not Null Values
To filter in Pandas on the following criteria:
✔️ Flights departing from JFK, LGA, or EWR.
✔️ Flights not destined for Miami (MIA).
✔️ Flights with a distance less than or equal to 1,000 miles.
✔️ Flights within the period from September 1, 2013, to September 30, 2013.
✔️ Flights whose tailnum contains 'N5' in the text.
✔️ Flights where the departure time (dep_time) is null.
(flights
 .filter(['year','month','day','dep_time','flight','tailnum','origin','dest','time_hour','distance'])
 .query("(origin in ['JFK', 'EWR', 'LGA'])"
        " and (dest != 'MIA')"
        " and (distance <= 1000)"
        " and ('2013-09-01' <= time_hour <= '2013-09-30')"
        " and (tailnum.str.find('N5') >= 0)"
        " and dep_time.isnull()")
 .head(10))
🟢 Order by Statement
The .sort_values() method in Pandas is equivalent to the ORDER BY clause in SQL.
1️⃣ **.sort_values(['origin','dest'], ascending=False)**: This method sorts the DataFrame based on the 'origin' and 'dest' columns in descending order (from highest to lowest). In SQL, this would be similar to the ORDER BY origin DESC, dest DESC clause.
2️⃣ **.sort_values(['day'], ascending=True)**: This method sorts the DataFrame based on the 'day' column in ascending order (lowest to highest). In SQL, this would be similar to the ORDER BY day ASC clause.
In both cases, you sort the DataFrame by one or more columns, specifying the sorting direction with the ascending parameter: True means ascending order, and False means descending order.
(flights
 .filter(['year','month','day','dep_time','flight','tailnum','origin','dest','time_hour'])
 .query("(origin in ['JFK', 'EWR', 'LGA'])"
        " and (dest != 'MIA')"
        " and ('2013-09-01' <= time_hour <= '2013-09-30')"
        " and (tailnum.str.find('N5') >= 0)"
        " and year.notnull()")
 .sort_values(['origin','dest'], ascending=False)
 .head(10))
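For completeness, here is a sketch of the second variant, sorting the same kind of filtered result by day in ascending order:

(flights
 .filter(['year','month','day','dep_time','flight','tailnum','origin','dest','time_hour'])
 .query("(origin in ['JFK', 'EWR', 'LGA'])"
        " and (dest != 'MIA')"
        " and ('2013-09-01' <= time_hour <= '2013-09-30')")
 .sort_values(['day'], ascending=True)   # equivalent to ORDER BY day ASC
 .head(10))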
🟢 Distinct Values: Removing Duplicates from Results
To perform a distinct select in pandas, you need to first execute the entire query, and then apply the drop_duplicates() method to eliminate all duplicate rows.
(flights
 .filter(['origin','dest','time_hour','dep_delay','arr_delay'])
 .assign(delay_total=flights.dep_delay + flights.arr_delay)
 .query("(origin in ['JFK', 'EWR', 'LGA'])"
        " and (dest != 'MIA')"
        " and ('2013-09-01' <= time_hour <= '2013-09-30')")
 .drop_duplicates())
🟢 Group by Statement
To perform a GROUP BY operation in pandas, we'll use the groupby method, which operates similarly to its SQL counterpart. Similarly, we can employ common aggregate functions such as sum, max, min, mean (equivalent to avg in SQL), and count. Below is a simple example to illustrate this process:
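A minimal sketch of a plain GROUP BY, computing the maximum departure delay per year and month (equivalent to SELECT year, month, MAX(dep_delay) FROM flights GROUP BY year, month):

(flights
 .groupby(['year','month'], as_index=False)['dep_delay']
 .max())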
In the following example, we'll explore how to implement a HAVING clause in pandas, leveraging the query method, as we've done previously for filtering.
(flights
 .groupby(['year','month'], as_index=False)['dep_delay']
 .max()
 .query('(dep_delay > 1000)')  # having
)
🟢 Group by with multiple calculations
When working with pandas and needing to perform multiple calculations on the same column or across different columns, the agg function becomes a valuable tool. It allows you to specify a list of calculations to be applied, providing flexibility and efficiency in data analysis.
The next query retrieves aggregated information from the "flights" dataset, calculating statistics such as the maximum, minimum, mean, count, and sum for the "dep_delay" and "arr_delay" columns. To achieve this in pandas, we use the agg function, which allows us to specify these calculations concisely and efficiently. The resulting DataFrame provides a clear summary of the specified metrics for each combination of "year" and "month".
🐍 python
# Multiple aggregations per column with agg
result = (flights
          .groupby(['year','month'], as_index=False)
          .agg({'dep_delay': ['max', 'min', 'mean', 'count'],
                'arr_delay': ['max', 'min', 'sum']}))

# Concatenate function names with column names
result.columns = result.columns.map('_'.join)

# Print the results
result
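As a design note, the same result can be obtained without the column-flattening step by using pandas named aggregation, where the flat output column names are declared up front (a sketch; the output names below are chosen for illustration):

result = (flights
          .groupby(['year','month'], as_index=False)
          .agg(dep_delay_max=('dep_delay', 'max'),
               dep_delay_min=('dep_delay', 'min'),
               dep_delay_mean=('dep_delay', 'mean'),
               dep_delay_count=('dep_delay', 'count'),
               arr_delay_max=('arr_delay', 'max'),
               arr_delay_min=('arr_delay', 'min'),
               arr_delay_sum=('arr_delay', 'sum')))
result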
🟢 Union Statement
To execute a UNION ALL operation in Pandas, it is necessary to create two DataFrames and concatenate them using the concat method. Unlike SQL, DataFrames in Pandas can be combined either column-wise (adding columns) or row-wise (adding rows), so it is essential to define how the concatenation should be performed:
axis=1 => Union that appends another dataset to the right, generating more columns.
axis=0 => Union that appends more rows.
In our example, we will perform the equivalent of a UNION ALL in SQL, so we will use axis=0.
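A tiny sketch of the difference between the two axes, using two hypothetical single-column DataFrames:

# Hypothetical toy frames to illustrate the two concat directions
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

pd.concat([a, b], axis=1)   # 2 rows x 2 columns: columns appended to the right
pd.concat([a, a], axis=0)   # 4 rows x 1 column: rows appended below (UNION ALL-like)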
Flights_NYC = (flights
               .filter(['year','month','day','dep_time','flight','tailnum','origin','dest','time_hour','dep_delay','arr_delay'])
               .assign(delay_total=flights.dep_delay + flights.arr_delay)
               .query("(origin in ['JFK', 'EWR', 'LGA'])"
                      " and ('2013-09-01' <= time_hour <= '2013-09-30')")
               .assign(group='NYC')
               .sort_values('delay_total', ascending=False)
               .head(3))

Flights_MIAMI = (flights
                 .filter(['year','month','day','dep_time','flight','tailnum','origin','dest','time_hour','dep_delay','arr_delay'])
                 .assign(delay_total=flights.dep_delay + flights.arr_delay)
                 .query("(dest in ['MIA', 'OPF', 'FLL'])"
                        " and ('2013-07-01' <= time_hour <= '2013-09-30')")
                 .assign(group='MIA')
                 .sort_values('delay_total', ascending=False)
                 .head(2))

# union all
pd.concat([Flights_NYC, Flights_MIAMI], axis=0)
🟢 CASE WHEN Statement
To replicate the CASE WHEN statement, we can use two different methods from NumPy:
1️⃣ If there are only two outcomes, for example labeling a flight as "Delayed" when its total delay exceeds 0 and as "On Time" otherwise, the np.where method from NumPy is utilized.
2️⃣ If there are more conditions, for example labeling Miami-area airports as "MIA", labeling the Atlanta airport as "ATL", and using "OTHER" for any other case, the np.select method from NumPy is employed. Both approaches are sketched below.
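A minimal sketch of both approaches on the flights table. The labels and the delay threshold follow the description above; 'OPF' and 'FLL' are assumed here as the other Miami-area airports, as in the UNION example:

import numpy as np

flights_case = (flights
                .filter(['origin', 'dest', 'dep_delay', 'arr_delay'])
                .assign(delay_total=flights.dep_delay + flights.arr_delay))

# Two outcomes -> np.where
# SQL: CASE WHEN delay_total > 0 THEN 'Delayed' ELSE 'On Time' END
flights_case['delay_status'] = np.where(flights_case.delay_total > 0, 'Delayed', 'On Time')

# More than two outcomes -> np.select
conditions = [flights_case.dest.isin(['MIA', 'OPF', 'FLL']),   # Miami-area airports
              flights_case.dest == 'ATL']                       # Atlanta
choices = ['MIA', 'ATL']
flights_case['dest_group'] = np.select(conditions, choices, default='OTHER')

flights_case.head(10)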