Python Cheat Sheet for Data Engineers and Data Scientists!

Pavan Belagatti - Aug 31 '23 - Dev Community

Python has become an indispensable tool for both Data Engineers and Data Scientists due to its simplicity, readability, and extensive library ecosystem. For Data Engineers, Python offers robust libraries like Pandas for data manipulation, PySpark for big data processing, and SQLAlchemy for database interactions, making it easier to build scalable data pipelines. It also integrates well with cloud services and various data storage systems, streamlining the ETL (Extract, Transform, Load) processes.

On the other hand, Data Scientists benefit from Python's rich array of machine learning libraries like scikit-learn, TensorFlow, and PyTorch, as well as data visualization libraries like Matplotlib and Seaborn. Its versatility allows for end-to-end data analysis, from data collection to model deployment, all within a single programming environment. This commonality of language fosters better collaboration between Data Engineers and Data Scientists, making Python a unifying thread in the data ecosystem.

The PYPL PopularitY of Programming Language Index is created by analyzing how often language tutorials are searched on Google.

Python popularity

Python Cheat Sheet for Data Enthusiasts

This cheat sheet is a concise overview of essential Python topics and the libraries most commonly used in data engineering and data science.

- Python Basics

Variables: How to declare and initialize different types of variables.



x = 10  # Integer
y = 3.14  # Float
name = "Alice"  # String
is_valid = True  # Boolean


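As a small illustration of these variable types, the sketch below inspects each value with the built-in type() and converts between types. The names count_text, count, and ratio are made up for the example.

```python
x = 10            # Integer
y = 3.14          # Float
name = "Alice"    # String
is_valid = True   # Boolean

# type() reports the runtime type of each value
types = [type(v).__name__ for v in (x, y, name, is_valid)]
print(types)

# Explicit conversion between types (count_text is an illustrative value)
count_text = "42"
count = int(count_text)   # str -> int
ratio = float(x)          # int -> float
print(count + x, ratio)
```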

Lists: Basic operations for creating and manipulating Python lists.



my_list = [1, 2, 3]
my_list.append(4)  # Adds 4 to the end


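Beyond append, a few other list operations come up constantly. The sketch below shows slicing, negative indexing, and list comprehensions; the variable names are illustrative.

```python
my_list = [1, 2, 3]
my_list.append(4)                    # my_list is now [1, 2, 3, 4]

first_two = my_list[:2]              # Slicing: items at index 0 and 1
last_item = my_list[-1]              # Negative index counts from the end
squares = [n * n for n in my_list]   # List comprehension
evens = [n for n in my_list if n % 2 == 0]   # Comprehension with a filter

print(first_two, last_item, squares, evens)
```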

Dictionaries: How to create and use key-value pairs in Python dictionaries.



my_dict = {"key": "value", "name": "Alice"}


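To show how such a dictionary might be used, here is a minimal sketch covering lookup, safe access with a default, insertion, and iteration. The "age" and "city" keys are assumptions added for the example.

```python
my_dict = {"key": "value", "name": "Alice"}

name = my_dict["name"]                 # Direct lookup; raises KeyError if the key is missing
age = my_dict.get("age", "unknown")    # .get() returns a default instead of raising
my_dict["city"] = "Paris"              # Adds a new entry (or overwrites an existing one)

# .items() yields (key, value) pairs in insertion order
for key, value in my_dict.items():
    print(f"{key} -> {value}")
```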

Loops: Using for loops to iterate over a sequence of numbers.



for i in range(5):
    print(i)


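Loops over real data usually iterate over a collection rather than a bare range. The following sketch, using made-up names and scores, shows enumerate() for index/item pairs and zip() for walking two lists in parallel.

```python
names = ["Alice", "Bob", "Carol"]   # Illustrative data
scores = [90, 85, 77]

# enumerate() yields (index, item) pairs
for i, person in enumerate(names):
    print(i, person)

# zip() pairs up items from several sequences
total = 0
for person, score in zip(names, scores):
    total += score

print(total)
```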

- NumPy

Importing NumPy: How to import the NumPy library for numerical operations.



import numpy as np



Creating Arrays: Creating a basic NumPy array.



a = np.array([1, 2, 3])


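Arrays can also be built with helper functions rather than from a Python list. This sketch shows a few common constructors and the shape attribute; the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
zeros = np.zeros((2, 3))             # 2x3 array filled with 0.0
steps = np.arange(0, 10, 2)          # Values 0, 2, 4, 6, 8
grid = np.array([[1, 2], [3, 4]])    # 2-D array from nested lists

print(a.shape, zeros.shape, grid.shape)
```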

Basic Operations: Performing element-wise addition and subtraction.



b = np.array([4, 5, 6])  # Second array with the same shape as a (example values)
a + b  # Element-wise addition
a - b  # Element-wise subtraction


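A self-contained version of these operations might look like the sketch below. The values in b are assumed for the example. It also contrasts element-wise multiplication with the dot product, which is a frequent source of confusion.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])   # Assumed second array, same shape as a

total = a + b      # Element-wise addition
diff = a - b       # Element-wise subtraction
product = a * b    # Element-wise multiplication, not a dot product
scaled = a * 2     # A scalar is broadcast across every element
dot = a @ b        # Dot product: 1*4 + 2*5 + 3*6

print(total, diff, product, scaled, dot)
```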

- Pandas

Importing Pandas: How to import the Pandas library for data manipulation.



import pandas as pd



Creating DataFrame: Creating a simple Pandas DataFrame.



df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})


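Once a DataFrame exists, the usual next steps are selecting columns, filtering rows, and deriving new columns. A minimal sketch follows; the "total" column is an illustrative addition.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

col = df["col1"]                         # Select one column as a Series
filtered = df[df["col2"] > 3]            # Boolean mask keeps rows where col2 > 3
df["total"] = df["col1"] + df["col2"]    # Derive a new column from existing ones

print(df)
```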

Reading CSV: Reading data from a CSV file into a DataFrame.



df = pd.read_csv("file.csv")


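To try read_csv without a file on disk, the text can be wrapped in io.StringIO, which pandas accepts in place of a path. The CSV contents below are made up for the example. The usecols parameter loads only the named columns.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (illustrative data)
csv_text = "id,city,temp\n1,Oslo,4.5\n2,Cairo,30.1\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(list(df.columns))

# usecols restricts loading to the listed columns
temps = pd.read_csv(io.StringIO(csv_text), usecols=["city", "temp"])
print(temps)
```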

Basic Operations: Viewing the first 5 rows and summary statistics of a DataFrame.



df.head()  # First 5 rows
df.describe()  # Summary statistics


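Both methods return new objects that can be inspected further. In this sketch, built on a small made-up DataFrame, head() takes an explicit row count and one statistic is pulled out of the describe() result.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({"col1": [1, 2, 3, 4, 5, 6], "col2": [10, 20, 30, 40, 50, 60]})

top3 = df.head(3)                        # head(n) returns the first n rows
stats = df.describe()                    # One row per statistic, one column per numeric column
mean_col1 = stats.loc["mean", "col1"]    # Look up a single statistic

print(top3)
print(stats)
```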

- Matplotlib

Importing Matplotlib: How to import the Matplotlib library for plotting.



import matplotlib.pyplot as plt



Basic Plotting: Creating a simple line plot.



plt.plot([1, 2, 3], [4, 5, 6])
plt.show()


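A plot is usually more useful with labels and a title. The sketch below uses the object-oriented interface (fig, ax) and saves to a file instead of opening a window. The Agg backend choice, the labels, and the file name line_plot.png are assumptions for the example.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6], marker="o", label="series A")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Simple line plot")
ax.legend()

fig.savefig("line_plot.png")   # Hypothetical output file name
plt.close(fig)                 # Free the figure's memory
```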

- Scikit-Learn

Importing Scikit-Learn: How to import the Scikit-Learn library for machine learning.



from sklearn.linear_model import LinearRegression



Fitting a Model: Training a linear regression model.



# X_train and y_train are assumed to hold the training features and targets
model = LinearRegression()
model.fit(X_train, y_train)



Making Predictions: Using the trained model to make predictions.



# X_test is assumed to hold features with the same columns as X_train
predictions = model.predict(X_test)



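The fit and predict steps rely on X_train, y_train, and X_test already existing. One possible end-to-end sketch is shown here, using synthetic data that follows y = 2x + 1 and scikit-learn's train_test_split. The data, the 25% test size, and the random seed are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data that follows y = 2x + 1 exactly (illustrative)
X = np.arange(20).reshape(-1, 1)   # Features must be 2-D: (n_samples, n_features)
y = 2 * X.ravel() + 1

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(model.coef_, model.intercept_)   # Learned slope and intercept
print(model.score(X_test, y_test))     # R^2 on the held-out data
```

Because the synthetic data is noise-free, the learned slope and intercept should come out very close to 2 and 1; real data will not fit this cleanly.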

- SQL Operations with Python

Using SQLite: How to connect to an SQLite database and execute a SQL query.



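import sqlite3

conn = sqlite3.connect("database.db")  # Opens the file, creating it if it does not exist
cursor = conn.cursor()
cursor.execute("SELECT * FROM table_name")  # table_name is a placeholder
rows = cursor.fetchall()  # Retrieve the query results as a list of tuples
conn.close()  # Release the connection when finished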


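For a version that runs without an existing database file, the sketch below uses an in-memory SQLite database. The users table and its columns are hypothetical names chosen for the example. It also shows ? placeholders, which pass values to the query separately from the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # In-memory database, nothing is written to disk
cursor = conn.cursor()

# "users" and its columns are hypothetical names for this example
cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cursor.executemany("INSERT INTO users (name) VALUES (?)", [("Alice",), ("Bob",)])
conn.commit()

# The ? placeholder keeps the value out of the SQL string itself
cursor.execute("SELECT id, name FROM users WHERE name = ?", ("Alice",))
rows = cursor.fetchall()
print(rows)

conn.close()
```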

- Data Cleaning with Python

Handling Missing Values: Dropping or filling missing values in a DataFrame.



df.dropna()  # Drop rows that contain missing values (returns a new DataFrame)
df.fillna(0)  # Fill missing values with 0 (returns a new DataFrame)


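Before dropping or filling, it often helps to count what is missing. The sketch below works on a small made-up DataFrame: it counts missing values per column, drops incomplete rows, and fills each column with its own replacement value. The column names and fill values are assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in each column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

missing_per_column = df.isna().sum()    # Number of missing values in each column
dropped = df.dropna()                   # Keeps only rows with no missing values
filled = df.fillna({"a": df["a"].mean(), "b": "unknown"})   # Per-column fill values

print(missing_per_column)
print(dropped)
print(filled)
```

Note that dropna and fillna leave df itself unchanged; the results have to be assigned to a name to be kept.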

Type Conversion: Converting the data type of a DataFrame column.



df['column'] = df['column'].astype('int')  # Convert to integer and assign the result back


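astype raises an error when a value cannot be converted, which is common in raw data. One way to handle that, sketched here with made-up values, is pd.to_numeric with errors="coerce", which turns unparseable entries into NaN so they can be inspected or cleaned first.

```python
import pandas as pd

df = pd.DataFrame({"column": ["1", "2", "oops"]})   # Illustrative data with one bad value

# errors="coerce" converts unparseable values to NaN instead of raising
numeric = pd.to_numeric(df["column"], errors="coerce")
n_bad = int(numeric.isna().sum())    # Number of values that failed to parse

# astype works directly once every value is clean
clean = pd.Series(["1", "2", "3"]).astype("int64")

print(numeric)
print(n_bad, clean.tolist())
```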

Note: This is not a complete cheat sheet. Data science is a vast field, and covering everything in one article is not practical. If I missed something important, please let me know in the comments.

Check out my other two articles on Vector Database and LangChain.
