Securing AWS Glue: A Guide to Identifying and Fixing Python Package Vulnerabilities
Introduction
Did you know that the default Python packages in AWS Glue contain a number of known vulnerabilities? While instances, containers, and Lambda functions are often scanned by tools like Amazon Inspector, Trivy, and Snyk, data pipelines are frequently overlooked. Whether by accident or design, many data pipelines, often laden with Python code, interact with external systems and APIs to ingest data. As such, securing these pipelines is just as important as securing any other part of your infrastructure.
In this post, I’ll walk you through how to enhance the security of your AWS Glue data pipelines. The first issue I encountered is that these pipelines often combine system and runtime dependencies with application code. AWS Glue and Apache Airflow both provide Python environments with pre-installed packages, along with the option to add custom ones.
AWS Glue
For this post, I'll focus specifically on the AWS Glue environment.
AWS Glue allows you to create three types of jobs:
- Glue ETL (PySpark): Glue ETL Python Libraries
- Python Shell: Python Shell Jobs in AWS Glue
- Ray (not supported)
A Glue ETL job spins up an on-demand Spark environment, while a Python Shell job is more akin to a Lambda function: it doesn’t have Lambda's 15-minute time limit, but it does have limited capacity.
Exporting System Requirements
While browsing the Glue documentation, I came across tables listing the pre-installed Python packages. I wrote a small program to parse these tables and export them to a requirements.txt file.
For a Python Shell job using Python 3.9, this is the output:
awscli==1.23.5
botocore==1.23.5
For Python Shell jobs, there’s also an option to set the library-set to analytics, which provides a set of commonly used packages, including the useful AWS SDK for pandas. However, note that the version included is fairly outdated:
avro==1.11.0
awscli==1.23.5
awswrangler==2.15.1
botocore==1.24.21
boto3==1.21.21
elasticsearch==8.2.0
numpy==1.22.3
pandas==1.4.2
psycopg2==2.9.3
pyathena==2.5.3
PyMySQL==1.0.2
pyodbc==4.0.32
pyorc==0.6.0
redshift-connector==2.0.907
requests==2.27.1
scikit-learn==1.0.2
scipy==1.8.0
SQLAlchemy==1.4.36
s3fs==2022.3.0
Now we have the system dependencies in a workable format.
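The export step described above can be sketched roughly as follows. This is not the original program: it assumes the documentation tables arrive as plain HTML with the package name in the first cell and the version in the second, and the names `TableRows` and `table_to_requirements` are made up for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Header rows built from <th> cells stay empty and are skipped.
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def table_to_requirements(html):
    """Turn (name, version) table rows into sorted 'name==version' lines."""
    parser = TableRows()
    parser.feed(html)
    pins = {
        row[0]: row[1]
        for row in parser.rows
        if len(row) >= 2 and row[0] and row[1]
    }
    return [
        f"{name}=={version}"
        for name, version in sorted(pins.items(), key=lambda item: item[0].lower())
    ]
```

Writing the returned lines to a file, one per line, gives a requirements.txt that pip and most scanners accept.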
Run-Time Dependencies
AWS Glue also allows you to install additional packages at runtime using pip. You can extend or override the pre-installed Python packages as needed.
For more details, check the official AWS Glue Programming Python Libraries documentation.
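As a minimal sketch of what such a runtime override can look like: to my understanding of the Glue documentation, extra modules are passed through the `--additional-python-modules` job parameter as a comma-separated list of pip requirement specifiers. The helper name and the pins below are illustrative assumptions, not values from this post's tooling.

```python
def additional_modules_argument(pins):
    """Build the Glue job argument that asks pip to install the given pins.

    `pins` maps package names to exact versions; the result can be merged
    into a job's default arguments.
    """
    modules = ",".join(
        f"{name}=={version}" for name, version in sorted(pins.items())
    )
    return {"--additional-python-modules": modules}


# Hypothetical pins, for illustration only.
job_arguments = additional_modules_argument(
    {"urllib3": "1.26.19", "idna": "3.7"}
)
print(job_arguments)
```

The resulting dictionary would typically be supplied as the job's default arguments, for example through the console or the `DefaultArguments` field when creating the job with an SDK.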
Glue Inspector
With the above information, I created a tool called Glue Inspector. It downloads the AWS system dependencies, caches them locally, and then retrieves runtime dependencies. These are merged into a list and exported as a CycloneDX Software Bill of Materials (SBOM) in JSON format.
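The merge-and-export idea can be illustrated with a short sketch. This is not Glue Inspector's actual code: the function names are hypothetical, the name normalization follows the usual PyPI convention, and only a minimal subset of the CycloneDX JSON fields is emitted.

```python
import json
import re


def normalize(name):
    """Normalize a package name the way PyPI does (lowercase, '-' separators)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def merge_dependencies(system, runtime):
    """Combine both sets of pins; a runtime pin overrides the system one."""
    merged = {normalize(name): version for name, version in system.items()}
    merged.update({normalize(name): version for name, version in runtime.items()})
    return merged


def to_cyclonedx(packages):
    """Wrap the merged pins in a minimal CycloneDX JSON document."""
    components = [
        {
            "type": "library",
            "name": name,
            "version": version,
            "purl": f"pkg:pypi/{name}@{version}",
        }
        for name, version in sorted(packages.items())
    ]
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "version": 1,
        "components": components,
    }


# Illustrative input: one system pin is overridden at runtime.
sbom = to_cyclonedx(
    merge_dependencies({"urllib3": "1.25.10", "PyMySQL": "1.0.2"}, {"urllib3": "1.26.19"})
)
print(json.dumps(sbom, indent=2))
```

Letting the runtime pin win mirrors the override behaviour described earlier, so the SBOM reflects what the job would actually import.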
To use it:
- Set your AWS credentials in the environment.
- Run the following command to inspect a Glue job:
glue-inspector inspect mygluejob --output mygluejob-sbom.json
You can then use the resulting SBOM to manage the software supply chain with tools like DependencyTrack, or scan for vulnerabilities using tools like Trivy:
trivy sbom mygluejob-sbom.json --scanners vuln,license --list-all-pkgs -d --format cyclonedx --output mygluejob-sbom-trivy.json
I’ve just released version 0.2.0 of Glue Inspector.
AWS Vulnerabilities in Glue
While working on this tool, I was surprised by the number of critical and high-severity vulnerabilities present in the default packages. I filed a report with AWS Security, and after weeks of waiting, I was told that the runtime is isolated and therefore not considered an AWS system issue. However, users are encouraged to update their packages as needed.
I believe more awareness is needed in this area.
Glue Runtime Vulnerabilities
Here’s an overview of vulnerabilities in the Glue runtimes:
Runtime | Critical | High | Medium | Low |
---|---|---|---|---|
glueetl-2.0 | 5 | 12 | 12 | 1 |
glueetl-3.0 | 4 | 16 | 20 | 2 |
glueetl-4.0 | 4 | 14 | 18 | 2 |
glueetl-5.0 | 0 | 6 | 11 | 3 |
pythonshell-3.6 | 1 | 1 | 6 | 0 |
pythonshell-3.9 | 0 | 0 | 0 | 0 |
pythonshell-3.9-analytics | 1 | 1 | 3 | 0 |
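Per-runtime counts like those in the table above can be tallied from a scan report. The sketch below assumes the report is CycloneDX JSON with a top-level `vulnerabilities` array whose entries carry `ratings` with a `severity` field; each vulnerability is counted once, under its highest rating. The function name is hypothetical.

```python
from collections import Counter

# Highest severity first, so the first match is the worst rating.
SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def count_by_severity(bom):
    """Count vulnerabilities per severity, using each one's worst rating."""
    counts = Counter()
    for vulnerability in bom.get("vulnerabilities", []):
        severities = {
            (rating.get("severity") or "").lower()
            for rating in vulnerability.get("ratings", [])
        }
        for level in SEVERITY_ORDER:
            if level in severities:
                counts[level] += 1
                break
    return {level: counts[level] for level in SEVERITY_ORDER}
```

Vulnerabilities without any recognised rating are left out of the tally rather than guessed at.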
Vulnerabilities in AWS Glue 5.0 GlueETL
Here are the high-, medium-, and low-severity vulnerabilities reported for the newly released Glue ETL 5.0 runtime (the scan found no critical ones):
Package | Severity | Id | Installed Version | Fixed Version | Title |
---|---|---|---|---|---|
Pygments | MEDIUM | CVE-2022-40896 | 2.7.4 | 2.15.0 | pygments: ReDoS in pygments |
aiohttp | MEDIUM | CVE-2024-42367 | 3.10.1 | 3.10.2 | aiohttp: python-aiohttp: Compressed files as symlinks are not protected from path traversal |
aiohttp | MEDIUM | CVE-2024-52304 | 3.10.1 | 3.10.11 | aiohttp: aiohttp vulnerable to request smuggling due to incorrect parsing of chunk extensions |
cryptography | HIGH | CVE-2023-0286 | 36.0.1 | 39.0.1 | openssl: X.400 address type confusion in X.509 GeneralName |
cryptography | HIGH | CVE-2023-50782 | 36.0.1 | 42.0.0 | python-cryptography: Bleichenbacher timing oracle attack against RSA decryption - incomplete fix for CVE-2020-25659 |
cryptography | MEDIUM | CVE-2023-23931 | 36.0.1 | 39.0.1 | python-cryptography: memory corruption via immutable objects |
cryptography | MEDIUM | CVE-2023-49083 | 36.0.1 | 41.0.6 | python-cryptography: NULL-dereference when loading PKCS7 certificates |
cryptography | MEDIUM | CVE-2024-0727 | 36.0.1 | 42.0.2 | openssl: denial of service via null dereference |
cryptography | LOW | GHSA-5cpq-8wj7-hf2v | 36.0.1 | 41.0.0 | Vulnerable OpenSSL included in cryptography wheels |
cryptography | LOW | GHSA-jm77-qphf-c4w8 | 36.0.1 | 41.0.3 | pyca/cryptography's wheels include vulnerable OpenSSL |
cryptography | LOW | GHSA-v8gr-m533-ghj9 | 36.0.1 | 41.0.4 | Vulnerable OpenSSL included in cryptography wheels |
idna | MEDIUM | CVE-2024-3651 | 2.10 | 3.7 | python-idna: potential DoS via resource consumption via specially crafted inputs to idna.encode() |
pip | MEDIUM | CVE-2023-5752 | 21.3.1 | 23.3 | pip: Mercurial configuration injectable in repo revision when installing via pip |
pip | MEDIUM | CVE-2023-5752 | 22.3.1 | 23.3 | pip: Mercurial configuration injectable in repo revision when installing via pip |
setuptools | HIGH | CVE-2022-40897 | 59.6.0 | 65.5.1 | pypa-setuptools: Regular Expression Denial of Service (ReDoS) in package_index.py |
setuptools | HIGH | CVE-2024-6345 | 59.6.0 | 70.0.0 | pypa/setuptools: Remote code execution via download functions in the package_index module in pypa/setuptools |
urllib3 | HIGH | CVE-2021-33503 | 1.25.10 | 1.26.5 | python-urllib3: ReDoS in the parsing of authority part of URL |
urllib3 | HIGH | CVE-2023-43804 | 1.25.10 | 2.0.6, 1.26.17 | python-urllib3: Cookie request header isn't stripped during cross-origin redirects |
urllib3 | MEDIUM | CVE-2023-45803 | 1.25.10 | 2.0.7, 1.26.18 | urllib3: Request body not stripped after redirect from 303 status changes request method to GET |
urllib3 | MEDIUM | CVE-2024-37891 | 1.25.10 | 1.26.19, 2.2.2 | urllib3: proxy-authorization request header is not stripped during cross-origin redirects |
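One way to turn the "Fixed Version" column above into candidate pins is to take, per package, the highest fixed version listed across all of its findings. The sketch below does that under two assumptions: versions are purely numeric and dot-separated (as in this table), and a release at or above the highest listed fix also contains the earlier fixes. The function names are hypothetical, and the result still needs a compatibility check against your other dependencies.

```python
def parse_version(text):
    """Parse a numeric dotted version such as '1.26.19' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def highest_fixed_versions(findings):
    """Return, per package, the highest fixed version seen in any finding.

    `findings` is an iterable of (package, fixed_versions) pairs, where
    fixed_versions may list several releases, e.g. '2.0.6, 1.26.17'.
    """
    pins = {}
    for package, fixed in findings:
        for candidate in fixed.split(","):
            candidate = candidate.strip()
            if not candidate:
                continue
            current = pins.get(package)
            if current is None or parse_version(candidate) > parse_version(current):
                pins[package] = candidate
    return pins


# Rows taken from the table above.
rows = [
    ("urllib3", "1.26.5"),
    ("urllib3", "2.0.6, 1.26.17"),
    ("urllib3", "2.0.7, 1.26.18"),
    ("urllib3", "1.26.19, 2.2.2"),
    ("idna", "3.7"),
]
print(highest_fixed_versions(rows))
```

Pre-release or post-release version strings would need a proper version parser instead of the integer split used here.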
Mitigating Vulnerabilities
If your Glue jobs access external resources, be sure to update the required packages using the runtime installation option. However, this could lead to a "dependency hell" situation, so use your favorite tools or something like pur to help update the requirements.
Here’s an overview of some key packages that are outdated:
Updated aiobotocore: 2.13.1 -> 2.16.1
Updated aiohappyeyeballs: 2.3.5 -> 2.4.4
Updated aiohttp: 3.10.1 -> 3.11.11
Updated aioitertools: 0.11.0 -> 0.12.0
Updated aiosignal: 1.3.1 -> 1.3.2
Updated async-timeout: 4.0.3 -> 5.0.1
Updated attrs: 24.2.0 -> 24.3.0
Updated awscrt: 0.19.19 -> 0.23.6
Updated boto3: 1.34.131 -> 1.35.92
Updated botocore: 1.34.131 -> 1.35.92
Updated certifi: 2024.7.4 -> 2024.12.14
Updated cffi: 1.14.5 -> 1.17.1
Updated charset-normalizer: 3.3.2 -> 3.4.1
Updated colorama: 0.4.4 -> 0.4.6
Updated contourpy: 1.2.1 -> 1.3.1
Updated cryptography: 36.0.1 -> 44.0.0
Updated distlib: 0.3.1 -> 0.3.9
Updated distro: 1.5.0 -> 1.9.0
Updated docutils: 0.16 -> 0.21.2
Updated filelock: 3.0.12 -> 3.16.1
Updated fonttools: 4.53.1 -> 4.55.3
Updated frozenlist: 1.4.1 -> 1.5.0
Updated fsspec: 2024.6.1 -> 2024.12.0
Updated idna: 2.10 -> 3.10
Updated importlib_resources: 6.4.0 -> 6.5.2
Updated jmespath: 0.10.0 -> 1.0.1
Updated kiwisolver: 1.4.5 -> 1.4.8
Updated libcomps: 0.1.20 -> 0.1.21.post1
Updated matplotlib: 3.9.0 -> 3.10.0
Updated multidict: 6.0.5 -> 6.1.0
Updated numpy: 1.26.4 -> 2.2.1
Updated packaging: 24.1 -> 24.2
Updated pandas: 2.2.2 -> 2.2.3
Updated pillow: 10.4.0 -> 11.1.0
Updated pip: 21.3.1 -> 24.3.1
Updated pip: 22.3.1 -> 24.3.1
Updated plotly: 5.23.0 -> 5.24.1
Updated prompt-toolkit: 3.0.24 -> 3.0.48
Updated pyarrow: 17.0.0 -> 18.1.0
Updated pycparser: 2.20 -> 2.22
Updated Pygments: 2.7.4 -> 2.19.0
Updated pyparsing: 3.1.2 -> 3.2.1
Updated pytz: 2024.1 -> 2024.2
Updated requests: 2.32.2 -> 2.32.3
Updated ruamel.yaml: 0.16.6 -> 0.18.9
Updated ruamel.yaml.clib: 0.1.2 -> 0.2.12
Updated s3fs: 2024.6.1 -> 2024.12.0
Updated s3transfer: 0.10.2 -> 0.10.4
Updated setuptools: 59.6.0 -> 75.7.0
Updated six: 1.16.0 -> 1.17.0
Updated tzdata: 2024.1 -> 2024.2
Updated urllib3: 1.25.10 -> 2.3.0
Updated virtualenv: 20.4.0 -> 20.28.1
Updated wcwidth: 0.2.5 -> 0.2.13
Updated wrapt: 1.16.0 -> 1.17.0
Updated yarl: 1.9.4 -> 1.18.3
Updated zipp: 3.19.2 -> 3.21.0
Luckily, Glue 5.0 now supports the use of a requirements.txt file uploaded to S3, which can be parsed by pip.
This opens up the possibility of using local checks and tools like GitHub Dependabot to monitor your dependencies for vulnerabilities.
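A minimal sketch of the job arguments involved, based on my reading of the Glue 5.0 documentation: the installer option `-r` is combined with `--additional-python-modules` pointing at the uploaded file. Treat the parameter pairing as an assumption to verify against the current documentation; the helper name and the bucket path are hypothetical.

```python
def requirements_file_arguments(s3_uri):
    """Build job arguments that make pip install from a requirements.txt on S3."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI pointing at requirements.txt")
    return {
        # Passed through to pip, so the module value is read as a requirements file.
        "--python-modules-installer-option": "-r",
        "--additional-python-modules": s3_uri,
    }


# Hypothetical bucket and key, for illustration only.
print(requirements_file_arguments("s3://my-example-bucket/glue/requirements.txt"))
```

Keeping the same requirements.txt in version control is what lets local checks and dependency monitoring tools see exactly what the job installs.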
Conclusion
Data pipelines are applications and need to be treated with the same level of scrutiny as any other software. Managing their lifecycle is critical for security.
Be aware of vulnerabilities in default runtimes, whether using AWS Glue, Apache Airflow, or other similar tools.
Use Glue Inspector to scan your Glue jobs and generate an SBOM for better software supply chain management. SBOMs are becoming an industry standard, driven by regulations such as DORA and by U.S. government requirements for critical infrastructure.