Very often it makes sense to use a managed service instead of taking on the undifferentiated heavy lifting of building and maintaining infrastructure yourself. For me, managing Apache Airflow definitely falls into this category, and I often use AWS MWAA (Managed Workflows for Apache Airflow).
As many of you who have worked with Airflow already know, customizations, especially modifications to the Python environment, can be tricky and in some cases dangerous. This is mainly because Airflow itself is a complex Python application with its own environmental considerations and dependencies.
This is why I continue to campaign that folks keep their Airflow environment small and purposeful, and reduce customizations by using tools like the pod operator. I detail much of this in my article The Wrath of Unicron.
However, it's very difficult to stay completely vanilla 🍦, so here are a few tips for customizing the MWAA environment.
Tip 0: Use MWAA Local Runner
I won't go into great detail here, because the docs are quite good. But you should absolutely be developing and testing all changes with MWAA Local Runner. It's very close to the real thing, and you avoid waiting for changes to propagate in the actual MWAA environment (my one complaint: 20-40 minutes for an environment update is kinda crazy).
Tip 1: LOGGING!!
Before you start any customization, turn your logging up to 11.
You will need all the detailed log entries, especially for the Scheduler. If your MWAA environment is not recognizing your changes, or is getting stuck in the updating state (crash loop), check the requirements log entries.
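If you'd rather script this than click through the console, here is a minimal sketch using boto3's MWAA client. I'm assuming DEBUG is the level you want for every log category, and "my-mwaa-env" in the note below is a placeholder name, not a real environment. Treat it as a starting point rather than the one true way.

```python
# The five log categories MWAA exposes in its LoggingConfiguration.
LOG_CATEGORIES = (
    "DagProcessingLogs",
    "SchedulerLogs",
    "TaskLogs",
    "WebserverLogs",
    "WorkerLogs",
)


def build_logging_config(level="DEBUG"):
    """Return a LoggingConfiguration dict enabling every category at `level`."""
    return {name: {"Enabled": True, "LogLevel": level} for name in LOG_CATEGORIES}


def enable_verbose_logging(environment_name, level="DEBUG"):
    """Apply the logging config to an existing MWAA environment."""
    # Imported lazily so build_logging_config() works without boto3 installed.
    import boto3

    client = boto3.client("mwaa")
    return client.update_environment(
        Name=environment_name,
        LoggingConfiguration=build_logging_config(level),
    )
```

Calling `enable_verbose_logging("my-mwaa-env")` kicks off an environment update, so expect the usual wait. Once you are done debugging, run it again with a quieter level such as "WARNING".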
Tip 2: Constraints File
For the past several versions of Airflow, a public constraints file has been published and maintained. This constraints file protects Airflow's dependencies and makes sure that customizations do not break things.
⚠️ With MWAA, messing up dependencies can cause the aforementioned crash loop, which can often last for hours 😭.
A constraint statement pointing to this file must be referenced at the top of your requirements.txt and will look something like this.
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-{Airflow-version}/constraints-{Python-version}.txt"
DO NOT OMIT THE CONSTRAINT STATEMENT!!!
..but if you can't make the default file work, see Tip 3 below 😃
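To make the template above concrete, here is a small sketch that builds the constraint line for a given Airflow and Python version and checks that a requirements file starts with one. The function names are my own, and the versions in the usage note are only examples; use whatever your MWAA environment actually runs.

```python
CONSTRAINTS_URL = (
    "https://raw.githubusercontent.com/apache/airflow/"
    "constraints-{airflow}/constraints-{python}.txt"
)


def constraint_line(airflow_version, python_version):
    """Build the --constraint statement for requirements.txt."""
    url = CONSTRAINTS_URL.format(airflow=airflow_version, python=python_version)
    return f'--constraint "{url}"'


def starts_with_constraint(requirements_text):
    """Return True if the first non-blank, non-comment line is a constraint."""
    for raw in requirements_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        return line.startswith(("--constraint", "-c "))
    return False  # empty file, or comments only
```

For example, `constraint_line("2.7.2", "3.11")` yields a statement pointing at `constraints-2.7.2/constraints-3.11.txt`. A check like `starts_with_constraint(open("requirements.txt").read())` is cheap to run in CI before anything reaches MWAA.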
Tip 3: Unresolvable Conflicts
Unfortunately, not all Python packages are well maintained or have tight locking to upstream dependency versions. Over time, you can run into unresolvable conflicts between your packages and plugins and the constraints file.
My first recommendation: upgrade to the latest version of Airflow. It's very likely the problems you are experiencing are resolved there. If this is not an option, I suggest certifying and hosting your own version of the constraints file. This will involve tweaking the pinned package versions and making sure they are still compatible with Airflow.
This may not be a trivial process, but try your best not to comment out lines, and absolutely do not remove the constraint statement altogether.
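One way to keep a custom constraints file honest is to generate it from the upstream file and change only the pins you have to, rather than hand-editing or commenting lines out. Here is a rough sketch of that idea. The helper names are mine, and the package names in the usage note are made up. You would still need to test the result (Local Runner is great for this) and host the file somewhere your environment can reach.

```python
import re

# Matches a simple "name==version" pin, the form used in Airflow's constraints files.
PIN = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)==(\S+)$")


def normalize(name):
    """Normalize a package name so 'My_Pkg' and 'my-pkg' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def override_pins(constraints_text, overrides):
    """Return the constraints text with the selected pins replaced.

    `overrides` maps package name -> new version. Every other line is kept
    as-is. Raises KeyError if an override names a package that is not pinned.
    """
    wanted = {normalize(name): version for name, version in overrides.items()}
    seen = set()
    out = []
    for line in constraints_text.splitlines():
        match = PIN.match(line.strip())
        key = normalize(match.group(1)) if match else None
        if key in wanted:
            out.append(f"{match.group(1)}=={wanted[key]}")
            seen.add(key)
        else:
            out.append(line)  # comments, blanks, and untouched pins
    missing = sorted(set(wanted) - seen)
    if missing:
        raise KeyError(f"not pinned in the constraints file: {missing}")
    return "\n".join(out) + "\n"
```

Something like `override_pins(upstream_text, {"some-package": "1.2.0"})` gives you a file that differs from upstream only where you asked it to, which makes the diff easy to review the next time you upgrade Airflow.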
Tip 4: Troubleshooting Plugins
Maybe you think you did everything right (per the docs) and your MWAA environment is booting, but your plugins are not installing.
First, check your requirements log entries and see if anything failed during the install. Note that in most cases a failed requirements install rolls back all of the package installs, not just the offenders.
If you don't see your package being installed, make sure you referenced it correctly in your requirements file.
/usr/local/airflow/plugins/data_common_utils-0.2.8-py3-none-any.whl
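A typo in that path is easy to miss, so here is a tiny sketch that flags local wheel references that do not point into the plugins directory shown above. The function name is my own, and it deliberately ignores wheels referenced by URL, since those are a different case.

```python
PLUGINS_DIR = "/usr/local/airflow/plugins/"


def suspicious_wheel_lines(requirements_text):
    """Return local .whl lines that do not point into the MWAA plugins dir."""
    flagged = []
    for raw in requirements_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not line.endswith(".whl") or "://" in line:
            continue  # only local wheel paths are checked here
        if not line.startswith(PLUGINS_DIR):
            flagged.append(line)
    return flagged
```

An empty result only tells you the path looks right. It does not prove the wheel actually made it into the environment.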
If all this looks correct, try creating a simple "plugin finder" DAG, and make sure your plugin has been copied to the hosted environment.
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

dag = DAG(dag_id="plugin-finder", start_date=days_ago(1))

ls_airflow_plugins = BashOperator(
    task_id="ls_airflow_plugins",
    bash_command="ls -laR /usr/local/airflow/plugins",
    dag=dag,
    priority_weight=300,
)
Good Luck!
I hope you all find this helpful. Please comment with other helpful tips!