Thanks for the illuminating post.
I like how Apache Airflow is used to move the PySpark script to an S3 location so that the EMR step can read it.
I remember working on a project where we wanted to automate a data pipeline with Airflow and ran into the same problem: how to get our pipeline scripts to the right locations.
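For anyone facing the same problem, here is a rough sketch of the pattern as I understand it, not the post's actual code. The function names, bucket, and key are hypothetical. The S3 client is passed in as a parameter and is assumed to be a boto3 S3 client (only its `upload_file` method is used), and the step dictionary follows the usual EMR `command-runner.jar` layout for `spark-submit`.

```python
from typing import Any


def upload_script(s3_client: Any, local_path: str, bucket: str, key: str) -> str:
    """Upload a local script to S3 and return its s3:// URI.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


def spark_step(name: str, script_uri: str) -> dict:
    """Build an EMR step definition that runs the script with spark-submit."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }
```

In a DAG, `upload_script` could be called from a PythonOperator task, and the dictionary from `spark_step` could then go into the `steps` list of an `EmrAddStepsOperator`, so the upload always runs before the step that reads the script.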