Triggering Machine Learning Workflows with New Data in AWS S3

Imagine you run an e-commerce platform and want to make its recommendations more personal. Your customer data resides in Amazon S3.

To keep recommendations fresh, you plan to retrain the models on new customer interaction data whenever a file is added to S3. But how exactly do you approach this task?


Solutions

Two common solutions to this problem are:

  1. AWS Lambda: A serverless compute service by AWS, allowing code execution in response to events without managing servers.

  2. Open-source orchestrators: Tools automating, scheduling, and monitoring workflows and tasks, usually self-hosted.

Using an open-source orchestrator offers advantages over AWS Lambda:

  • Cost-Effectiveness: AWS Lambda bills per invocation and duration, and caps each execution at 15 minutes, which long-running training jobs can easily exceed. Open-source orchestrators run on your own infrastructure, potentially saving costs.

  • Faster Iteration: Developing and testing workflows locally speeds up the process, making it easier to debug and refine.

  • Environment Control: Full control over the execution environment allows you to customize your development tools and IDEs to match your preferences.

While you could solve this problem in Apache Airflow, it would require complex infrastructure and deployment setup. Thus, we’ll use Kestra, which offers an intuitive UI and can be launched in a single Docker command.

Feel free to play with and fork the source code for this article here:

GitHub - khuyentran1401/mlops-kestra-workflow


Workflow Summary

This workflow consists of two main components: Python scripts and orchestration.

Orchestration

  • Python scripts and flows are stored in Git, with changes synced to Kestra on a schedule.

  • When a new file appears under the “new” prefix of the S3 bucket, Kestra triggers a flow that runs a series of Python scripts.
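The steps above can be sketched as a single Kestra flow. This is a hedged illustration rather than the article’s exact flow: the flow ID, namespace, bucket name, and task IDs are assumptions, and the exact task and trigger type names (such as `io.kestra.plugin.scripts.python.Commands` and `io.kestra.plugin.aws.s3.Trigger`) depend on your Kestra version and installed plugins.

```yaml
id: retrain_on_new_data
namespace: company.mlops        # assumed namespace

tasks:
  - id: run_pipeline
    type: io.kestra.plugin.scripts.python.Commands
    namespaceFiles:
      enabled: true             # make the Git-synced files available to the task
    commands:
      - python src/download_files_from_s3.py
      - python src/merge_data.py
      - python src/process.py
      - python src/train.py

triggers:
  - id: watch_new_prefix
    type: io.kestra.plugin.aws.s3.Trigger
    interval: PT1M              # poll the bucket every minute
    bucket: your-bucket         # assumed bucket name
    prefix: "new/"
    action: MOVE                # move processed objects out of the "new" prefix
    moveTo:
      key: archive/
```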

Python Scripts

Since Kestra will execute the code it pulls from Git, make sure these Python scripts are committed to the repository:

git add .
git commit -m 'add python scripts'
git push origin main
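As an illustration of what one of these scripts might contain, here is a hedged sketch of `src/download_files_from_s3.py` using boto3. The bucket name, prefix, and destination directory are placeholders, not values from the article.

```python
# Hypothetical sketch of src/download_files_from_s3.py -- the bucket name,
# prefix, and destination directory are assumptions, not from the article.
from pathlib import Path


def local_path_for(key: str, dest_dir: str = "data/new") -> Path:
    """Map an S3 object key to a local file path, keeping only the file name."""
    return Path(dest_dir) / Path(key).name


def download_new_files(bucket: str = "your-bucket", prefix: str = "new/",
                       dest_dir: str = "data/new") -> list[Path]:
    """Download every object under `prefix` in `bucket` into `dest_dir`."""
    import boto3  # imported lazily so the pure helper above stays testable offline

    s3 = boto3.client("s3")
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    downloaded = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):  # skip "folder" placeholder objects
                continue
            target = local_path_for(obj["Key"], dest_dir)
            s3.download_file(bucket, obj["Key"], str(target))
            downloaded.append(target)
    return downloaded


if __name__ == "__main__":
    print(download_new_files())
```

Running it inside a Kestra task would pull every new object into a local directory for the downstream merge, process, and train steps.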

Orchestration

Start Kestra

Download the Docker Compose file by executing the following command:

curl -o docker-compose.yml \
https://raw.githubusercontent.com/kestra-io/kestra/develop/docker-compose.yml

Ensure that Docker is running. Then, start the Kestra server with the following command:

docker-compose up -d

Access the UI by opening the URL http://localhost:8080 in your browser.

Sync from Git

Since the Python scripts live in GitHub, we will use Git Sync to pull the latest code from GitHub into Kestra every minute. To set this up, create a file named “sync_from_git.yml” under the “_flows” directory.

.
├── _flows/
│   └── sync_from_git.yml
└── src/
    ├── download_files_from_s3.py
    ├── helpers.py
    ├── merge_data.py
    ├── process.py
    └── train.py
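As a hedged sketch, sync_from_git.yml might look like the following. The repository URL matches the repo linked above, but the namespace, branch, and secret names are placeholders, and the exact task and trigger type names (here the Git plugin’s `Sync` task and the core `Schedule` trigger) vary across Kestra versions.

```yaml
id: sync_from_git
namespace: company.mlops        # assumed namespace

tasks:
  - id: sync
    type: io.kestra.plugin.git.Sync
    url: https://github.com/khuyentran1401/mlops-kestra-workflow
    branch: main
    username: "{{ secret('GITHUB_USERNAME') }}"   # placeholder secret names
    password: "{{ secret('GITHUB_PAT') }}"

triggers:
  - id: every_minute
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "*/1 * * * *"         # pull from GitHub every minute
```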