I just updated the CatInCloud Labs project page with a major expansion: The "Core & Chaos" portfolio is live.
I'm now modeling options chains for the entire Magnificent Seven (NVDA, AAPL, MSFT...), the Macro Heavyweights (SPY, QQQ, IWM, TLT), plus the Chaos Engines (GME, VIX, IBIT, and RDDT).
When market data needs to cover the most traded ETFs, the biggest tech giants, and the most volatile meme assets simultaneously, a fragile data pipeline won't cut it. This brings me back to the core architectural decision for this project: Why choose a "boring" stack like AWS MWAA + Snowflake + dbt?
The Problem: Scale meets Chaos
Market data is a unique engineering challenge because it combines massive daily volume with explosive volatility. SPY churns out millions of rows each day. GME will be dormant for months, then burst with 10x volume in a single afternoon.
To handle this, I needed a stack that provided the ultimate engineering luxury: predictability.
1. Orchestration: Why AWS MWAA?
A simple cron job is too fragile to handle complex dependencies across 15 tickers. If the raw data for NVDA lands late, I need the transformation layer to wait, not fail silently.
On the other hand, managing a Kubernetes cluster just to run Airflow adds a ton of operational overhead. I chose Amazon Managed Workflows for Apache Airflow (MWAA) because it decouples infrastructure maintenance from business logic:
- Managed Environment: AWS handles the scaling of the scheduler and web server. I don't patch servers. I just deploy DAGs.
- Complex Dependencies: Airflow lets me define strict sensors and triggers, so the pipeline knows exactly what data is required before it attempts a run (see the sketch after this list).
- Security: It runs entirely inside a private VPC, keeping the control plane isolated from the public internet.
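To make the "wait, not fail silently" behavior concrete, here is a minimal sketch of a sensor-gated DAG. It assumes an Airflow 2.x-style environment with the Amazon provider installed; the bucket, key pattern, and task names are placeholders rather than the project's real configuration, and import paths shift slightly between Airflow versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="nvda_options_chain",      # hypothetical DAG name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Block the run until the raw file for this logical date actually exists,
    # instead of letting downstream transformations run against missing data.
    wait_for_raw_nvda = S3KeySensor(
        task_id="wait_for_raw_nvda",
        bucket_name="example-raw-market-data",          # placeholder bucket
        bucket_key="options/NVDA/{{ ds }}/*.parquet",   # keyed on the run date
        wildcard_match=True,
        poke_interval=300,        # re-check every 5 minutes
        timeout=4 * 60 * 60,      # fail loudly after 4 hours, never silently
        mode="reschedule",        # release the worker slot between checks
    )

    # Stand-in for the real dbt / Snowflake transformation step.
    run_transformations = EmptyOperator(task_id="run_transformations")

    wait_for_raw_nvda >> run_transformations
```

The reschedule mode matters in a managed environment like MWAA: a sensor that sleeps on a worker slot for hours is exactly the kind of quiet resource leak this stack is meant to avoid.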
2. Compute: Why Snowflake?
When you are ingesting full options chains, the row counts get heavy. Fast. Managing compute for a local ingest or an EC2 instance is a constant balancing act between over-provisioning (wasting money) and under-provisioning (crashing pipelines).
Snowflake solves this via the separation of storage and compute. I use a dedicated virtual warehouse for ingestion that scales up instantly to chew through the massive SPY/NVDA files, then immediately spins back down. It doesn't matter if GME has 1 record or 1 million records on a given day; Snowflake scales to fit the load without human intervention, and I only pay for the seconds the warehouse is running.
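As a rough sketch of what that looks like in practice (the warehouse name, size, and credential environment variables are all illustrative, not the project's actual setup), the ingestion warehouse boils down to a few auto-suspend/auto-resume settings:

```python
import os

import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)

# A dedicated ingestion warehouse: it resumes on demand when a load starts and
# suspends itself after 60 idle seconds, so compute is billed only while running.
conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS INGEST_WH
      WAREHOUSE_SIZE = 'MEDIUM'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE
      INITIALLY_SUSPENDED = TRUE
""")
```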
3. The Engineering Superpower: Idempotent by Design
In data engineering, things break. APIs time out, files arrive corrupted, or business logic changes mid-flight. A calm pipeline is built to handle the break.
In a fragile system, a failure usually requires manual cleanup: "Did step 3 finish? Do I need to delete the partial rows for today? If I re-run this, will I get duplicates?"
I designed this stack around idempotency down to the DAG level. This means I can clear any Airflow task, re-run it, and be certain the result will be exactly the same as if it had run perfectly the first time.
How this works in practice:
- Airflow Tasks: Every task is templated using the execution date. When the ingestion task runs for 2025-12-22, it doesn't just append data. It explicitly targets the partition for that specific date. If I re-run the task, it overwrites that specific slice of time (sketched after this list). It is impossible to accidentally double-load data by clicking "clear" in the UI.
- dbt Transformations: dbt (data build tool) is the other half of this equation. Because dbt models are declarative (SELECT statements) rather than imperative (INSERT/UPDATE scripts), dbt handles the complexity of replacing tables or cleanly merging incremental data.
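Here is a minimal sketch of that delete-then-reload pattern on the Airflow side, assuming the common SQL and Snowflake providers, an Airflow connection named snowflake_default, and a hypothetical RAW.OPTIONS_CHAIN table fed from an external stage. None of these object names come from the real project.

```python
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# Both statements are templated with the run's logical date ({{ ds }}), so
# re-running the task for the same date always converges to the same end state.
load_options_partition = SQLExecuteQueryOperator(
    task_id="load_options_partition",
    conn_id="snowflake_default",  # placeholder Snowflake connection
    sql=[
        # 1. Wipe exactly this date's slice, and nothing else.
        "DELETE FROM RAW.OPTIONS_CHAIN WHERE TRADE_DATE = '{{ ds }}'",
        # 2. Reload that slice from the matching stage path. FORCE = TRUE is
        #    what makes a re-run reload files Snowflake has already seen.
        """
        COPY INTO RAW.OPTIONS_CHAIN
        FROM @RAW.OPTIONS_STAGE/{{ ds }}/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        FORCE = TRUE
        """,
    ],
)
```

An incremental dbt model reaches the same end state declaratively, typically through an incremental materialization with a unique key, so the merge logic lives in configuration rather than hand-written scripts.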
This design choice drastically reduces the Mean Time To Recovery. If a pipeline fails at 3:00 AM, I don't have to perform surgery on the database. I fix the upstream issue, clear the DAG, and go back to sleep. The system heals itself.
Looking Ahead: What happens when we need it faster?
The most common critique of this batch-based architecture is latency. "What if a client needs real-time Gammas?"
If that requirement comes, a reactive architect might tear this all down to build a complex streaming solution. A calm architect evolves the foundation:
- Ingest: Swap S3 batch loads for Snowpipe Streaming.
- Transform: Swap hourly dbt jobs for Snowflake Dynamic Tables (sketched below).
- Result: Latency drops from 24 hours to 60 seconds, using largely the same SQL semantics and business rules.
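For illustration, the Transform swap could look roughly like this: the same SELECT logic a dbt model would own, re-declared as a Snowflake Dynamic Table with a one-minute target lag. The table, warehouse, and column names are placeholders.

```python
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)

# Snowflake keeps this table refreshed to (roughly) the declared target lag,
# replacing the hourly dbt run without rewriting the underlying SELECT.
conn.cursor().execute("""
    CREATE OR REPLACE DYNAMIC TABLE ANALYTICS.OPTIONS_CHAIN_LIVE
      TARGET_LAG = '1 minute'
      WAREHOUSE = TRANSFORM_WH
    AS
    SELECT ticker, trade_date, expiry, strike, implied_vol, gamma
    FROM RAW.OPTIONS_CHAIN
""")
```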
Calm engineering isn't about refusing to scale. It's about building a foundation strong enough to hold the entire market today, but flexible enough to evolve tomorrow.
Check out the full 15-ticker dashboard live on the site: catincloudlabs.com/projects
Engineering Note: MWAA Dependency Hell
Speaking of "dependency hell": if anyone else is currently fighting with Python library conflicts in MWAA, I'm happy to save you a weekend of debugging.
Here is the constraints file I built to keep the environment stable (MWAA 3.0.6 / Python 3.12). It took two days of trial and error to get this right. Feel free to steal it.
Download Constraints File (MWAA 3.0.6 / Py3.12)