AWS Data Engineering + ML Pipeline

IMDb Movie Analytics
Architecture

12.3M records → 309K clean movies → XGBoost R²=0.664 → Live predictions

Raw Records: 12.3M
Clean Movies: 309K
ML Features: 43
Model R²: 0.664
RMSE: 0.786
Total Cost: ~$1
Pipeline Architecture — 10 Phases
🗄️
Phase 1
Data Lake
AWS S3
3 buckets
Raw/Processed/Features
Bronze-Silver-Gold
⚙️
Phase 2
ETL Pipeline
AWS Glue
PySpark
12.3M → 309K
Parquet by decade
🔍
Phase 3
Warehouse
AWS Athena
Serverless SQL
6 analytical views
Redshift workaround
🔧
Phase 4-5
ML Features
Glue + XGBoost
43 features
Director reputation
500MB in cloud
🤖
Phase 5-6
Train + Deploy
SageMaker
ml.m5.xlarge
RMSE=0.786
<100ms latency
📊
Phase 7-8
Dashboards
Grafana + CW
5 dashboards
CloudWatch alarms
SNS alerts
Phase 9
Automation
Lambda + EB
Daily 2am UTC
~16 min total
Zero intervention
📈
Model Iterations
R² improvement across versions
v1: 0.306 (baseline)
v2: 0.306 (+interact)
v3: 0.462 (50k filter)
v4: 0.664 (+director)
KEY INSIGHT
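Director reputation was the missing signal. Adding 6 director features raised R² from 0.462 (v3) to 0.664 (v4). Across all four versions, R² rose from 0.306 to 0.664, a 117% relative improvement.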
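The page does not list the six director features, so the following is only a sketch of one plausible director-reputation feature: the mean rating of a director's other films. The record keys (`title`, `director`, `rating`) and the leave-one-out choice are assumptions for illustration, not the project's actual feature code.

```python
from collections import defaultdict


def director_mean_rating(movies):
    """Leave-one-out mean rating of each movie's director.

    `movies` is a list of dicts with hypothetical keys
    'title', 'director', and 'rating'.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for m in movies:
        totals[m["director"]] += m["rating"]
        counts[m["director"]] += 1

    features = {}
    for m in movies:
        d = m["director"]
        others = counts[d] - 1
        # Exclude the movie's own rating so the feature does not leak the target.
        features[m["title"]] = (
            (totals[d] - m["rating"]) / others if others > 0 else None
        )
    return features


# Made-up sample data, not taken from the IMDb dataset.
sample = [
    {"title": "A", "director": "d1", "rating": 8.0},
    {"title": "B", "director": "d1", "rating": 6.0},
    {"title": "C", "director": "d2", "rating": 7.0},
]
loo = director_mean_rating(sample)
```

Directors with a single film get `None` here; a real pipeline would need a fallback such as a global mean, which is a design choice this sketch leaves open.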
🎯
Live Endpoint Predictions
Actual vs predicted ratings
Film: actual / predicted / error
Schindler's List: 9.0 / 9.11 / ±0.11
The Notebook: 7.8 / 7.88 / ±0.08
The Dark Knight: 9.0 / 8.58 / ±0.42
Transformers: 7.0 / 7.33 / ±0.33
Paranormal Activity: 6.3 / 5.42 / ±0.88
Indie Drama 2023: N/A / 5.60 / new
Avg error: 0.40
Latency: <100ms
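As a check on the average error, here is a minimal sketch that computes the mean absolute error over the five films that have an actual rating. It gives about 0.36 for these rows, so the displayed 0.40 is presumably rounded up or computed over a different sample; that reading is an assumption.

```python
# (film, actual rating, predicted rating) as listed on the card.
predictions = [
    ("Schindler's List", 9.0, 9.11),
    ("The Notebook", 7.8, 7.88),
    ("The Dark Knight", 9.0, 8.58),
    ("Transformers", 7.0, 7.33),
    ("Paranormal Activity", 6.3, 5.42),
]


def mean_absolute_error(rows):
    """Average of |actual - predicted| over all rows."""
    return sum(abs(actual - pred) for _, actual, pred in rows) / len(rows)


mae = mean_absolute_error(predictions)
print(f"MAE over {len(predictions)} films: {mae:.3f}")
```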
☁️
AWS Services Used
Full stack breakdown
S3: Data lake
Glue: ETL / PySpark
Athena: SQL warehouse
SageMaker: Train + serve
Lambda: Orchestrator
EventBridge: Scheduler
CloudWatch: Monitoring
SNS: Alerts
Pipeline Run Duration
ETL: 193s
Features: 60s
Director: 671s
Total: ~16m
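The stage timings can be summed directly. The three listed stages add up to 924 seconds (15.4 minutes); the remainder of the ~16 minute total is presumably orchestration overhead, which is an assumption.

```python
# Stage durations in seconds, as listed above.
stage_seconds = {"ETL": 193, "Features": 60, "Director": 671}

total_seconds = sum(stage_seconds.values())
total_minutes = total_seconds / 60
print(f"{total_seconds}s = {total_minutes:.1f} min")
```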
Student Account Workarounds — Engineering Judgment
⛔ Blocked
Amazon Redshift
OptInRequired error on student account.
Replaced with Athena — same SQL, serverless, cheaper for intermittent queries. DDL preserved for migration.
⛔ Blocked
Amazon QuickSight
Credit card required even for free trial.
Replaced with Grafana Cloud — free tier, public URLs, more relevant for DE roles.
⚠️ Bottleneck
500MB Local Download
title.principals.tsv.gz took 15+ min to download every run.
Moved processing to Glue — cloud processes 500MB, downloads only 17MB output.
Impact — What This Project Demonstrates
117%
Model Improvement via Domain Knowledge
Iterated through 4 model versions. Interaction features (v2) left R² unchanged at 0.306, while director reputation turned out to be the missing signal. Overall, R² went from 0.306 to 0.664.
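The 117% figure is a relative improvement. A short sketch of the arithmetic, using the R² values from the model iterations:

```python
def relative_improvement(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


r2 = {"v1": 0.306, "v2": 0.306, "v3": 0.462, "v4": 0.664}

overall = relative_improvement(r2["v1"], r2["v4"])        # v1 -> v4
director_step = relative_improvement(r2["v3"], r2["v4"])  # v3 -> v4 only
print(f"overall: {overall:.0f}%, director step: {director_step:.0f}%")
```

The overall change rounds to 117%, while the v3 to v4 step alone is about 44%.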
97.5%
Data Reduction with Quality Preservation
Filtered 12.3M raw records down to 309K clean movies — 97.5% reduction — while maintaining 99%+ completeness on all critical fields through quality thresholds.
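The reduction percentage follows from the two rounded record counts:

```python
raw_records = 12_300_000   # 12.3M, rounded as reported
clean_movies = 309_000     # 309K, rounded as reported

reduction_pct = (1 - clean_movies / raw_records) * 100
print(f"{reduction_pct:.1f}% reduction")
```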
0
Manual Interventions After Deployment
EventBridge + Lambda orchestration runs the full 16-minute pipeline daily at 2am UTC with automatic SNS failure alerts — zero human involvement required.
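A minimal sketch of how such an orchestrator could look, assuming a Lambda handler that starts a Glue job and publishes to SNS if the start fails. The job name and topic ARN are hypothetical placeholders, the clients are injectable so the function can be exercised without AWS access, and waiting for job completion is left out.

```python
# EventBridge cron fields: minutes hours day-of-month month day-of-week year.
SCHEDULE_EXPRESSION = "cron(0 2 * * ? *)"  # every day at 02:00 UTC

# Hypothetical names, for illustration only.
JOB_NAME = "imdb-etl"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"


def handler(event, context, glue=None, sns=None):
    """Start the Glue job; publish an SNS alert if starting it fails."""
    if glue is None or sns is None:
        import boto3  # imported lazily so the sketch runs without boto3 in tests

        glue = glue or boto3.client("glue")
        sns = sns or boto3.client("sns")
    try:
        run = glue.start_job_run(JobName=JOB_NAME)
    except Exception as exc:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Pipeline failure",
            Message=f"Could not start {JOB_NAME}: {exc}",
        )
        raise
    return {"job": JOB_NAME, "run_id": run["JobRunId"]}
```

The schedule expression would be attached to an EventBridge rule that targets this function.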
10x
Storage Compression via Parquet
Converting raw TSV files to Parquet with decade partitioning achieves 10x compression. Athena queries scan only relevant partitions, reducing cost and latency.
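Decade partitioning can be sketched as deriving a partition key from the release year. The Hive-style `decade=1990` path layout and the bucket name below are assumptions; the page does not show the actual layout.

```python
def decade_of(year):
    """Map a release year to the first year of its decade, e.g. 1994 -> 1990."""
    return (year // 10) * 10


def partition_path(base, year):
    """Build a Hive-style partition prefix for the given release year."""
    return f"{base}/decade={decade_of(year)}/"


# Hypothetical bucket and prefix, for illustration only.
example = partition_path("s3://example-processed-bucket/movies", 2008)
print(example)
```

With a layout like this, a query filtered on `decade` only needs to read the matching prefixes.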
±0.11
Best Prediction Error on Famous Films
Live SageMaker endpoint predicted Schindler's List at 9.11 vs 9.0 actual, and The Notebook at 7.88 vs 7.8. Errors on the other sampled films were larger, up to 0.88 for Paranormal Activity.
~$1
Full Pipeline Cost on AWS
The entire project (ETL, ML training, endpoint deployment, monitoring) cost about $1 total using a serverless-first architecture (Athena, Lambda, Glue on-demand).