117%
Model Improvement via Domain Knowledge
Iterated through four model versions and found that director reputation, rather than more feature engineering, was the missing signal. R² rose from 0.306 to 0.664, a 117% relative improvement.
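The headline percentage is the relative change in R², which can be checked in a few lines. This is only an arithmetic sketch: the function name is illustrative, and the two R² values are the ones reported above.

```python
def relative_improvement(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# R² values reported above for the starting and ending models.
baseline_r2 = 0.306
final_r2 = 0.664

improvement = relative_improvement(baseline_r2, final_r2)
print(f"{improvement:.0f}%")  # prints "117%"
```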
97.5%
Data Reduction with Quality Preservation
Filtered 12.3M raw records down to 309K clean movies, a 97.5% reduction, while quality thresholds kept completeness at 99%+ on all critical fields.
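As a rough illustration of the two numbers in this card, the sketch below computes the reduction percentage from the reported counts and measures per-field completeness on a toy sample. The field names and sample records are hypothetical and are not taken from the project's dataset.

```python
from typing import Iterable, Mapping


def reduction_percent(raw_count: int, kept_count: int) -> float:
    """Share of records removed by filtering, in percent."""
    return (1 - kept_count / raw_count) * 100


def completeness(records: Iterable[Mapping], field: str) -> float:
    """Fraction of records whose `field` is present and non-empty."""
    rows = list(records)
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


# Counts reported above.
print(f"{reduction_percent(12_300_000, 309_000):.1f}%")  # prints "97.5%"

# Hypothetical toy sample; the real critical fields are not listed in the source.
sample = [
    {"title": "A", "year": 1999},
    {"title": "B", "year": None},
]
print(completeness(sample, "year"))  # prints "0.5"
```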
0
Manual Interventions After Deployment
EventBridge and Lambda run the full 16-minute pipeline daily at 02:00 UTC, with automatic SNS alerts on failure. No human involvement is required.
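One way such a schedule and failure alert could be wired up is sketched below. It is not the project's actual code: `start_pipeline`, the `ALERT_TOPIC_ARN` environment variable, and the alert wording are all assumptions made for illustration. The cron expression uses the EventBridge six-field format, which is evaluated in UTC.

```python
import os

# EventBridge cron fields: minute hour day-of-month month day-of-week year.
# This expression fires every day at 02:00 UTC.
DAILY_2AM_UTC = "cron(0 2 * * ? *)"


def run_with_alert(run_pipeline, publish_alert):
    """Run the pipeline; on any failure, send an alert and re-raise."""
    try:
        return run_pipeline()
    except Exception as exc:
        publish_alert(f"Pipeline failed: {exc!r}")
        raise


def start_pipeline():
    """Hypothetical placeholder for the real ETL and training steps."""
    return {"status": "started"}


def lambda_handler(event, context):
    # boto3 is imported lazily so run_with_alert can be tested offline.
    import boto3

    sns = boto3.client("sns")
    topic_arn = os.environ["ALERT_TOPIC_ARN"]  # hypothetical variable name

    def publish(message):
        sns.publish(TopicArn=topic_arn, Subject="Pipeline failure", Message=message)

    return run_with_alert(start_pipeline, publish)
```

Keeping the alert logic in `run_with_alert` separates it from the AWS client, so the failure path can be exercised locally with a plain list standing in for SNS.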
10x
Storage Compression via Parquet
Converting raw TSV files to Parquet achieves 10x compression. Partitioning by decade lets Athena queries scan only the relevant partitions, reducing cost and latency.
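Decade partitioning can be pictured as mapping each release year to a Hive-style `decade=...` prefix, which is the key=value layout Athena recognizes for partition pruning. The bucket name and path below are hypothetical; the source does not state the actual layout.

```python
def decade_of(year: int) -> int:
    """Round a year down to the start of its decade, e.g. 1993 -> 1990."""
    return year // 10 * 10


def partition_prefix(base: str, year: int) -> str:
    """Build a Hive-style partition prefix for the given release year."""
    return f"{base.rstrip('/')}/decade={decade_of(year)}/"


# Hypothetical bucket and path, for illustration only.
print(partition_prefix("s3://example-bucket/movies", 1993))
# prints "s3://example-bucket/movies/decade=1990/"
```

A query that filters on `decade = 1990` would then only need to read objects under that prefix.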
±0.11
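Prediction Error on Famous Films
The live SageMaker endpoint predicted Schindler's List at 9.11 vs 9.0 actual (error 0.11) and The Notebook at 7.88 vs 7.8 actual (error 0.08). Both spot checks land within 0.11 of the real rating, which suggests the model generalizes to unseen well-known films.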
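The two spot checks can be reproduced as absolute errors. The predicted and actual ratings are the ones quoted above; the variable names are illustrative.

```python
# (predicted, actual) ratings quoted above.
spot_checks = {
    "Schindler's List": (9.11, 9.0),
    "The Notebook": (7.88, 7.8),
}

# Absolute error per film, rounded to two decimals to hide float noise.
errors = {
    title: round(abs(predicted - actual), 2)
    for title, (predicted, actual) in spot_checks.items()
}

for title, error in errors.items():
    print(f"{title}: {error:.2f}")
# prints "Schindler's List: 0.11"
# prints "The Notebook: 0.08"

largest_error = max(errors.values())
```

Two films are a spot check rather than an evaluation set, so the largest error here is a bound on these examples only.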
~$1
Full Pipeline Cost on AWS
Entire project — ETL, ML training, endpoint deployment, monitoring — cost under $1 total using serverless-first architecture (Athena, Lambda, Glue on-demand).