IMDb Movie Analytics Pipeline

Pipeline Architecture — 10 Phases


🗄️

Phase 1

Data Lake

AWS S3

3 buckets
Raw/Processed/Features
Bronze-Silver-Gold


→

⚙️

Phase 2

ETL Pipeline

AWS Glue

PySpark
12.3M → 309K
Parquet by decade


→

🔍

Phase 3

Warehouse

AWS Athena

Serverless SQL
6 analytical views
Redshift workaround


→

🔧

Phase 4-5

ML Features

Glue + XGBoost

43 features
Director reputation
500MB in cloud


→

🤖

Phase 5-6

Train + Deploy

SageMaker

ml.m5.xlarge
RMSE=0.786
<100ms latency


→

📊

Phase 7-8

Dashboards

Grafana + CloudWatch

5 dashboards
CloudWatch alarms
SNS alerts


→

⚡

Phase 9

Automation

Lambda + EventBridge

Daily 2am UTC
~16 min total
Zero intervention


📈

Model Iterations

R² improvement across versions

baseline: 0.306
+interact: 0.306
50k filter: 0.462
+director: 0.664

KEY INSIGHT

Director reputation was the missing signal. Adding 6 director features lifted R² from 0.462 to 0.664 (about 44% over the previous version), for a 117% improvement over the 0.306 baseline.
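The page does not list the six director features, so the following is only a sketch of what one such feature could look like. It assumes a hypothetical "leave-one-out" director average: for each film, the mean rating of the same director's other films, so a film's own rating never leaks into its feature. The field names `tconst`, `director`, and `rating` are illustrative.

```python
from collections import defaultdict


def director_features(films):
    """Return {tconst: (other_film_count, leave_one_out_mean)} per film.

    `films` is a list of dicts with 'tconst', 'director', and 'rating' keys
    (hypothetical field names). The mean is None when the director has no
    other films to average over.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for film in films:
        totals[film["director"]] += film["rating"]
        counts[film["director"]] += 1

    features = {}
    for film in films:
        director = film["director"]
        others = counts[director] - 1
        # Exclude the film's own rating so the target is not leaked.
        loo_mean = (totals[director] - film["rating"]) / others if others else None
        features[film["tconst"]] = (others, loo_mean)
    return features
```

A first-time director gets `(0, None)`, which a tree model such as XGBoost can treat as a missing value.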

🎯

Live Endpoint Predictions

Actual vs predicted ratings (absolute error in parentheses)

Schindler's List: 9.0 → 9.11 (0.11)
The Notebook: 7.8 → 7.88 (0.08)
The Dark Knight: 9.0 → 8.58 (0.42)
Transformers: 7.0 → 7.33 (0.33)
Paranormal Activity: 6.3 → 5.42 (0.88)
Indie Drama 2023: N/A → 5.60 (new title, no actual rating)

Average error: 0.40
Latency: <100ms

☁️

AWS Services Used

Full stack breakdown

S3: Data lake
Glue: ETL / PySpark
Athena: SQL warehouse
SageMaker: Train + serve
Lambda: Orchestrator
EventBridge: Scheduler
CloudWatch: Monitoring
SNS: Alerts

Pipeline Run Duration

ETL: 193s
Features: 60s
Director: 671s
Total: ~16m

Student Account Workarounds — Engineering Judgment

⛔ Blocked

Amazon Redshift

OptInRequired error on student account.
→ Replaced with Athena: largely the same analytical SQL, serverless, and cheaper for intermittent queries. The Redshift DDL is preserved for a later migration.

⛔ Blocked

Amazon QuickSight

Credit card required even for free trial.
→ Replaced with Grafana Cloud — free tier, public URLs, more relevant for DE roles.

⚠️ Bottleneck

500MB Local Download

title.principals.tsv.gz took 15+ min to download every run.
→ Moved processing to Glue — cloud processes 500MB, downloads only 17MB output.
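As a rough illustration of why the output is so much smaller than the input, the sketch below streams a gzipped `title.principals` file and keeps only director rows. It uses the column names from IMDb's public dataset (`tconst`, `nconst`, `category`); the function name is hypothetical, and the actual Glue job presumably does the equivalent in PySpark.

```python
import csv
import gzip


def iter_directors(fileobj):
    """Yield (tconst, nconst) for each director row in a gzipped TSV stream.

    `fileobj` is a path or a binary file object. Rows are read one at a
    time, so the full file never has to fit in memory.
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8", newline="") as text:
        # IMDb TSVs are unquoted, so disable quote handling entirely.
        reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["category"] == "director":
                yield row["tconst"], row["nconst"]
```

Dropping every non-director row and every unused column is what would shrink a 500MB input to a small title-to-director mapping.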

Impact — What This Project Demonstrates

117%

Model Improvement via Domain Knowledge

Iterated through 4 model versions. Interaction features left R² unchanged at 0.306; director reputation turned out to be the missing signal. Overall, R² rose from 0.306 to 0.664.
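The 117% figure is a relative gain over the baseline. A small worked check, using the R² values reported for the four versions (the helper name `relative_gain` is just for illustration):

```python
def relative_gain(before, after):
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


# R² per model version, as reported on this page.
versions = [
    ("baseline", 0.306),
    ("+interact", 0.306),
    ("50k filter", 0.462),
    ("+director", 0.664),
]

# Gain of the final version over the baseline.
overall = relative_gain(versions[0][1], versions[-1][1])

# Gain of each version over the one before it.
step_gains = [
    (name, relative_gain(prev, cur))
    for (_, prev), (name, cur) in zip(versions, versions[1:])
]

print(f"overall: {overall:.0%}")
for name, gain in step_gains:
    print(f"{name}: {gain:.0%}")
```

The overall gain rounds to 117%, while the director step on its own is about 44% over the version before it.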

97.5%

Data Reduction with Quality Preservation

Filtered 12.3M raw records down to 309K clean movies — 97.5% reduction — while maintaining 99%+ completeness on all critical fields through quality thresholds.
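The exact quality thresholds are not stated here, so the sketch below only shows the general shape of such a filter plus a completeness check. The `min_votes` value and the choice of critical fields are assumptions; the column names and the `\N` null marker come from IMDb's public dataset, with rating columns assumed to be already joined onto each row.

```python
NULL = r"\N"  # IMDb's marker for a missing value

# Assumed set of critical fields; the real pipeline may use others.
CRITICAL_FIELDS = ("primaryTitle", "startYear", "genres", "averageRating")


def is_present(value):
    return value not in (None, "", NULL)


def keep(row, min_votes=100):
    """Keep movies with a usable vote count (threshold is hypothetical)."""
    return (
        row.get("titleType") == "movie"
        and is_present(row.get("numVotes"))
        and int(row["numVotes"]) >= min_votes
    )


def completeness(rows, fields=CRITICAL_FIELDS):
    """Fraction of rows with a non-null value, per field."""
    rows = list(rows)
    if not rows:
        return {field: 0.0 for field in fields}
    return {
        field: sum(is_present(r.get(field)) for r in rows) / len(rows)
        for field in fields
    }


def reduction_pct(raw_count, kept_count):
    return 100 * (1 - kept_count / raw_count)
```

With the counts quoted above, `reduction_pct(12_300_000, 309_000)` comes to roughly 97.5.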

0

Manual Interventions After Deployment

EventBridge + Lambda orchestration runs the full 16-minute pipeline daily at 2am UTC with automatic SNS failure alerts — zero human involvement required.
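A minimal sketch of the orchestration logic, independent of AWS: run the steps in order, and on the first failure send an alert and stop. In the real Lambda, each step would presumably wrap a boto3 call such as Glue's `start_job_run`, and `notify` would wrap SNS `publish`; here both are plain callables so the control flow can be run anywhere. The function and constant names are hypothetical.

```python
# EventBridge schedule expression for 02:00 UTC every day.
# Field order: minute hour day-of-month month day-of-week year.
DAILY_2AM_UTC = "cron(0 2 * * ? *)"


def run_pipeline(steps, notify):
    """Run (name, callable) steps in order and return the names that completed.

    On the first exception, call `notify` with a message and stop, so later
    steps never run against incomplete upstream output.
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            notify(f"Pipeline step '{name}' failed: {exc}")
            break
        completed.append(name)
    return completed
```

Stopping at the first failure keeps the feature and director jobs from running against a partial ETL result.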

10x

Storage Compression via Parquet

Converting raw TSV files to Parquet with decade partitioning achieves 10x compression. Athena queries scan only relevant partitions, reducing cost and latency.
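To make the partition-pruning point concrete, the sketch below derives a decade partition from a release year and lists which partitions a year-range query would touch. The Hive-style `decade=` key and the bucket prefix are assumptions; the page does not show the actual layout.

```python
def decade_of(year):
    """Map a release year to its decade, e.g. 1994 -> 1990."""
    return year // 10 * 10


def partition_path(prefix, year):
    """Hive-style partition path for a film released in `year`."""
    return f"{prefix}/decade={decade_of(year)}/"


def decades_for_range(start_year, end_year):
    """Decade partitions a query over [start_year, end_year] must scan."""
    return list(range(decade_of(start_year), decade_of(end_year) + 1, 10))
```

For example, a query restricted to 1994 through 2003 only needs the 1990 and 2000 partitions, so files for every other decade are never read.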

±0.11

Best Prediction Error on Famous Films

Live SageMaker endpoint predicted Schindler's List at 9.11 vs 9.0 actual and The Notebook at 7.88 vs 7.8. On these well-known films the model tracks actual ratings closely; errors on the other sampled titles range up to 0.88 (Paranormal Activity).
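The per-film errors can be recomputed from the actual and predicted ratings shown on this page. The snippet below does that for the five films that have an actual rating. Those five average about 0.36, so the 0.40 average reported on the dashboard presumably reflects a different or larger sample.

```python
# (actual, predicted) ratings as shown in the live endpoint examples.
predictions = {
    "Schindler's List": (9.0, 9.11),
    "The Notebook": (7.8, 7.88),
    "The Dark Knight": (9.0, 8.58),
    "Transformers": (7.0, 7.33),
    "Paranormal Activity": (6.3, 5.42),
}

# Absolute error per film, then the mean absolute error over all five.
abs_errors = {
    title: abs(predicted - actual)
    for title, (actual, predicted) in predictions.items()
}
mae = sum(abs_errors.values()) / len(abs_errors)
worst = max(abs_errors, key=abs_errors.get)

print(f"MAE over {len(abs_errors)} films: {mae:.2f}")
print(f"Largest error: {worst} ({abs_errors[worst]:.2f})")
```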

~$1

Full Pipeline Cost on AWS

Entire project — ETL, ML training, endpoint deployment, monitoring — cost under $1 total using serverless-first architecture (Athena, Lambda, Glue on-demand).
