Varun Vaddi - Data Scientist

About Me

I'm a Data Scientist and Data Engineer who builds end-to-end systems — from ingestion to inference. With an MS in Data Science from the University of Houston, I've shipped production pipelines at Accenture's fintech platform, genomics products that capture DNA from air, and ML models that turn raw data into decisions worth millions.

I work across the full stack: Kafka to Snowflake, dbt to SageMaker, XGBoost to RAG. Whether it's cutting feature prep from 6 hours to 23 seconds or reviewing 100+ ads/min with zero hallucinations — I build things that scale and insights that land.

I tend to overfit on Fridays, fine-tune on Saturdays, and regularize on Sundays. Currently seeking Data Engineer, Data Scientist, and AI/ML Engineer roles where the data is messy and the impact is real.

Work Experience

Data Science Intern

Wild Genomics, CA

May 2025 - Aug 2025

50% Increase in Analysis-Ready Data
90% Reduction in Redundant Data
40% Runtime Reduction

Partnered with research scientists to address low-quality sequencing data that was blocking critical insights; designed an automated Python pipeline with filtering and QC that raised analysis-ready data from 30% to 80%, enabling accurate predictive modeling for conservation decisions
Reduced data redundancy by ~90% through similarity-based clustering (97-99% identity), improving feature reliability for downstream ML models and accelerating time-to-insight for cross-sample biodiversity analysis
Enhanced species classification accuracy from 85% to 95% using optimized BLAST similarity thresholds, enabling genus-level precision that supported grant funding decisions and improved research outcome predictions
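
The similarity-based clustering mentioned above can be illustrated with a small toy sketch. It assumes a greedy, centroid-based scheme and uses `difflib.SequenceMatcher` as a stand-in for real alignment-based identity; the function names `identity` and `greedy_cluster` are hypothetical and this is not the pipeline's actual code. Dedicated tools such as VSEARCH handle this far more rigorously.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; an assumption standing in for alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.97):
    """Greedy centroid clustering.

    Each sequence joins the first centroid it matches at or above the
    threshold; otherwise it becomes a new centroid.
    """
    clusters = {}  # centroid sequence -> list of member sequences
    # Process longest sequences first so they tend to become centroids.
    for seq in sorted(seqs, key=len, reverse=True):
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            # No centroid was similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


reads = ["AAAAAAAAAA", "AAAAAAAAAA", "CCCCCCCCCC"]
clusters = greedy_cluster(reads, threshold=0.97)
print(len(clusters))  # 2
```

Keeping only one representative per cluster is what removes the redundant reads; the threshold corresponds to the identity cutoff chosen for the dataset.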

Python Machine Learning Genomics VSEARCH BLAST

Data Analyst

University of Houston, TX

Jul 2024 - Dec 2025

70% Processing Time Reduction
40K+ Records per Cycle
Real-Time Data Sync

Partnered with admissions leadership to support development of a predictive model that analyzed 40K+ applications per cycle and identified key graduation predictors; translated findings into automated admission thresholds that reduced manual processing by 70%, accelerated decisions by 5 days, and improved predicted graduate success rates by 12%
Built automated data pipelines integrating multiple admissions systems and designed Tableau dashboards enabling officers to prioritize high-risk applications, improving decision consistency and supporting evidence-based policy changes

Oracle PeopleSoft ETL pipeline Data Integration

Data Analyst

Accenture, India

Sep 2021 - Dec 2023

$25M Revenue Growth
80M+ Records Managed
30% Efficiency Improvement

Partnered with sales leadership to optimize SaaS data systems managing 80M+ records; streamlined analytics workflows and improved deployment speed by 30%, enabling faster identification of high-value leads and sales strategies
Led CRM data migration projects and developed SQL-based analytics pipelines that identified conversion bottlenecks and optimized sales funnels, contributing to $25M in revenue growth through improved lead prioritization
Defined KPIs with business stakeholders and automated Tableau dashboards monitoring feature adoption and deal conversions, delivering actionable insights that increased conversion rates by 10% and informed product roadmap decisions

Salesforce Java SQL CRM Analytics Jira Agile

Software Engineer Intern

Capgemini, India

Apr 2021 - Jul 2021

7 Data Models Designed
100K+ Records Managed
Full stack Development

Designed and implemented relational data models and normalized database schemas in Oracle, ensuring reliable storage and structured data ready for analytics
Developed backend services using Java and Spring Boot, supporting transactional data handling, multi-entity tracking, and operational workflows
Enhanced front-end interactivity using JavaScript, improving usability and visualization of operational data

Java Oracle Spring Boot JavaScript

Featured Projects

Travel & Hospitality

Urban Mobility Analytics Platform

End-to-end analytics pipeline for NYC taxi operations (8.6M+ trips, $180M+ quarterly revenue), combining batch and streaming architectures to enable real-time demand forecasting and dynamic pricing optimization.

⚡

Performance

Reduced feature prep time by over 99% (6h → 23s) using Snowflake, dbt, and Dockerized Airflow, with 37 automated validations maintaining 99.88% data quality.

🔄

Real-time Integration

Kafka streaming cut ingestion latency from 24h batches → <5 min, processing 10 msgs/sec for live demand signals.

📊

Business Impact

Tableau dashboards uncovered 30% demand spike at peak hours, 18% credit card revenue premium, and high-value corridor patterns driving driver allocation and pricing strategies.

Tech Stack

Snowflake dbt Airflow Kafka Docker Tableau

Media & Tech

Digital Ads Risk Engine

AI-powered RAG system for digital advertising compliance, automating Google Ads policy review across 341 regulatory chunks from 25 documents to enable scalable, real-time ad moderation.

⚡

Performance

Hybrid retrieval (BM25 + BGE + reranking) achieved 80% accuracy, 78% Recall@5, reducing review time from 5–10 min to <1 sec per ad with <350ms query latency.
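
For readers unfamiliar with the Recall@5 figure, here is a minimal sketch of how such a metric is commonly computed. It assumes the "fraction of relevant chunks found in the top k" definition; the project may have used a different variant (for example, hit rate), and the function name and chunk IDs are hypothetical.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # no ground truth for this query
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranked chunk IDs returned for one ad, best match first.
ranked = ["c1", "c2", "c3", "c4", "c5", "c6"]
# Hypothetical ground-truth policy chunks for that ad.
truth = {"c2", "c6"}

print(recall_at_k(ranked, truth, k=5))  # 0.5
```

Averaging this value over an evaluation set of labeled ads gives a single Recall@5 score for the retriever.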

🤖

LLM Integration

Gemini-powered structured outputs (Pydantic schemas) delivered zero hallucinations, 100% citation transparency, and automated risk scoring with confidence intervals.

📈

Business Impact

Compliance at scale — 1,000+ ads/sec capacity, reduced manual review costs, and cut policy violations by 40% through real-time risk detection.

Tech Stack

RAG LLM FAISS Embeddings Web Scraping Python

Retail

Grocery ML & Recommendation Platform

End-to-end ML platform processing 10.6M orders to deliver personalized recommendations, churn prevention, and customer lifetime value prediction — identifying $4.65M in annual business opportunity across 7 customer segments.

🤖

ML Models

5 models on 10.6M orders — reorder prediction 80% AUC, customer LTV 90% R², churn detection 77% AUC, RFM segmentation (7 segments), and 149 basket rules with up to 296x lift.
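
The lift figure quoted for the basket rules follows the standard association-rule definition: how much more often two item sets occur together than they would if they were independent. The sketch below uses made-up baskets and a hypothetical `lift` helper; it illustrates the formula only and is not the project's code.

```python
def lift(baskets, antecedent, consequent):
    """lift(A -> C) = support(A and C) / (support(A) * support(C))."""
    a, c = set(antecedent), set(consequent)
    n = len(baskets)
    if n == 0:
        return 0.0
    support_a = sum(a <= set(b) for b in baskets) / n
    support_c = sum(c <= set(b) for b in baskets) / n
    support_ac = sum((a | c) <= set(b) for b in baskets) / n
    if support_a == 0 or support_c == 0:
        return 0.0  # rule is undefined when either side never occurs
    return support_ac / (support_a * support_c)


# Made-up orders: bread and butter always appear together.
orders = [{"bread", "butter"}, {"bread", "butter"}, {"milk"}, {"milk"}]
print(lift(orders, {"bread"}, {"butter"}))  # 2.0
```

A lift of 1 means the items are independent; very large values typically come from rare items that are almost always bought together.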

⚙️

Advanced Techniques

SHAP explainability, 45+ engineered features, and data leakage detection that corrected an inflated 100% AUC to a realistic 77%. Hyperopt tuning added +2% AUC on the churn model.

💰

Business Impact

Identified $4.65M annual opportunity — RFM strategies ($3.45M), basket bundles ($1.15M), churn prevention ($978K). Streamlit dashboard with What-If simulator and ROI calculator.

Tech Stack

Databricks XGBoost MLflow Hyperopt Azure Apache Spark SHAP Streamlit

Media & Entertainment

Movie Data Analytics Platform

End-to-end AWS data pipeline processing 12.3M IMDb records with XGBoost movie rating prediction (R²=0.664) — combining batch ETL, feature engineering, ML training, real-time inference, and automated orchestration into a single production-grade system.

⚙️

Data Engineering

AWS Glue PySpark pipeline reduced 12.3M records → 309K clean movies with 99%+ data quality. Decade-partitioned Parquet achieved 10x compression over CSV with zero-cost Athena querying.
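
Decade partitioning can be pictured as deriving a partition key from each movie's release year, so a query that filters on one decade only reads that decade's files. The sketch below is plain Python rather than PySpark; the `decade=` key name, the helper function, and the sample records are assumptions for illustration, not the pipeline's actual layout.

```python
from collections import defaultdict


def decade_partition(year: int) -> str:
    """Return a Hive-style partition key such as 'decade=1990'."""
    return f"decade={(year // 10) * 10}"


# Hypothetical sample records: (title, release year).
movies = [("Schindler's List", 1993), ("Movie A", 1999), ("Movie B", 2004)]

# Group titles by the partition they would be written to.
partitions = defaultdict(list)
for title, year in movies:
    partitions[decade_partition(year)].append(title)

print(sorted(partitions))  # ['decade=1990', 'decade=2000']
```

In a Spark job the same idea is usually expressed by adding a decade column and partitioning the Parquet output on it.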

🤖

Machine Learning

Engineered 43 features across 4 XGBoost iterations — improved R² from 0.306 → 0.664. SageMaker endpoint predicts Schindler's List at 9.11 vs 9.0 actual.

⚡

Orchestration

Automated via EventBridge + Lambda on daily schedule — 3 Glue jobs in sequence, ~16 min runtime, zero manual intervention. Grafana + SNS alerting on failures.

Tech Stack

AWS Glue SageMaker Athena XGBoost EventBridge Lambda Grafana PySpark

Recommendations

I had the pleasure of working with Varun during his internship at Wild Genomics as a Data Science Intern. From day one, Varun impressed us with his technical acumen, professionalism, and eagerness to contribute meaningfully to our mission. He developed and optimized bioinformatics pipelines for complex environmental DNA datasets, demonstrating strong skills in Python, R, and machine learning. What stood out most was Varun's ability to learn quickly, adapt to new challenges, and consistently deliver results. He brought positive energy to the team and demonstrated strong initiative while remaining collaborative. I wholeheartedly recommend him to any organization looking for a talented and driven data scientist.

Eirik Torheim

CEO, Wild Genomics

It's been an absolute pleasure working with Varun during his internship at Wild Genomics. Varun approached his role with professionalism, curiosity, and a strong drive to learn. He led the design and implementation of an end-to-end bioinformatics pipeline for processing complex, multi-marker eDNA sequencing datasets, delivering a solution that is both robust and scalable. Varun stood out for his teamwork, clear communication, and creative problem-solving. The pipeline he developed is already revealing species-level insights from airborne eDNA that traditional methods would struggle to capture.

Bilgenur Baloglu

CSO, Wild Genomics

Certifications

Building Transformer based NLP Apps

OCI Gen AI Professional

Data Visualization in Excel & Power BI

SQL Advanced

Skills & Technologies

Programming

Data Engineering

AI/ML

Cloud & Tools

Education

M.S - Data Science

B.E - Electronics Engineering