
Data Scientist | 3+ years in FinTech, Product Analytics | UH '25 Alum

About Me

I'm a Data Scientist and Data Engineer who builds end-to-end systems — from ingestion to inference. With an MS in Data Science from the University of Houston, I've shipped production pipelines at Accenture's fintech platform, genomics products that capture DNA from air, and ML models that turn raw data into decisions worth millions.

I work across the full stack: Kafka to Snowflake, dbt to SageMaker, XGBoost to RAG. Whether it's cutting feature prep from 6 hours to 23 seconds or reviewing 100+ ads/min with zero hallucinations — I build things that scale and insights that land.

I tend to overfit on Fridays, fine-tune on Saturdays, and regularize on Sundays. Currently seeking Data Engineer, Data Scientist, and AI/ML Engineer roles where the data is messy and the impact is real.

Work Experience

Data Science Intern

Wild Genomics, CA

May 2025 - Aug 2025
50-Point Increase in Analysis-Ready Data
90% Reduction in Redundant Data
40% Runtime Reduction
  • Partnered with research scientists to address low-quality sequencing data that was blocking critical insights; designed an automated Python pipeline with filtering and QC that increased analysis-ready data from 30% to 80%, enabling accurate predictive modeling for conservation decisions
  • Reduced data redundancy by ~90% through similarity-based clustering (97-99% identity), improving feature reliability for downstream ML models and accelerating time-to-insight for cross-sample biodiversity analysis
  • Enhanced species classification accuracy from 85% to 95% using optimized BLAST similarity thresholds, enabling genus-level precision that supported grant funding decisions and improved research outcome predictions
Python Machine Learning Genomics VSEARCH BLAST
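
A minimal sketch of the similarity-based clustering idea described above, assuming a greedy centroid approach. This is not the pipeline's actual code: `difflib.SequenceMatcher` is a rough stand-in for the alignment-based identity that a tool such as VSEARCH computes, and `identity` and `greedy_cluster` are hypothetical names used only for illustration.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: list[str], threshold: float = 0.97) -> list[list[str]]:
    """Put each sequence in the first cluster whose centroid it matches,
    otherwise start a new cluster with that sequence as the centroid."""
    clusters: list[list[str]] = []  # clusters[i][0] is the centroid
    # Visit longer sequences first so they become centroids.
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # No centroid was similar enough: seq founds a new cluster.
            clusters.append([seq])
    return clusters
```

Keeping only each cluster's centroid is one way near-duplicate reads can be collapsed before downstream modeling.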

Data Analyst

University of Houston, TX

Jul 2024 - Dec 2025
70% Processing Time Reduction
40K+ Records per Cycle
Real-Time Data Sync
  • Partnered with admissions leadership to support development of a predictive model analyzing 40K+ applications per cycle and identifying key graduation predictors; translated findings into automated admission thresholds that reduced manual processing by 70%, accelerated decisions by 5 days, and improved predicted graduate success rates by 12%
  • Built automated data pipelines integrating multiple admissions systems and designed Tableau dashboards enabling officers to prioritize high-risk applications, improving decision consistency and supporting evidence-based policy changes
Oracle PeopleSoft ETL pipeline Data Integration

Data Analyst

Accenture, India

Sep 2021 - Dec 2023
$25M Revenue Growth
80M+ Records Managed
30% Efficiency Improvement
  • Partnered with sales leadership to optimize SaaS data systems managing 80M+ records; streamlined analytics workflows and improved deployment speed by 30%, enabling faster identification of high-value leads and sales strategies
  • Led CRM data migration projects and developed SQL-based analytics pipelines that identified conversion bottlenecks and optimized sales funnels, contributing to $25M in revenue growth through improved lead prioritization
  • Defined KPIs with business stakeholders and automated Tableau dashboards monitoring feature adoption and deal conversions, delivering actionable insights that increased conversion rates by 10% and informed product roadmap decisions
Salesforce Java SQL CRM Analytics Jira Agile

Software Engineer Intern

Capgemini, India

Apr 2021 - Jul 2021
7 Data Models Designed
100K+ Records Managed
Full-Stack Development
  • Designed and implemented relational data models and normalized database schemas in Oracle, ensuring reliable storage and structured data ready for analytics
  • Developed backend services using Java and Spring Boot, supporting transactional data handling, multi-entity tracking, and operational workflows
  • Enhanced front-end interactivity using JavaScript, improving usability and visualization of operational data
Java Oracle Spring Boot JavaScript

Featured Projects

Travel & Hospitality

Urban Mobility Analytics Platform

End-to-end analytics pipeline for NYC taxi operations (8.6M+ trips, $180M+ quarterly revenue), combining batch and streaming architectures to enable real-time demand forecasting and dynamic pricing optimization.

Reduced feature prep time by over 99% (6h → 23s) using Snowflake, dbt, and Dockerized Airflow, with 37 automated validations maintaining 99.88% data quality.
Kafka streaming cut ingestion latency from 24h batches → <5 min, processing 10 msgs/sec for live demand signals.
Tableau dashboards uncovered 30% demand spike at peak hours, 18% credit card revenue premium, and high-value corridor patterns driving driver allocation and pricing strategies.
Tech Stack
Snowflake dbt Airflow Kafka Docker Tableau
Media & Tech

Digital Ads Risk Engine

AI-powered RAG system for digital advertising compliance, automating Google Ads policy review across 341 regulatory chunks from 25 documents to enable scalable, real-time ad moderation.

Hybrid retrieval (BM25 + BGE + reranking) achieved 80% accuracy, 78% Recall@5, reducing review time from 5–10 min to <1 sec per ad with <350ms query latency.
Gemini-powered structured outputs (Pydantic schemas) delivered zero hallucinations, 100% citation transparency, and automated risk scoring with confidence intervals.
Compliance at scale — 1,000+ ads/sec capacity, reduced manual review costs, and cut policy violations by 40% through real-time risk detection.
Tech Stack
RAG LLM FAISS Embeddings Web Scraping Python
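
For context on the Recall@5 figure quoted above, here is a minimal sketch of one common definition of the metric: the share of relevant policy chunks that appear among the top five retrieved results, averaged over queries. The function names and chunk IDs are hypothetical, and the project's own evaluation may define the metric differently.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average Recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Hypothetical example: two queries against made-up chunk IDs.
runs = [
    (["c1", "c7", "c3", "c9", "c2", "c4"], {"c3", "c4"}),  # c4 is ranked 6th: missed
    (["c5", "c8"], {"c5"}),                                # the only relevant chunk is found
]
score = mean_recall_at_k(runs, k=5)
```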
Retail

Grocery ML & Recommendation Platform

End-to-end ML platform processing 10.6M orders to deliver personalized recommendations, churn prevention, and customer lifetime value prediction — identifying $4.65M in annual business opportunity across 7 customer segments.

5 models on 10.6M orders — reorder prediction 80% AUC, customer LTV 90% R², churn detection 77% AUC, RFM segmentation (7 segments), and 149 basket rules with up to 296x lift.
SHAP explainability, 45+ engineered features, and data leakage detection that exposed an inflated 100% AUC and brought it down to a realistic 77%. Hyperopt tuning added +2% AUC on the churn model.
Identified $4.65M annual opportunity — RFM strategies ($3.45M), basket bundles ($1.15M), churn prevention ($978K). Streamlit dashboard with What-If simulator and ROI calculator.
Tech Stack
Databricks XGBoost MLflow Hyperopt Azure Apache Spark SHAP Streamlit
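
For context on the lift figure quoted for the basket rules above, here is a minimal sketch of the standard definition: how much more often two item sets co-occur than they would if they were independent. The `lift` function and the toy baskets are hypothetical and are not the project's data or code.

```python
def lift(baskets: list[set[str]], antecedent: set[str], consequent: set[str]) -> float:
    """lift(A -> C) = support(A and C) / (support(A) * support(C))."""
    n = len(baskets)
    # Support = share of baskets containing every item in the set.
    support_a = sum(antecedent <= b for b in baskets) / n
    support_c = sum(consequent <= b for b in baskets) / n
    support_both = sum((antecedent | consequent) <= b for b in baskets) / n
    if support_a == 0 or support_c == 0:
        return 0.0
    return support_both / (support_a * support_c)


# Toy data: chips and salsa always appear together, in half of the baskets.
baskets = [{"chips", "salsa"}, {"chips", "salsa"}, {"milk"}, {"bread"}]
chips_salsa_lift = lift(baskets, {"chips"}, {"salsa"})
```

A lift above 1 means the items are bought together more often than chance would suggest; very high values usually come from rare items that almost always appear together.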
Media & Entertainment

Movie Data Analytics Platform

End-to-end AWS data pipeline processing 12.3M IMDb records with XGBoost movie rating prediction (R²=0.664) — combining batch ETL, feature engineering, ML training, real-time inference, and automated orchestration into a single production-grade system.

AWS Glue PySpark pipeline reduced 12.3M records → 309K clean movies with 99%+ data quality. Decade-partitioned Parquet achieved 10x compression over CSV with zero-cost Athena querying.
Engineered 43 features across 4 XGBoost iterations — improved R² from 0.306 → 0.664. SageMaker endpoint predicts Schindler's List at 9.11 vs 9.0 actual.
Automated via EventBridge + Lambda on daily schedule — 3 Glue jobs in sequence, ~16 min runtime, zero manual intervention. Grafana + SNS alerting on failures.
Tech Stack
AWS Glue SageMaker Athena XGBoost EventBridge Lambda Grafana PySpark

Certifications

Building Transformer based NLP Apps

Nvidia

OCI Gen AI Professional

Oracle

Data Visualization in Excel & Power BI

HPE DSI

SQL Advanced

HackerRank

Skills & Technologies

Programming

  • Python
  • SQL
  • R
  • Java
  • PostgreSQL
  • NoSQL

Data Engineering

  • Snowflake
  • Databricks
  • Apache Spark
  • dbt
  • Airflow
  • Docker
  • Kubernetes
  • Kafka

AI/ML

  • Machine Learning
  • Deep Learning
  • NLP & Transformers
  • LLM & RAG
  • Generative AI
  • MLOps
  • SageMaker, MLflow

Cloud & Tools

  • AWS
  • Azure
  • Tableau
  • Power BI
  • Advanced Excel
  • Grafana

Education

M.S. - Data Science

University of Houston

Houston, TX, USA

GPA: 3.7/4.0

B.E. - Electronics Engineering

Jawaharlal Nehru Technological University

Hyderabad, India

GPA: 3.68/4.0

Recommendations


I had the pleasure of working with Varun during his internship at Wild Genomics as a Data Science Intern. From day one, Varun impressed us with his technical acumen, professionalism, and eagerness to contribute meaningfully to our mission. He developed and optimized bioinformatics pipelines for complex environmental DNA datasets, demonstrating strong skills in Python, R, and machine learning. What stood out most was Varun's ability to learn quickly, adapt to new challenges, and consistently deliver results. He brought positive energy to the team and demonstrated strong initiative while remaining collaborative. I wholeheartedly recommend him to any organization looking for a talented and driven data scientist.

Eirik Torheim

CEO, Wild Genomics


It's been an absolute pleasure working with Varun during his internship at Wild Genomics. Varun approached his role with professionalism, curiosity, and a strong drive to learn. He led the design and implementation of an end-to-end bioinformatics pipeline for processing complex, multi-marker eDNA sequencing datasets, delivering a solution that is both robust and scalable. Varun stood out for his teamwork, clear communication, and creative problem-solving. The pipeline he developed is already revealing species-level insights from airborne eDNA that traditional methods would struggle to capture.

Bilgenur Baloglu

CSO, Wild Genomics


Get In Touch

I'm actively seeking opportunities in data science, data engineering, and ML engineering roles. Let's connect!

📍 Houston, TX | 📞 (713) 539-6996