The Challenge
This company operates one of the world's largest independent advertising platforms, powering retargeting, prospecting, and email marketing campaigns for thousands of businesses globally.
The Data Team was responsible for the infrastructure that processed, transformed, and served the data behind all ad optimization decisions. The scale was staggering:
- 1.5 billion records ingested per day from ad impressions, clicks, conversions, and attribution events
- Petabyte-level data storage across multiple data stores and processing layers
- Real-time requirements — ad optimization models needed fresh data to make bidding decisions in milliseconds
- Cost sensitivity — at this scale, even small inefficiencies in processing translated into significant cloud spend
The core challenge wasn't just handling the volume. It was building systems that could process data at this scale reliably, cost-effectively, and fast enough to power real-time business decisions.
My Role & Approach
I joined the Data Team as a Data Software Engineer, working primarily with Scala, Java, JavaScript, and Python across the full pipeline stack.
Pipeline Architecture
I built and maintained data pipelines across the entire data lifecycle — ingestion from ad exchange events, transformation and enrichment, aggregation for reporting, and serving to downstream consumers including the ML models that powered ad optimization.
The stack was primarily AWS-based: EMR (Hadoop/Spark) for batch processing, S3 for storage, DynamoDB for low-latency serving, RDS for relational data, and Lambda and Batch for orchestration and event-driven processing.
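A minimal sketch of what one such daily batch job looked like in Spark. The bucket names, event schema, and hard-coded date are illustrative placeholders rather than production values, but the shape — ingest raw JSON from S3, join and enrich, aggregate per campaign, write columnar output for downstream consumers — is the lifecycle described above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DailyCampaignMetrics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-campaign-metrics")
      .getOrCreate()

    // Ingestion: raw ad-exchange events land in S3 as JSON, one prefix per day.
    val impressions = spark.read.json("s3://raw-events/impressions/dt=2020-01-01/")
    val clicks      = spark.read.json("s3://raw-events/clicks/dt=2020-01-01/")

    // Transformation: join clicks back to impressions to get a CTR-ready view.
    val joined = impressions
      .join(clicks.select(col("impression_id"), lit(1).as("clicked")),
            Seq("impression_id"), "left")
      .na.fill(0, Seq("clicked"))

    // Aggregation: per-campaign daily metrics for reporting and ML features.
    val metrics = joined.groupBy("campaign_id")
      .agg(count("*").as("impressions"), sum("clicked").as("clicks"))

    // Serving: columnar output read by reporting jobs; a separate loader
    // step pushed low-latency aggregates into DynamoDB.
    metrics.write.mode("overwrite")
      .parquet("s3://curated/campaign_metrics/dt=2020-01-01/")

    spark.stop()
  }
}
```

In production the date came from the scheduler and the DynamoDB load ran as its own step; the point here is the lifecycle each pipeline walked through.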
Performance Engineering
At petabyte scale, performance engineering isn't optional — it's survival. I worked extensively on:
- Shuffle optimization — partition sizing, join strategies, and broadcast optimizations to keep Spark jobs healthy and cost-effective (see the broadcast-join sketch after this list)
- Memory management — diagnosing and resolving OOM errors, configuring spill-to-disk behavior, and tuning memory allocation for large joins (the configuration sketch after this list shows the kind of knobs involved)
- Cost optimization — refactoring pipelines to reduce EMR cluster runtime, optimizing S3 storage patterns, and right-sizing compute resources
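To make the broadcast point concrete, here is a hedged sketch of the pattern. The table names and S3 paths are hypothetical, but the technique — shipping a small dimension table to every executor via Spark's `broadcast` hint so the billions-row fact table never has to shuffle — is the standard one:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .getOrCreate()

    // Large fact table: billions of click rows, far too big to shuffle cheaply.
    val clicks = spark.read.parquet("s3://curated/clicks/")

    // Small dimension table: thousands of campaign rows. Broadcasting it to
    // every executor replaces a full shuffle join with a local hash join,
    // removing the most expensive stage of the job.
    val campaigns = spark.read.parquet("s3://reference/campaigns/")
    val joined = clicks.join(broadcast(campaigns), Seq("campaign_id"))

    joined.write.mode("overwrite").parquet("s3://curated/clicks_enriched/")
    spark.stop()
  }
}
```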
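And a sketch of the memory-related knobs. The values below are illustrative, not what ran in production (those were set per-job through spark-submit and EMR configuration), but these are the settings that typically decide whether a large join spills, OOMs, or completes cleanly:

```scala
import org.apache.spark.sql.SparkSession

// Representative tuning knobs for a shuffle-heavy job; values are
// illustrative placeholders, not production settings.
val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  // More, smaller shuffle partitions: each task holds less state in memory.
  // The default of 200 is far too few for billions of rows.
  .config("spark.sql.shuffle.partitions", "4096")
  // Fraction of heap shared by execution and storage; raising it gives
  // joins and aggregations more room before spilling to disk.
  .config("spark.memory.fraction", "0.7")
  // Off-heap headroom the JVM heap doesn't account for; undersizing it is
  // a classic cause of container kills that look like mysterious OOMs.
  .config("spark.executor.memoryOverhead", "4g")
  // Let Spark auto-broadcast dimension tables up to this size (in bytes)
  // instead of shuffling both sides of the join.
  .config("spark.sql.autoBroadcastJoinThreshold", (256 * 1024 * 1024).toString)
  .getOrCreate()
```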
Reliability at Scale
When your pipelines process 1.5 billion records daily, failures have immediate business impact — stale data means suboptimal ad bids, which means lost revenue for thousands of customers. I focused on building pipelines that were not just fast, but resilient: idempotent processing, automated recovery, comprehensive monitoring, and alerting that caught issues before they cascaded.
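A concrete example of the idempotent-processing pattern, sketched with assumed paths and a hypothetical `dt` partition column: with Spark's dynamic partition overwrite, a rerun for a given day rewrites only that day's partition, so a retry after a mid-write failure converges to the same state instead of duplicating or clobbering data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object IdempotentDailyLoad {
  def main(args: Array[String]): Unit = {
    val runDate = args(0) // e.g. "2020-01-01", supplied by the scheduler

    val spark = SparkSession.builder()
      .appName(s"daily-load-$runDate")
      // Overwrite only the partitions this run produces, not the whole
      // table, so a retry converges to the same final state without
      // touching other days' data.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .getOrCreate()

    // Deterministic input: the job reads exactly one day's raw data, so a
    // rerun for the same date is a pure function of the same input.
    val daily = spark.read
      .parquet(s"s3://raw-events/conversions/dt=$runDate/")
      .withColumn("dt", lit(runDate))

    // ... transformations elided ...

    daily.write
      .mode("overwrite")
      .partitionBy("dt")
      .parquet("s3://curated/conversions/")

    spark.stop()
  }
}
```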
Results
The data infrastructure I helped build and maintain was the foundation for the platform's entire ad optimization engine — every bidding decision, every attribution calculation, every campaign report flowed through these pipelines.
Tech Stack
Scala, Java, JavaScript, and Python on an AWS-based platform: EMR (Hadoop/Spark), S3, DynamoDB, RDS, Lambda, and AWS Batch.
Key Takeaway
Working at this scale taught me that data engineering at petabyte level is fundamentally about economics and reliability, not just technology. Every architectural decision — partition strategy, storage format, processing cadence — has a direct cost and reliability impact. The best pipeline isn't the most sophisticated one; it's the one that runs predictably, recovers gracefully, and costs what you expect it to cost.
This experience is the foundation of how I approach every data architecture engagement today: start with the constraints (scale, latency, cost, reliability), then work backward to the simplest design that meets them.
Facing similar data scale challenges?
I help teams design and build data pipelines that work reliably at scale — without the overhead of a full-time hire.
Book a Discovery Call