The Challenge
This company operates one of the world's largest independent advertising platforms, powering retargeting, prospecting, and email marketing campaigns for thousands of businesses globally.
The Data Team was responsible for the infrastructure that processed, transformed, and served the data behind all ad optimization decisions. The scale was staggering:
- 1.5 billion records ingested per day from ad impressions, clicks, conversions, and attribution events
- Petabyte-level data storage across multiple data stores and processing layers
- Real-time requirements — ad optimization models needed fresh data to make bidding decisions in milliseconds
- Cost sensitivity — at this scale, even small inefficiencies in processing translated into significant cloud spend
The core challenge wasn't just handling the volume. It was building systems that could process data at this scale reliably, cost-effectively, and fast enough to power real-time business decisions.
My Role & Approach
I joined the Data Team as a Data Software Engineer, working primarily with Scala, Java, JavaScript, and Python across the full pipeline stack.
Pipeline Architecture
I built and maintained data pipelines across the entire data lifecycle — ingestion from ad exchange events, transformation and enrichment, aggregation for reporting, and serving to downstream consumers including the ML models that powered ad optimization.
The stack was primarily AWS-based: EMR (Hadoop/Spark) for batch processing, S3 for storage, DynamoDB for low-latency serving, RDS for relational data, and Lambda and Batch for orchestration and event-driven processing.
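A minimal sketch of what one such daily batch job looked like in Spark. The bucket names, event schema, and hard-coded date are illustrative placeholders rather than production values, but the shape — ingest raw JSON from S3, join and enrich, aggregate per campaign, write columnar output for downstream consumers — is the lifecycle described above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DailyCampaignMetrics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-campaign-metrics")
      .getOrCreate()

    // Ingestion: raw ad-exchange events land in S3 as JSON, one prefix per day.
    val impressions = spark.read.json("s3://raw-events/impressions/dt=2020-01-01/")
    val clicks      = spark.read.json("s3://raw-events/clicks/dt=2020-01-01/")

    // Transformation: join clicks back to impressions to get a CTR-ready view.
    val joined = impressions
      .join(clicks.select(col("impression_id"), lit(1).as("clicked")),
            Seq("impression_id"), "left")
      .na.fill(0, Seq("clicked"))

    // Aggregation: per-campaign daily metrics for reporting and ML features.
    val metrics = joined.groupBy("campaign_id")
      .agg(count("*").as("impressions"), sum("clicked").as("clicks"))

    // Serving: columnar output read by reporting jobs; a separate loader
    // step pushed low-latency aggregates into DynamoDB.
    metrics.write.mode("overwrite")
      .parquet("s3://curated/campaign_metrics/dt=2020-01-01/")

    spark.stop()
  }
}
```

In production the date came from the scheduler and the DynamoDB load ran as its own step; the point here is the lifecycle each pipeline walked through.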
Performance Engineering
At petabyte scale, performance engineering isn't optional — it's survival. I worked extensively on:
- Shuffle optimization — partition sizing, join strategies, and broadcast optimizations to keep Spark jobs healthy and cost-effective (see the broadcast-join sketch after this list)
- Memory management — diagnosing and resolving OOM errors, configuring spill-to-disk behavior, and tuning memory allocation for large joins (the configuration sketch after this list shows the kind of knobs involved)
- Cost optimization — refactoring pipelines to reduce EMR cluster runtime, optimizing S3 storage patterns, and right-sizing compute resources
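To make the broadcast point concrete, here is a hedged sketch of the pattern. The table names and S3 paths are hypothetical, but the technique — shipping a small dimension table to every executor via Spark's `broadcast` hint so the billions-row fact table never has to shuffle — is the standard one:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .getOrCreate()

    // Large fact table: billions of click rows, far too big to shuffle cheaply.
    val clicks = spark.read.parquet("s3://curated/clicks/")

    // Small dimension table: thousands of campaign rows. Broadcasting it to
    // every executor replaces a full shuffle join with a local hash join,
    // removing the most expensive stage of the job.
    val campaigns = spark.read.parquet("s3://reference/campaigns/")
    val joined = clicks.join(broadcast(campaigns), Seq("campaign_id"))

    joined.write.mode("overwrite").parquet("s3://curated/clicks_enriched/")
    spark.stop()
  }
}
```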
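And a sketch of the memory-related knobs. The values below are illustrative, not what ran in production (those were set per-job through spark-submit and EMR configuration), but these are the settings that typically decide whether a large join spills, OOMs, or completes cleanly:

```scala
import org.apache.spark.sql.SparkSession

// Representative tuning knobs for a shuffle-heavy job; values are
// illustrative placeholders, not production settings.
val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")
  // More, smaller shuffle partitions: each task holds less state in memory.
  // The default of 200 is far too few for billions of rows.
  .config("spark.sql.shuffle.partitions", "4096")
  // Fraction of heap shared by execution and storage; raising it gives
  // joins and aggregations more room before spilling to disk.
  .config("spark.memory.fraction", "0.7")
  // Off-heap headroom the JVM heap doesn't account for; undersizing it is
  // a classic cause of container kills that look like mysterious OOMs.
  .config("spark.executor.memoryOverhead", "4g")
  // Let Spark auto-broadcast dimension tables up to this size (in bytes)
  // instead of shuffling both sides of the join.
  .config("spark.sql.autoBroadcastJoinThreshold", (256 * 1024 * 1024).toString)
  .getOrCreate()
```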
Reliability at Scale
When your pipelines process 1.5 billion records daily, failures have immediate business impact — stale data means suboptimal ad bids, which means lost revenue for thousands of customers. I focused on building pipelines that were not just fast, but resilient: idempotent processing, automated recovery, comprehensive monitoring, and alerting that caught issues before they cascaded.
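A concrete example of the idempotent-processing pattern, sketched with assumed paths and a hypothetical `dt` partition column: with Spark's dynamic partition overwrite, a rerun for a given day rewrites only that day's partition, so a retry after a mid-write failure converges to the same state instead of duplicating or clobbering data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object IdempotentDailyLoad {
  def main(args: Array[String]): Unit = {
    val runDate = args(0) // e.g. "2020-01-01", supplied by the scheduler

    val spark = SparkSession.builder()
      .appName(s"daily-load-$runDate")
      // Overwrite only the partitions this run produces, not the whole
      // table, so a retry converges to the same final state without
      // touching other days' data.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .getOrCreate()

    // Deterministic input: the job reads exactly one day's raw data, so a
    // rerun for the same date is a pure function of the same input.
    val daily = spark.read
      .parquet(s"s3://raw-events/conversions/dt=$runDate/")
      .withColumn("dt", lit(runDate))

    // ... transformations elided ...

    daily.write
      .mode("overwrite")
      .partitionBy("dt")
      .parquet("s3://curated/conversions/")

    spark.stop()
  }
}
```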
Results
The data infrastructure I helped build and maintain was the foundation for the platform's entire ad optimization engine — every bidding decision, every attribution calculation, every campaign report flowed through these pipelines.
Tech Stack
Scala, Java, JavaScript, and Python on an AWS-based platform: EMR (Hadoop/Spark), S3, DynamoDB, RDS, Lambda, and AWS Batch.
Key Takeaway
Working at this scale taught me that data engineering at petabyte level is fundamentally about economics and reliability, not just technology. Every architectural decision — partition strategy, storage format, processing cadence — has a direct cost and reliability impact. The best pipeline isn't the most sophisticated one; it's the one that runs predictably, recovers gracefully, and costs what you expect it to cost.
This experience is the foundation of how I approach every data architecture engagement today: start with the constraints (scale, latency, cost, reliability), then work backward to the simplest design that meets them.
Facing similar data scale challenges?
I help teams design and build data pipelines that work reliably at scale — without the overhead of a full-time hire.
Book a Discovery Call