At a top-10 ad tech platform processing 1.5 billion records per day, the cloud bill was becoming an existential concern. Not because the pipelines were broken — they worked. Data arrived on time, transformations were correct, downstream systems were fed. The problem was simpler and more dangerous: costs were growing faster than revenue.
The platform ran hundreds of Spark jobs across dozens of EMR clusters, processing petabytes of ad impression, click, and conversion data. Every day, these pipelines ingested raw event streams, enriched them with audience segments and campaign metadata, computed attribution models, and produced the reporting tables that powered billing and analytics. The infrastructure bill was north of seven figures annually — and climbing 15-20% quarter over quarter while revenue grew at 10%.
Leadership asked a straightforward question: "Can we cut 30% without breaking anything?"
I delivered 40%. Not through a single breakthrough, but through a systematic approach that treated every pipeline decision as an economic one. This is how.
Thinking in dollars, not milliseconds
The first thing I had to unlearn — and help the team unlearn — was the instinct to optimize for speed. In traditional software engineering, performance means latency. Faster is better. But at petabyte scale, faster is not always cheaper. Sometimes the fastest pipeline is the most expensive one, because it achieves speed by throwing compute at the problem.
Consider a Spark job that processes a day's worth of impression data. You can run it on 200 nodes and finish in 15 minutes, or on 50 nodes and finish in 45 minutes. The first option uses four times the nodes but finishes only three times faster, so it burns about a third more node-hours (3,000 versus 2,250) to deliver the same result against the same downstream SLA. If your SLA is "data available by 8 AM" and both options finish before 8 AM, you're paying a premium for speed nobody needs.
This mental shift — from "how fast can this run?" to "what's the cheapest way to meet the SLA?" — was the foundation of everything that followed. Every optimization decision became an economics question: what does this cost, what does the alternative cost, and what's the actual business constraint?
I built a cost model that attached a dollar figure to every major pipeline. Not just the compute bill — the full cost: compute, storage, data transfer, failed-run waste, and the engineering time spent babysitting unstable jobs. When you see that a single poorly-partitioned join is responsible for $14,000/month in shuffle-related compute, the optimization priority becomes obvious.
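Here's a minimal sketch of the kind of cost model I mean. The rates, inputs, and failure-waste factor below are hypothetical placeholders; in a real engagement they come from billing exports, EMR usage reports, and Spark event logs.

```python
# Illustrative per-pipeline cost model. All rates and inputs below are
# hypothetical placeholders, not the platform's actual numbers.
NODE_HOUR_RATE = 0.35          # blended $/node-hour across instance types
STORAGE_RATE_GB_MONTH = 0.023  # hot-tier object storage, $/GB-month

def monthly_pipeline_cost(runs_per_day, nodes, runtime_hours,
                          storage_gb, failed_run_ratio=0.05):
    """Rough fully loaded monthly cost for one pipeline."""
    compute = runs_per_day * 30 * nodes * runtime_hours * NODE_HOUR_RATE
    failed_waste = compute * failed_run_ratio   # re-runs caused by failures
    storage = storage_gb * STORAGE_RATE_GB_MONTH
    return compute + failed_waste + storage

# Example: an hourly job on a 50-node cluster, ~0.8 hours per run,
# keeping ~250 TB of intermediates in hot storage.
print(f"${monthly_pipeline_cost(24, 50, 0.8, 250_000):,.0f}/month")
```

Even a spreadsheet version of this is enough to rank pipelines by dollars and make that $14,000/month join jump out.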
The four levers
After profiling every major pipeline and mapping costs to root causes, four optimization levers emerged. They aren't equally weighted — partition strategy was the single biggest lever, responsible for nearly half the total savings. But all four compound together.
Lever 1: Partition strategy
This was the single biggest win. The platform's pipelines had been built over years by different teams, each choosing partition keys based on intuition rather than data distribution analysis. The attribution pipeline — the most expensive job in the fleet — partitioned by advertiser ID. Reasonable in theory: you want to group all events for a given advertiser together for attribution computation.
The problem was that data distribution across advertisers followed a severe power law. The top 50 advertisers generated 40% of all impression volume. This meant that during the shuffle phase, a handful of partitions were 100x larger than the median. Those oversized partitions bottlenecked the entire job. Spark would finish 95% of tasks in 10 minutes, then spend another 40 minutes waiting for the few stragglers processing the giant partitions. The cluster sat mostly idle during that tail, burning compute for nothing.
The fix was a composite partition key: advertiser ID combined with a time bucket. For large advertisers, we subdivided their data into hourly chunks. For small advertisers, daily was fine. The shuffle volume dropped by 60%, and the job runtime went from 50 minutes to 18 minutes on the same cluster — or alternatively, we could run it on a smaller cluster in 30 minutes and cut compute costs by 55%.
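In PySpark terms, the change looked roughly like the sketch below. The DataFrame, column names, and the hard-coded list of heavy advertisers are illustrative; in production the heavy-hitter set was derived from volume statistics, not hand-maintained.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("composite-partition-sketch").getOrCreate()
events = spark.read.parquet("s3://bucket/events/")  # hypothetical path

# Heavy advertisers get an hourly sub-bucket so no single shuffle partition
# balloons; everyone else keeps a single daily bucket.
heavy_hitters = [101, 202, 303]  # in practice, derived from volume stats

events = events.withColumn(
    "part_key",
    F.when(
        F.col("advertiser_id").isin(heavy_hitters),
        F.concat_ws("-",
                    F.col("advertiser_id").cast("string"),
                    F.hour("event_ts").cast("string")),
    ).otherwise(F.col("advertiser_id").cast("string")),
)

# Shuffle on the composite key ahead of the expensive attribution stage
balanced = events.repartition(F.col("part_key"))
```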
Lever 2: Storage optimization
The platform stored intermediate data in row-oriented formats in several pipelines — a holdover from an era when the jobs were originally written. Switching to columnar storage with aggressive compression reduced storage footprint by 70% for those datasets. But the bigger win wasn't the storage bill itself — it was the read performance. Columnar formats with partition pruning meant that downstream jobs could read only the columns they needed, reducing I/O by 3-5x. Less I/O means shorter runtimes, which means less compute.
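The conversion itself is unremarkable. A hedged sketch, assuming Parquet with Spark's default snappy codec, and with hypothetical paths and columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-conversion-sketch").getOrCreate()

# Row-oriented intermediate (JSON here as a stand-in) rewritten as Parquet.
raw = spark.read.json("s3://bucket/intermediate/enriched-json/")
(raw.write
    .mode("overwrite")
    .option("compression", "snappy")   # moderate compression, cheap to decode
    .partitionBy("event_date")         # enables partition pruning downstream
    .parquet("s3://bucket/intermediate/enriched-parquet/"))

# Downstream readers now pay only for the partitions and columns they touch.
daily_costs = (spark.read.parquet("s3://bucket/intermediate/enriched-parquet/")
               .where("event_date = '2024-03-01'")
               .select("campaign_id", "cost_micros"))
```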
We also implemented lifecycle policies that moved data through storage tiers: hot storage for the last 7 days (frequently accessed for debugging and reprocessing), warm storage for 8-90 days, and cold archive beyond that. The raw event data alone was consuming significant storage spend on hot storage when 95% of it was never accessed after the first week.
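Assuming the data sits in S3, the tiering is a lifecycle rule. One caveat worth flagging: S3 won't transition objects to Standard-IA until they've spent 30 days in Standard, so a literal 7-day hot window needs something like Intelligent-Tiering instead. The sketch below uses 30/90-day boundaries to stay valid, and the bucket, prefix, and retention window are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Illustrative lifecycle rule: keep recent raw events hot, then tier down.
s3.put_bucket_lifecycle_configuration(
    Bucket="ad-events-raw",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-events",
                "Filter": {"Prefix": "raw-events/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},   # warm
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},  # cold
                ],
                "Expiration": {"Days": 730},  # hypothetical retention window
            }
        ]
    },
)
```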
Lever 3: Cluster right-sizing
Most teams over-provision by 2-3x. It's a rational behavior: if a job fails because the cluster was too small, you get paged at 3 AM. If a job succeeds because the cluster was too big, nobody notices the waste. The incentive structure rewards over-provisioning.
We fixed this by profiling every major job's actual resource utilization — CPU, memory, disk I/O, network — across two weeks of runs. The data was damning. The median cluster was using 35% of its provisioned memory and 45% of CPU. We right-sized aggressively, replacing on-demand instances with spot instances (with graceful handling for spot interruptions), and implemented autoscaling policies that matched cluster size to actual workload.
The key insight: compute-optimized instances for CPU-bound jobs (aggregations, joins), memory-optimized for jobs with large broadcast tables or caching requirements. The previous setup used one instance type for everything — the most expensive general-purpose option.
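On EMR, most of this lands in the cluster request. The fragment below is a hypothetical instance-fleet configuration of the kind I mean: bulk capacity on spot with an on-demand floor and fallback, and compute-optimized types for a CPU-bound job. The instance types, capacities, and timeout are illustrative, not the platform's actual values.

```python
# Hypothetical core-fleet fragment for boto3's emr.run_job_flow(...),
# passed as Instances={"InstanceFleets": [...]}.
core_fleet = {
    "Name": "core",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 4,    # small on-demand floor for stability
    "TargetSpotCapacity": 36,       # bulk of capacity on spot
    "InstanceTypeConfigs": [
        # Compute-optimized types for a CPU-bound aggregation/join workload;
        # several types listed so spot capacity is easier to obtain.
        {"InstanceType": "c5.4xlarge", "WeightedCapacity": 1},
        {"InstanceType": "c5a.4xlarge", "WeightedCapacity": 1},
        {"InstanceType": "c6i.4xlarge", "WeightedCapacity": 1},
    ],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "TimeoutDurationMinutes": 15,
            # Graceful handling of spot shortages: fall back to on-demand
            # rather than failing the run.
            "TimeoutAction": "SWITCH_TO_ON_DEMAND",
        }
    },
}
```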
Lever 4: Pipeline refactoring
Several critical pipelines were monolithic: a single Spark application that read raw data, applied a dozen transformations, and wrote final output. When these jobs failed at step 10 of 12, the entire pipeline had to restart from scratch. At the data volumes we were processing, a single restart could waste tens of thousands of dollars in compute.
We split the worst offenders into discrete stages with checkpointed intermediate outputs. Each stage could be restarted independently. This didn't make the happy path faster, but it dramatically reduced waste from failures. When a downstream API timeout caused a job to fail at the enrichment stage, we restarted from the enrichment checkpoint instead of re-reading and re-processing petabytes of raw data.
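Sketched in PySpark, the staged structure looks something like this. The paths, stage boundaries, and orchestration are illustrative, and the attribution logic itself is elided.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("staged-pipeline-sketch").getOrCreate()

# Hypothetical checkpoint locations; each stage writes a durable intermediate
# so a failure later in the pipeline never forces re-reading raw data.
RAW = "s3://bucket/raw/events/dt=2024-03-01/"
STAGE_CLEAN = "s3://bucket/checkpoints/clean/dt=2024-03-01/"
STAGE_ENRICHED = "s3://bucket/checkpoints/enriched/dt=2024-03-01/"
FINAL = "s3://bucket/output/attributed/dt=2024-03-01/"

def run_clean():
    df = spark.read.json(RAW)
    df.dropDuplicates(["event_id"]).write.mode("overwrite").parquet(STAGE_CLEAN)

def run_enrich():
    clean = spark.read.parquet(STAGE_CLEAN)           # restart point 1
    dims = spark.read.parquet("s3://bucket/dims/campaigns/")
    (clean.join(dims, "campaign_id", "left")
          .write.mode("overwrite").parquet(STAGE_ENRICHED))

def run_attribute():
    enriched = spark.read.parquet(STAGE_ENRICHED)     # restart point 2
    # ... attribution logic elided ...
    enriched.write.mode("overwrite").parquet(FINAL)

# The orchestrator (Airflow, Step Functions, etc.) runs each stage as its own
# task, so a failure in run_attribute() retries from STAGE_ENRICHED, not RAW.
```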
The Spark-specific wins
Beyond the four structural levers, several Spark-specific optimizations produced outsized returns. These are the techniques I reach for first on any Spark cost optimization engagement.
Broadcast joins for dimension tables
The enrichment pipeline joined the massive impression stream (billions of rows) with several dimension tables: campaign metadata (~500K rows), audience segments (~2M rows), and advertiser configuration (~50K rows). These joins were implemented as standard shuffle joins, meaning both sides of the join were redistributed across the cluster. For every join, the entire impression dataset was shuffled — hundreds of gigabytes of network I/O for a join against a table that could fit in memory on a single executor.
Switching to broadcast joins for these dimension tables eliminated the shuffle entirely. The small tables were broadcast to every executor, and the join happened locally. The enrichment pipeline's runtime dropped by 40%, and network I/O dropped by over 70%. The total compute savings from this single change was substantial — it ran every hour across the entire fleet.
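The change itself is a one-liner per join. A sketch, with hypothetical paths and join keys:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

impressions = spark.read.parquet("s3://bucket/impressions/dt=2024-03-01/")
campaigns = spark.read.parquet("s3://bucket/dims/campaigns/")   # ~500K rows
segments = spark.read.parquet("s3://bucket/dims/segments/")     # ~2M rows

# broadcast() ships the small table to every executor, so the billion-row
# impression stream is never shuffled for these joins.
enriched = (impressions
            .join(broadcast(campaigns), "campaign_id", "left")
            .join(broadcast(segments), "segment_id", "left"))

# Alternatively, raise the auto-broadcast threshold and let the optimizer decide.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(256 * 1024 * 1024))
```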
Partition pruning through write-time restructuring
Several downstream jobs read from a large events table but only needed data for specific date ranges and event types. The table was partitioned by date but not by event type. This meant every job had to read the entire day's data and filter in memory — scanning terabytes to use gigabytes.
We restructured the write path to add event type as a partition dimension. Downstream jobs that only needed click events could now read only the click partition, skipping impression and conversion data entirely. The read volume for some jobs dropped by 80%. The write path became slightly more complex, but the aggregate read savings dwarfed the write overhead since the data was written once and read dozens of times.
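A sketch of the restructured write path, assuming the events carry `event_date` and `event_type` columns; the paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-partitioning-sketch").getOrCreate()
events = spark.read.parquet("s3://bucket/events-staging/")  # hypothetical

# Write-time restructuring: add event_type as a partition dimension.
(events.write
    .mode("overwrite")
    .partitionBy("event_date", "event_type")
    .parquet("s3://bucket/events/"))

# A click-only downstream job now reads just the click partitions,
# skipping impression and conversion data entirely.
clicks = (spark.read.parquet("s3://bucket/events/")
          .where("event_date = '2024-03-01' AND event_type = 'click'"))
```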
Adaptive query execution
Spark's adaptive query execution (AQE) was available but disabled on the platform — the original team had encountered a bug in an early version and never re-enabled it after it was fixed. Enabling AQE with properly tuned thresholds gave us automatic partition coalescing (merging small partitions), skew join optimization (splitting oversized partitions at join time), and dynamic shuffle partition sizing. These were "free" improvements — no code changes, just configuration. They shaved 8-12% off runtimes across the board.
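For reference, the relevant Spark 3.x settings. The threshold values shown are common starting points, not the exact values we tuned on this platform, and the same lines can live in `spark-defaults.conf` instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-config-sketch").getOrCreate()

# AQE and skew-handling knobs (Spark 3.x). Values are illustrative starting
# points; tune them against your own shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
```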
Caching intermediate results
The attribution pipeline computed an intermediate dataset — enriched events with attribution weights — that was consumed by three separate downstream computations: billing aggregation, reporting rollups, and anomaly detection. Each downstream job re-derived this intermediate from raw data independently. Three jobs doing the same expensive work.
We refactored this into a shared computation: the intermediate dataset was materialized once and consumed by all three downstream jobs. The compute savings were close to 3x for that portion of the pipeline, and it simplified debugging because there was one version of the truth instead of three independently computed versions.
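A simplified sketch of the restructure; the attribution logic is a placeholder and the consumer queries are invented stand-ins for the three real downstream jobs:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shared-intermediate-sketch").getOrCreate()

SHARED = "s3://bucket/intermediate/attributed-events/dt=2024-03-01/"  # hypothetical

def compute_attribution(df):
    # Placeholder for the real attribution logic (weights, lookback windows, ...)
    return df.withColumn("attribution_weight", F.lit(1.0))

# Materialize the expensive intermediate exactly once...
raw = spark.read.parquet("s3://bucket/events/dt=2024-03-01/")
compute_attribution(raw).write.mode("overwrite").parquet(SHARED)

# ...and let all three consumers read the same copy instead of re-deriving it.
attributed = spark.read.parquet(SHARED)
billing = attributed.groupBy("advertiser_id").agg(F.sum("spend").alias("spend"))
reporting = attributed.groupBy("campaign_id", "event_date").count()
anomalies = attributed.where("attribution_weight > 1.0")
```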
What doesn't work
Not every optimization is worth doing. Some of the most tempting improvements actually make things worse when you account for the full cost — including complexity, maintenance burden, and the risk of breaking a working pipeline.
Micro-optimizations that save 2% but add complexity. We had engineers eager to optimize UDF performance by rewriting Python logic in Scala. The improvement was real — about 3% on that specific stage — but it introduced a mixed-language codebase that most of the team couldn't debug. The maintenance cost exceeded the savings within three months. The rule I established: no optimization that requires specialized knowledge to maintain, unless the savings are significant enough to justify dedicated ownership.
Over-engineering compression that slows reads. We experimented with maximum-level compression on intermediate datasets. Storage costs dropped, but decompression time increased so much that the net effect was more expensive — the compute cost of slower reads exceeded the storage savings. The sweet spot was moderate compression: substantial storage reduction with negligible decompression overhead.
Premature caching. Caching sounds like a free win, but every cached dataset consumes executor memory. Cache the wrong things and you reduce the memory available for actual computation, causing spills to disk that slow everything down. We found that only datasets read more than twice within the same application benefited from caching. Everything else was cheaper to recompute.
The compound effect
The most important thing to understand about cost optimization at petabyte scale is that small improvements multiply. A 5% improvement on a single job run is unremarkable. But when that job runs hourly, across 30 clusters, 365 days a year, that 5% applies to more than a quarter of a million runs — and compounds into hundreds of thousands of dollars in annual savings.
Let me show the actual math from the engagement. The attribution pipeline ran hourly on 8 clusters. Before optimization, each run cost approximately $12 in compute. After partition rebalancing (40% reduction), broadcast joins (15% reduction on top of that), and cluster right-sizing (another 20%), the per-run cost dropped to roughly $5. At 24 runs a day on each of 8 clusters, that's about 70,000 runs a year, and $7 saved per run works out to roughly $490,000 annually. That single pipeline was one of about 40 major pipelines.
The enrichment pipeline was similar: hourly runs on 12 clusters, $18 per run before optimization. Broadcast joins and partition pruning brought it to $9; by the same math, that's on the order of another $900,000 a year. And so on across the fleet.
The total 40% reduction came from dozens of individual optimizations, each modest on its own, compounding across frequency and fleet size. No single trick saved 40%. The system did.
This is why I'm skeptical of "one weird trick" optimization stories. At scale, sustainable cost reduction comes from systematically applying the right lever at every pipeline, not from finding a silver bullet.
The optimization playbook: where to start
When I'm handed a fleet of expensive Spark pipelines, I follow a specific diagnostic order. The order matters because it's ranked by impact: you want to fix the highest-leverage problems first, and each step's findings inform the next.

1. Build the cost model: attach dollar figures to every major pipeline and rank them.
2. Profile the most expensive jobs for partition skew and fix the partition strategy.
3. Fix the joins: broadcast the dimension tables, add write-time partitioning for pruning.
4. Move intermediates to columnar formats and tier the storage.
5. Right-size and re-type the clusters against the post-fix resource profile.
6. Decompose the monolithic pipelines into checkpointed stages.
7. Document what changed, why, and when to revisit it.
This order isn't arbitrary. Partition fixes reduce shuffle volume, which changes your cluster's resource profile, which changes whether refactoring is worthwhile. If you right-size the cluster before fixing partition skew, you'll have to right-size again after. If you refactor the pipeline into stages before fixing the joins, you might split in the wrong places. Start at the top of the funnel.
Documenting the economics
Here's the thing nobody tells you about cost optimization at scale: the savings are fragile. Every optimization decision exists for a reason — a specific data distribution, a particular access pattern, a measured resource profile. When those conditions change (and they will), the optimization may become counterproductive.
The partition strategy that saved 55% on the attribution pipeline was calibrated for a specific advertiser size distribution. If the platform onboards three massive new advertisers, the composite key boundaries need to shift. The broadcast join thresholds were set based on dimension table sizes — if those tables grow 10x, the broadcasts will cause out-of-memory failures.
I documented every optimization decision with three things: what we changed, why the original approach was expensive (with data), and when the optimization should be revisited. For the partition rebalancing: "Revisit when any single advertiser exceeds 8% of total daily volume." For the broadcast joins: "Revisit if any dimension table exceeds 2 GB." For the cluster sizing: "Re-profile quarterly or after any major workload change."
This documentation wasn't optional polish — it was a critical part of the optimization itself. Without it, the next engineer who encounters a performance problem won't understand why the partition key looks odd, won't know the thresholds, and will likely "fix" the optimization back to something that costs 40% more.
I've since formalized this into a structured knowledge base approach — where every infrastructure decision, its rationale, and its review triggers are captured in a cross-referenced, maintained system. You can read about that approach in The Knowledge Base Strategy. The principle is the same: institutional knowledge that isn't captured is institutional knowledge that's already being lost.
The bottom line
Cost optimization at petabyte scale is an engineering discipline, not a bag of tricks. The tools are straightforward — partition rebalancing, broadcast joins, cluster right-sizing, pipeline decomposition. The hard part is approaching it systematically: measuring before optimizing, understanding the economics of each decision, following the diagnostic order that maximizes impact, and documenting everything so the savings persist.
Here's what I'd tell any team staring at a cloud bill that's growing faster than revenue:
- Start with the cost model. You can't optimize what you can't measure. Attach dollar figures to every major pipeline before you touch any code.
- Fix partitions first. In every engagement I've done, partition skew is the single biggest source of waste. It's also the most satisfying to fix — the before/after numbers are dramatic.
- Think in SLAs, not speed. The cheapest pipeline that meets the business constraint is the right pipeline. Faster than necessary is waste.
- Small improvements compound. Don't chase silver bullets. A systematic 5% improvement across 40 pipelines running hourly produces more savings than a heroic 50% improvement on one job.
- Document the economics. Every optimization has a shelf life. Capture the assumptions so future engineers know when to revisit.
Leadership asked for 30%. We delivered 40%. Not because we found a clever trick, but because we treated every pipeline as an economic system and optimized the system, not the parts.
Pipeline costs growing faster than your business?
I've optimized petabyte-scale pipelines and I can help you find the wins hiding in your infrastructure.