A fast-growing logistics company in Brazil. Revenue doubling year over year. Forty-person engineering team. And not a single dashboard anyone trusted.

When I arrived, the company's data lived across four separate systems: a relational database backing the core product, a document store for driver and route information, flat files exported from a legacy warehouse management tool, and a third-party analytics platform that nobody maintained. The CEO made decisions by asking three different people for the same number and averaging the answers. The operations team ran reports by exporting CSVs and pasting them into spreadsheets. Finance reconciled revenue by hand every month because no two systems agreed on the total.

There was no data team. No warehouse. No pipelines. No orchestration. No monitoring. The company had grown from 10 to 200 employees in three years, and the data infrastructure hadn't grown at all.

My job was to build a data platform from zero. Not to evaluate vendors or write a strategy deck. To actually build it — architecture, implementation, pipelines, models, dashboards — and then hand it off to the team that would maintain it after I left.

This is what I learned. Not the theory, but the field guide: the decisions that matter, the order that actually works, and the mistakes that everyone makes (including me).

The first question: do you even need a platform?

Before you design anything, you need to honestly answer whether a data platform is the right solution. Not every company needs one. A platform introduces complexity — pipelines to maintain, infrastructure to monitor, schemas to evolve, people to hire. If your actual need is "one dashboard for the CEO," you might be better served by a BI tool connected to a read replica of your production database.

Here's the decision framework I use:

Do you need a data platform? Four questions:

  • Multiple data sources? If not, a BI tool on a read replica is fine; a direct connection covers it.
  • Regular analytics demand from the business? If not, simple ETL scripts are enough; automate what you need, no more.
  • Data volume growing beyond what a single query can handle? If not, a lightweight setup (BI tool plus a managed warehouse) will do.
  • A team ready to maintain a platform long-term? If not, hire first and build second; a platform without a team dies.

If the answer to all four is yes, you need a platform. Keep reading. Not every company needs a data platform; match the solution to the actual problem.

The logistics company checked every box. Four heterogeneous sources. Business leaders asking for analytics daily. Data volumes growing 40% quarter over quarter. And a committed CTO who was ready to hire a data engineer once the platform was built. So we built it.

Architecture: the four layers

Every data platform I've built — and I've built several — converges on the same four-layer architecture. Not because it's theoretically elegant, but because it's the simplest structure that solves the actual problems you'll face.

The four layers, from the bottom up:

  • Source systems: relational DB, document DB, file storage, legacy system.
  • Lake layer: immutable raw data, schema-on-read, append-only. Your insurance policy.
  • Warehouse layer: modeled, tested, documented analytical tables. The single source of truth.
  • Analytics layer: self-service BI, dashboards, ad-hoc exploration. Business users live here.
  • Orchestration: scheduling, monitoring, and alerts, running alongside the other layers.

Data flows in one direction: up from the sources through the lake and warehouse to analytics, with orchestration keeping everything on schedule.

The Lake: your insurance policy

The lake stores raw data exactly as it arrived from source systems. No transformations, no cleaning, no modeling. Just immutable, append-only copies. This is your insurance policy. When (not if) you discover that a transformation was wrong, or a business rule changed retroactively, or someone needs data you didn't think was important six months ago, the lake lets you go back to the original and re-derive everything.

I've seen teams skip the lake to save time. Every single one regretted it within six months. The cost of storing raw data is trivial compared to the cost of losing it.
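To make "append-only" concrete, here's a minimal sketch of the landing step. It writes to local files for brevity; in practice this usually goes to object storage, and every path and name here is illustrative rather than the company's actual layout.

```python
import json
from datetime import date, datetime, timezone
from pathlib import Path

LAKE_ROOT = Path("/data/lake")  # illustrative; typically an object store bucket


def land_raw(source: str, table: str, records: list[dict], run_date: date) -> Path:
    """Write raw records exactly as extracted: append-only, partitioned by load date."""
    partition = LAKE_ROOT / source / table / f"load_date={run_date.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    # A new file per run; existing files are never rewritten or deleted.
    out = partition / f"{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.jsonl"
    with out.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, default=str) + "\n")
    return out
```

Because nothing in the lake is ever updated in place, re-deriving a warehouse table six months later is just a matter of replaying the files.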

The Warehouse: the single source of truth

The warehouse contains modeled, tested, documented analytical tables. This is where raw data becomes useful — cleaned, joined, enriched, and shaped for business questions. When someone asks "what was revenue last month?" there should be exactly one place to look, and the answer should be the same no matter who looks.

I structure the warehouse in three zones: staging (cleaned copies of source tables), intermediate (business logic joins and calculations), and marts (final tables optimized for specific business domains). This staging-intermediate-mart pattern isn't original, but it works because it makes the transformation logic traceable. When a number looks wrong, you can walk backwards from the mart to the staging table and find exactly where the logic diverged from reality.
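As a rough illustration of what that traceability looks like, here's a hypothetical chain for a cost-per-route question, written as warehouse-flavored SQL wrapped in Python. Every table and column name is invented for this sketch; it's the layering that matters, not the syntax of any particular warehouse.

```python
# Illustrative only: one traceable chain from staging to mart.
STAGING = """
CREATE OR REPLACE VIEW stg_orders AS
SELECT id AS order_id,
       route_id,
       created_at::date AS order_date,
       total_cents / 100.0 AS revenue
FROM raw.orders
WHERE deleted_at IS NULL
"""

INTERMEDIATE = """
CREATE OR REPLACE VIEW int_route_economics AS
SELECT o.route_id,
       o.order_date,
       SUM(o.revenue) AS revenue,
       SUM(c.cost)    AS cost
FROM stg_orders o
JOIN stg_route_costs c USING (route_id, order_date)
GROUP BY 1, 2
"""

MART = """
CREATE OR REPLACE TABLE mart_cost_per_route AS
SELECT route_id,
       order_date,
       revenue,
       cost,
       cost / NULLIF(revenue, 0) AS cost_ratio
FROM int_route_economics
"""


def rebuild_warehouse(conn) -> None:
    """Run the zones in dependency order: staging, then intermediate, then marts."""
    for statement in (STAGING, INTERMEDIATE, MART):
        conn.execute(statement)
```

When the cost ratio looks wrong, you read `MART`, then `INTERMEDIATE`, then `STAGING`, and somewhere along that walk the bad assumption shows itself.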

Orchestration: the nervous system

Orchestration handles scheduling, dependencies, monitoring, and alerting. It's the layer that ensures data flows through the platform reliably and that someone knows when it doesn't. At the logistics company, we started with a simple workflow orchestrator — nothing exotic. The key was getting dependency management right: a mart table shouldn't refresh until every staging table it depends on has successfully loaded.
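Stripped of any particular tool, the dependency rule looks something like the sketch below. It's a toy scheduler, not what we ran in production, and the task names are illustrative.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Hypothetical task graph: each key lists the tasks it depends on.
DEPENDENCIES: dict[str, set[str]] = {
    "stg_orders": set(),
    "stg_routes": set(),
    "stg_costs": set(),
    "int_route_economics": {"stg_orders", "stg_routes", "stg_costs"},
    "mart_cost_per_route": {"int_route_economics"},
}


def run_pipeline(tasks: dict[str, Callable[[], None]]) -> None:
    """Run tasks in dependency order; skip downstream work when an upstream task fails."""
    failed: set[str] = set()
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        if DEPENDENCIES[name] & failed:
            print(f"skipping {name}: upstream failure")
            failed.add(name)  # cascade the skip to its own downstreams
            continue
        try:
            tasks[name]()  # the actual load or transform for this table
        except Exception as exc:
            print(f"{name} failed: {exc}")  # in production: alert someone
            failed.add(name)
```

The point isn't the scheduler; it's that the mart never refreshes on top of a half-loaded staging layer.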

Analytics: where value is delivered

The analytics layer is where business users interact with data. Dashboards, reports, ad-hoc queries, self-service exploration. This is the only layer the business actually sees, which makes it the only layer they care about. Everything underneath exists to make this layer trustworthy and fast.

Why does the order matter? Because each layer depends on the one below it. You can't model warehouse tables without raw data in the lake. You can't build reliable dashboards without modeled tables in the warehouse. And you can't run any of it without orchestration keeping the pipeline alive. The layers aren't just organizational — they're a dependency chain.

The build sequence that actually works

Here's the biggest mistake I see in greenfield data platforms: building bottom-up. The team spends three months building a comprehensive lake layer. Then two months modeling the warehouse. Then they finally connect a BI tool and show the business a dashboard. Five months in, the business sees value for the first time, and half the models are wrong because nobody validated them against real questions.

The right approach is a vertical slice. Start with one business question — the one that causes the most pain — and build the thinnest possible path from source to dashboard. One source system. One lake table. One warehouse model. One dashboard. Prove value in week two, not month three. Then widen.

The wrong way, building horizontally: months one through three go to the entire lake (all four sources), months four and five to the entire warehouse, and analytics only appears at month five, when the business sees value for the first time. The right way, building vertical slices: slice one (one source, one lake table, one warehouse model, one dashboard) delivers value by week two; slices two and three widen coverage over weeks three through eight; each new slice widens the platform while delivering value immediately. Same total work, different order, completely different outcomes.

At the logistics company, the first slice was delivery cost per route. The COO was spending two hours every Monday morning manually calculating this from spreadsheet exports. We connected the relational database (one source), loaded the orders and routes tables into the lake, built a warehouse model that joined them with cost data, and put a dashboard in front of the COO. Two weeks. That Monday, she had the number in 30 seconds instead of two hours.

That first win bought us everything: executive buy-in, team patience, budget for the next slice. By week four we had three slices running. By week eight, the platform covered all four source systems and the finance team was using it for monthly close.

The horizontal approach would have delivered the same platform — eventually. But the vertical approach delivered value continuously, validated our models against real questions, and built the organizational trust that let us make bolder decisions later.

Migration: the hard part nobody warns you about

Architecture diagrams are clean. Migration is messy. Moving data from four heterogeneous source systems into a unified platform — without disrupting operations — is where most of the actual work happens. And it's where the most painful lessons are.

The migration architecture looks simple on paper: four heterogeneous sources (the relational DB with orders and routes, the document DB with drivers and routes, file storage full of CSVs and reports, and the legacy warehouse management system) feed a single extraction layer responsible for incremental extracts, schema mapping, idempotent loads, error handling, and data validation. From there, data lands in the lake, then the warehouse, then analytics, with monitoring (row counts, schema drift, freshness, null rates, anomalies) alerting when things break. The extraction layer is where most of the engineering complexity lives.

Here's what I learned migrating the logistics company's data:

Incremental extraction is non-negotiable

Never full-scan a production database on every pipeline run. Extract only what changed since the last run using timestamps, change data capture, or log-based replication. Full scans work on day one when you're loading history. After that, they'll crush your source system's performance and your pipeline's runtime. At the logistics company, the orders table had 40 million rows. A full scan took 90 minutes and caused visible latency in the product. Incremental extraction took 30 seconds.
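A minimal sketch of the idea, assuming a psycopg-style database connection and an updated_at column on the source table; both are assumptions for illustration, not a description of the company's actual pipeline.

```python
from datetime import datetime


def extract_incremental(conn, table: str, watermark: datetime):
    """Pull only rows changed since the last successful run, using an updated_at column."""
    cur = conn.cursor()
    cur.execute(
        f"SELECT * FROM {table} WHERE updated_at > %s ORDER BY updated_at",
        (watermark,),
    )
    rows = cur.fetchall()
    cols = [c[0] for c in cur.description]
    # Advance the watermark to the last updated_at we actually saw, not "now",
    # so rows written while the extract ran can't fall into a gap.
    if rows:
        watermark = rows[-1][cols.index("updated_at")]
    return rows, watermark
```

The watermark itself gets persisted somewhere durable between runs (a small state table works), so a restart picks up exactly where the last successful run left off.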

Idempotent pipelines save your sanity

Every pipeline must produce the same result whether you run it once or ten times. This means delete-and-replace by partition, not blind appends. When a pipeline fails halfway through (and it will), you need to re-run it without creating duplicates. The logistics company's legacy system had no reliable timestamps, which meant we couldn't do true incremental extraction. Instead, we did daily full snapshots with partition-level idempotency: each run replaced the entire day's partition, so re-runs were safe.
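Here's roughly what partition-level idempotency looks like, again assuming a psycopg-style connection; the table layout (a load_date column plus a raw payload) is illustrative.

```python
import json
from datetime import date


def load_partition(conn, table: str, run_date: date, rows: list[dict]) -> None:
    """Delete-and-replace one day's partition so re-running a failed job never duplicates data."""
    with conn:  # one transaction: readers never see a half-old, half-new partition
        with conn.cursor() as cur:
            cur.execute(f"DELETE FROM {table} WHERE load_date = %s", (run_date,))
            cur.executemany(
                f"INSERT INTO {table} (load_date, payload) VALUES (%s, %s)",
                [(run_date, json.dumps(row, default=str)) for row in rows],
            )
```

Run it once or run it ten times for the same date: the partition ends up identical either way.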

Schema mapping is the intellectual work

Connecting to a database and copying tables is the easy part. The hard part is understanding what the data means. The logistics company's relational database had a column called status on the orders table with values 1 through 7. No documentation. The original developer had left two years ago. It took three days of reading application code and interviewing operations staff to build a complete mapping. Multiply this by every table across four source systems, and you understand why migration timelines always blow past estimates.

Data validation catches what tests can't

Automated tests verify that transformations work correctly. Data validation verifies that the data itself makes sense. Row count checks (did we load roughly the same number of rows as yesterday?), null rate monitoring (did a column that's usually 99% populated suddenly drop to 80%?), range checks (are there negative values in a field that should always be positive?). We caught a schema change in the document store three hours after it happened because our null rate monitor fired. Without it, we'd have discovered the problem when the COO's dashboard showed wrong numbers — probably on Monday morning.
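A sketch of what those checks can look like when they run right after each load. The column names and thresholds here are examples, not the monitors we actually shipped.

```python
def validate_load(conn, table: str, yesterday_count: int) -> list[str]:
    """Cheap post-load sanity checks; any message returned here should fire an alert."""
    cur = conn.cursor()

    def scalar(sql: str) -> int:
        cur.execute(sql)
        return cur.fetchone()[0]

    problems: list[str] = []

    # Row count check: did we load roughly as many rows as yesterday?
    count = scalar(f"SELECT COUNT(*) FROM {table}")
    if yesterday_count and abs(count - yesterday_count) / yesterday_count > 0.20:
        problems.append(f"{table}: {count} rows vs {yesterday_count} yesterday (>20% swing)")

    # Null rate check: a usually-populated column suddenly going empty is a schema-change smell.
    nulls = scalar(f"SELECT COUNT(*) FROM {table} WHERE driver_id IS NULL")
    if count and nulls / count > 0.05:
        problems.append(f"{table}: driver_id null rate {nulls / count:.1%} exceeds 5%")

    # Range check: values that should never be negative.
    negatives = scalar(f"SELECT COUNT(*) FROM {table} WHERE cost < 0")
    if negatives:
        problems.append(f"{table}: {negatives} rows with negative cost")

    return problems
```

These checks are deliberately dumb. Their job isn't to prove the data is right; it's to notice, within hours, that something changed upstream.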

The team question: who maintains this after you leave?

A data platform without a team to maintain it is a time bomb. It works great for three months, then the first source system changes its schema, a pipeline breaks, nobody knows how to fix it, and the business goes back to spreadsheets.

At the logistics company, the team question was the hardest part of the engagement — harder than the architecture, harder than the migration. Here's what I learned:

You need at least one person from day one

Even if the company isn't ready to hire a full data team, one engineer needs to be embedded in the platform build from the start. Not reviewing PRs after the fact. Pair programming, making decisions together, understanding why things are designed the way they are. At the logistics company, we had a backend engineer who spent 50% of their time on the platform build. By week six, they could debug pipeline failures independently. By week twelve, they could build new pipelines from scratch.

Documentation is the handoff

Architecture decision records for every significant choice. Runbooks for common failure scenarios. A data dictionary for every table in the warehouse. A monitoring guide explaining what each alert means and how to respond. I wrote all of this as we built, not after. The documentation was the platform's immune system — the thing that kept it alive after I left.

The right first hire is an analytics engineer, not a data engineer

This is counterintuitive. The platform needs data engineering to run, but what the business needs is someone who can build new warehouse models and dashboards. An analytics engineer who can write SQL, build models, and talk to business stakeholders delivers more ongoing value than a data engineer who optimizes pipeline performance. Hire the analytics engineer first. The infrastructure should be stable enough to not need daily engineering attention — if it does, you built it wrong.

What I'd add today: a knowledge base

If I were building that logistics company's platform today, I'd add a fifth component that didn't exist in my original design: a structured knowledge base.

Every greenfield platform generates an enormous amount of institutional knowledge: why we chose this warehouse over that one, what the status column values mean, how the legacy system's export format changes during month-end processing, what happens when the document store's replication lag exceeds 30 seconds. This knowledge lives in people's heads, in Slack threads, in scattered documentation that nobody maintains.

A knowledge base captures all of it in a structured, searchable, cross-referenced format. Not a wiki that humans have to maintain (because they won't). An LLM-maintained knowledge base where you drop in source documents — architecture decision records, postmortems, meeting notes, schema documentation — and the LLM builds and maintains a structured wiki with confidence scoring, contradiction detection, and cross-references.

Think of it as two systems built side by side. The data platform (sources, lake, warehouse, analytics) is the system itself: tables, pipelines, dashboards. The knowledge base (architecture decisions, migration lessons, failure patterns, schema documentation) is the system's memory: context, decisions, history. One supplies the data, the other the context, and together they feed AI-enhanced operations. Build the platform and its memory; both compound.

Here's why this matters for greenfield platforms specifically:

  • Onboarding — when the first data hire joins and I leave, they don't have to reverse-engineer every decision. The knowledge base has architecture decision records explaining why the warehouse is structured this way, why we use partition-level idempotency instead of merge, why the legacy system gets a daily full snapshot instead of incremental extraction.
  • Debugging — when a pipeline breaks at 3 AM, the knowledge base has failure patterns: "When the document store's null rate spikes above 5% on the driver table, check for upstream replication lag before investigating the pipeline." These patterns are invisible in code but critical for operations.
  • Evolution — when the business asks for a new data source six months after I leave, the knowledge base has the integration playbook: how we approached each previous source, what the common pitfalls were, what the extraction patterns looked like. The next engineer doesn't start from zero — they start from the accumulated experience of the build.

I've written about this in detail in The Knowledge Base Strategy, and I've open-sourced a template you can use: knowledge-base-template. If you're building a data platform, build its memory alongside it. Both compound.

The bottom line

Building a data platform from zero is one of the most impactful things you can do for a growing company. It transforms decision-making from intuition to evidence, from monthly spreadsheets to real-time dashboards, from conflicting numbers to a single source of truth.

But the specifics matter more than the ambition. Here's what I'd tell anyone starting this journey:

  • Don't build until you're sure you need it. A BI tool with direct connections is fine for many companies. A platform introduces complexity you have to maintain forever.
  • Start with a vertical slice, not a horizontal foundation. One source, one question, one dashboard. Prove value in week two. Widen from there.
  • The four-layer architecture works. Sources, lake, warehouse, analytics. It's not the only way, but it's the simplest architecture that handles real-world complexity without overengineering.
  • Migration is the hard part. Budget twice the time you think you'll need. Schema mapping, data validation, and handling source system quirks are where the real engineering happens.
  • Plan the team from day one. The platform must survive your departure. Embed someone early, document relentlessly, hire an analytics engineer before a data engineer.
  • Build the knowledge base alongside the platform. Every decision, every failure, every schema quirk — capture it. The platform and its institutional memory compound together.

The simplest architecture that meets current needs with clear extension points. Don't design for Google scale on day one. Design for the next twelve months, with clean seams where future you can extend. The companies that get data right don't build the biggest platforms — they build the right-sized ones, and they build them with the wisdom to evolve.

Building a data platform from scratch?

I've done it multiple times — from architecture to implementation to team handoff. Let's talk about your situation.

Book a Discovery Call
Read the Case Study