I was brought in as a technical consultant to a major Brazilian e-commerce marketplace. The kind of company where millions of transactions flow through the platform every month, where a slow checkout page isn't an inconvenience but a revenue event, and where the engineering organization had grown fast enough that nobody was quite sure how many microservices they were running anymore.

The numbers told the story before anyone had to say it out loud. Four development teams, one infrastructure team, and a site uptime hovering at 92%. That sounds almost acceptable until you do the math: 8% downtime on an e-commerce platform means roughly 58 hours a month where customers can't buy things. In Brazilian e-commerce, where competition is fierce and customer loyalty is thin, those hours translate directly to lost revenue and eroded trust.

But the uptime number was a symptom, not the disease. The real problem lived in the bug tracker: over 200 open tickets, some dating back more than a year. Broken search filters that returned wrong results. Payment callbacks that silently failed under specific conditions. Admin panel features that had never worked correctly since launch. Edge cases in the inventory system that occasionally oversold products, creating fulfillment nightmares.

The engineers knew about all of it. They'd filed the bugs themselves. But ticket after ticket sat in the backlog, aging quietly, because there was always a feature to ship, always a deadline to hit, always a product manager with a roadmap that didn't have room for "fix the thing that's been broken for six months."

Morale was predictably low. Not the dramatic kind of low where people quit loudly. The quiet kind, where engineers stop caring about quality because they've learned that quality isn't what gets rewarded. Ship the feature. Hit the sprint goal. Move on. The bugs will still be there next quarter.

The diagnosis: small bugs, big backlog, zero permission

I spent my first week doing what I always do when I arrive somewhere new: I read. I read the bug tracker. I read the incident reports. I read the deployment logs. I talked to engineers on every team, not about what they were building, but about what was broken and why it hadn't been fixed.

The pattern was remarkably consistent. I categorized every open bug by estimated fix time, and the distribution was almost comically skewed:

The Bug Backlog: 200+ Tickets by Fix Time
  • Small (<2h): ~140 bugs (70%)
  • Medium (2-8h): ~40 bugs (20%)
  • Large (>1 day): ~20 bugs (10%)
Most debt is small and fixable. It just never gets permission to be fixed.

Seventy percent of the bugs were small. Trivially small. A missing null check. A wrong sort order. A CSS overflow on mobile. A misconfigured timeout. A hardcoded string that should have been a config value. Each one would take an experienced engineer less than two hours to fix, including testing.
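To make "less than two hours" concrete, here's a hypothetical example of the kind of fix that dominated the backlog, in the spirit of the misconfigured-timeout and hardcoded-value bugs above (the endpoint and names are invented for illustration, not taken from the actual codebase):

```python
# Hypothetical example of a "small" bug: a hardcoded timeout that should be config.
# Before the fix, the payment client used a fixed 2-second timeout and silently
# failed whenever the gateway responded slowly.
import os

import requests

# The fix: read the timeout from an environment variable with a sane default,
# so operations can tune it without a code change and a redeploy.
PAYMENT_TIMEOUT_SECONDS = float(os.environ.get("PAYMENT_TIMEOUT_SECONDS", "5.0"))


def confirm_payment(order_id: str) -> dict:
    """Call the payment gateway and return its JSON response."""
    response = requests.post(
        "https://payments.example.com/confirm",  # hypothetical endpoint
        json={"order_id": order_id},
        timeout=PAYMENT_TIMEOUT_SECONDS,
    )
    response.raise_for_status()
    return response.json()
```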

But in a feature-driven sprint cadence, two hours is a luxury. When the product team has committed to delivering three features this sprint, nobody is going to pull an engineer off feature work to fix a search filter that "mostly works." The cost of each individual bug is small. The accumulated cost of 140 small bugs is a platform that feels broken, an uptime number that embarrasses everyone, and an engineering team that has internalized the message that quality doesn't matter.

The bugs weren't hard. They were just never given permission to be fixed.

The hackathon: two days, one rule, no code review

I proposed something simple to the engineering leadership: give me the teams for two days. Two days where the only work allowed is bug fixes. No features. No roadmap items. No planning meetings. Just engineers, a bug backlog, and permission to fix things.

There was resistance, naturally. Two days of no feature work is an eternity in a startup-speed e-commerce company. Product managers calculated the feature velocity they'd lose. Engineering managers worried about sprint commitments. I made the case with the numbers: the bugs were costing more in aggregate — through downtime, customer complaints, workarounds, and engineer frustration — than two days of paused feature work. And the uptime number was becoming a board-level conversation.

They agreed. We called it a hackathon, because "bug fix sprint" doesn't inspire anyone, and framing matters.

The rules were deliberately minimal:

  • Only bugs. No features, no refactoring projects, no "while I'm in here I might as well." Fix the bug, close the ticket, move on.
  • Pick from the backlog. Every team works from the shared bug backlog. Pick what you know, pick what annoys you, pick what your users complain about. First come, first served.
  • No code review required. This was controversial. But the bugs were small, the engineers were experienced, and the bottleneck we needed to remove was friction. Trust the team. Ship directly.
  • Track everything on a board. A physical board (and a digital mirror) with three columns: To Do, In Progress, Done. Every squashed bug gets moved with a satisfying physical gesture. Visibility matters for momentum.

That last point — the board — turned out to be more important than I expected. There's something deeply motivating about watching a wall fill up with completed tickets over the course of a day. By lunch on day one, the Done column had overflowed its allotted space. Engineers were taking photos and posting them in the team chat. The energy in the office was different — people were laughing, high-fiving, competing informally to see which team could close the most tickets.

It felt like the early days of a startup, when shipping was fast and fixing things was the whole job. Because for two days, it was.

The results: 70% gone in 48 hours

By the end of day two, the numbers were hard to argue with:

2-Day Hackathon Results (before → after)
  • Bug backlog: 200+ → ~60 (a 70% reduction)
  • Uptime: 92% → 99% (+7 percentage points, measured over the following month)
  • Team morale: low → high

The bug backlog dropped from over 200 tickets to roughly 60. The remaining 60 were the genuinely hard bugs — the ones that required architectural changes, cross-team coordination, or deep investigation. The 140+ that were squashed were exactly what the analysis predicted: small, fixable issues that just needed someone to sit down and do the work.

Within a month, uptime climbed from 92% to 99%. Not because we'd rebuilt the infrastructure or deployed some new monitoring system. Because we'd fixed the hundred small things that were collectively dragging the platform down. A payment callback that silently failed once every thousand requests. A cache invalidation bug that served stale product data. A connection pool that wasn't sized correctly for peak traffic. Each one contributed a fraction of a percent of downtime. Together, they accounted for seven percentage points.

But the most important result wasn't in the metrics. It was in the conversations I overheard after the hackathon. Engineers saying things like: "I can't believe that bug was there for eight months — it took me forty minutes to fix." And: "Why don't we do this every month?" And, most tellingly: "I actually feel good about our codebase for the first time in a year."

The hackathon didn't just fix bugs. It fixed the team's relationship with their own code.

The deeper lesson: permission is the bottleneck

Here's what most people get wrong about tech debt. They think the problem is technical. That the codebase is too complex, the architecture is too tangled, the debt is too deeply embedded to address. And sometimes that's true — some tech debt really is structural and requires significant investment to resolve.

But most of it isn't. Most tech debt — the kind that drags down uptime, frustrates users, and demoralizes engineers — is a pile of small, individually fixable issues that accumulate because the organization's incentive structure doesn't reward fixing them.

Why Tech Debt Persists
  • Skills: not the bottleneck. Engineers already know how to fix the bugs.
  • Time: exists only if it's explicitly allocated.
  • Permission: the real bottleneck. Stop shipping features. Fix things.
The bottleneck isn't technical. It's organizational.

Engineers know what's broken. They filed the bugs. They've been working around them for months. They've mentally sketched the fixes during standup while listening to yet another feature spec. The skill is there. The knowledge is there. The motivation is there — engineers want to work on a codebase they're proud of.

What's missing is permission. Explicit, unambiguous, top-down permission to stop shipping features and fix things. Not "you can work on bugs if you have spare time at the end of the sprint" — that time never materializes. Not "we should really address tech debt soon" — "soon" is a word that means "never" in product roadmaps. Real permission. Two days. Everyone. Only bugs. No exceptions.

The hackathon worked not because it was a clever format. It worked because it was a permission structure. It gave engineers something they couldn't give themselves: legitimate, protected time to fix the things they already knew how to fix. The hackathon wasn't a technical intervention. It was a cultural one.

This is why I tell engineering leaders: if your team has a large bug backlog and mediocre uptime, the first question isn't "what's wrong with our architecture?" The first question is "when was the last time we gave our engineers permission to fix things?"

What I changed after the hackathon

A hackathon is a burst. It produces dramatic results, but bursts don't sustain. The 70% bug squash would have been a feel-good memory that faded within months if we hadn't built sustaining practices around it. Here's what we put in place:

Sustaining the Gains: From Burst to Practice
  1. Hackathon: the two-day burst that cut the backlog by 70%
  2. Code review policy: tests required on all PRs
  3. Intern training: mentored development practices
  4. Fix-it Fridays: recurring permission to fix things
  5. CD system on Kubernetes: fast, safe deploys

Code review policy with mandatory tests

Before the hackathon, code review was optional and tests were aspirational. We instituted a policy: every pull request requires at least one reviewer and must include tests for the changed behavior. This sounds basic — because it is. But "basic" and "implemented" are different things. The policy prevented the backlog from refilling at the same rate. New code was held to a higher bar. Bugs still happened, but fewer of them were the trivial kind that dominated the old backlog.
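A policy like this only sticks when it's enforced mechanically rather than by goodwill. As a rough illustration, here's a minimal sketch of the kind of CI gate that does it, assuming a layout where application code lives under src/ and tests under tests/ (the paths and the exact rule are illustrative, not the company's actual setup):

```python
#!/usr/bin/env python3
"""Minimal CI gate: fail the build if source code changed but no tests did.

Illustrative sketch; adjust the paths and base branch to your repository layout.
"""
import subprocess
import sys


def changed_files(base: str = "origin/main") -> list[str]:
    """List files changed on this branch relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        check=True, capture_output=True, text=True,
    )
    return [line for line in out.stdout.splitlines() if line.strip()]


def main() -> int:
    files = changed_files()
    touched_source = any(f.startswith("src/") and f.endswith(".py") for f in files)
    touched_tests = any(f.startswith("tests/") for f in files)

    if touched_source and not touched_tests:
        print("Source files changed but no tests were added or updated.")
        print("Policy: every PR that changes behavior must include tests.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Whether you enforce the rule with a script like this or with your CI platform's built-in checks matters less than the fact that the enforcement is automatic.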

Intern developer training

The company had several intern developers who were eager but hadn't been given structured mentorship. I set up a training program focused on testing practices, debugging methodology, and code review skills. This served two purposes: it developed junior talent (long-term investment) and it created additional capacity for maintaining code quality (short-term benefit). Within a few months, the interns were catching bugs in code review that would have previously made it to production.

Fix-it Fridays

Every other Friday afternoon was reserved for bug fixes and small improvements. No approval needed, no justification required. This was the hackathon's lasting legacy: recurring permission. It prevented the slow re-accumulation of small bugs that had created the original problem. It also gave engineers a predictable outlet for the quality improvements they wanted to make, which measurably improved morale.

Continuous deployment on Kubernetes

We built a proper CD pipeline on Kubernetes, replacing a manual deployment process that involved SSH-ing into servers and running scripts. This had two effects: it made deploying fixes fast and safe (minutes instead of hours, with automatic rollback), and it removed a hidden friction that had discouraged small fixes. When deploying is painful, engineers batch changes into large releases. When deploying is painless, they ship fixes as soon as they're ready.
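To give a flavor of what "fast and safe" means in practice, here's a hedged sketch of the core of a deploy step: point the deployment at a new image, wait for the rollout to become healthy, and roll back automatically if it doesn't. The deployment name, namespace, and container name ("app") are placeholders, and a real pipeline would run this inside CI rather than as a standalone script:

```python
"""Sketch of a deploy step: rolling update with automatic rollback.

Illustrative only; the deployment, namespace, and container names are placeholders.
"""
import subprocess
import sys


def run(cmd: list[str]) -> int:
    print("+", " ".join(cmd))
    return subprocess.run(cmd).returncode


def deploy(image: str, deployment: str = "checkout", namespace: str = "prod") -> int:
    # Point the deployment at the new image; Kubernetes performs a rolling update.
    # Assumes the pod template has a container named "app".
    if run(["kubectl", "-n", namespace, "set", "image",
            f"deployment/{deployment}", f"app={image}"]) != 0:
        return 1

    # Wait for the rollout to finish; a failed readiness probe or a crash loop
    # makes this command exit non-zero after the timeout.
    if run(["kubectl", "-n", namespace, "rollout", "status",
            f"deployment/{deployment}", "--timeout=120s"]) != 0:
        # Automatic rollback to the previous ReplicaSet.
        run(["kubectl", "-n", namespace, "rollout", "undo",
             f"deployment/{deployment}"])
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(deploy(sys.argv[1]))
```

These are the same three commands an engineer could run by hand; wiring them into the pipeline is what removes the friction.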

The pattern for other teams

I've since used variations of this approach at other companies. The specifics change — the backlog size, the team structure, the particular flavor of tech debt — but the pattern is consistent. Here's the playbook:

Step 1: Assess the backlog honestly

Categorize every open bug by estimated fix time. Be honest — don't inflate estimates to make the problem seem harder than it is. If 60-70% of your bugs are small, you have a permission problem, not a technical one. If most bugs are genuinely complex, you have a different problem (likely architectural) that a hackathon won't solve.
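If your tracker can export the backlog, the categorization itself is a few lines of scripting. A minimal sketch, assuming a CSV export with an estimate_hours column (the column name and the bucket boundaries are illustrative; adjust them to your tracker):

```python
"""Bucket a bug backlog by estimated fix time from a CSV export.

Assumes a CSV with an "estimate_hours" column; adapt names to your tracker.
"""
import csv
from collections import Counter


def bucket(hours: float) -> str:
    """Map an estimate to the small/medium/large buckets used in the analysis."""
    if hours < 2:
        return "small (<2h)"
    if hours <= 8:
        return "medium (2-8h)"
    return "large (>1 day)"


def summarize(path: str) -> Counter:
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[bucket(float(row["estimate_hours"]))] += 1
    return counts


if __name__ == "__main__":
    counts = summarize("bugs.csv")
    total = sum(counts.values())
    for name, n in counts.most_common():
        print(f"{name}: {n} bugs ({n / total:.0%})")
```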

Step 2: Get leadership buy-in for two days

This is the hardest step. Present the data: the backlog size, the uptime cost, the morale impact. Calculate what the bugs are costing in concrete terms — downtime minutes, customer complaints, engineer time spent on workarounds. Two days of paused feature work is almost always cheaper than the ongoing cost of the debt. Make that case with numbers, not feelings.
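The comparison is back-of-the-envelope arithmetic, and that's all it needs to be. A sketch with placeholder numbers (substitute your own revenue and staffing figures):

```python
# Back-of-the-envelope comparison: ongoing cost of the debt vs. a 2-day hackathon.
# All numbers below are illustrative placeholders, not the client's actual figures.

downtime_hours_per_month = 58        # e.g. 8% of a ~730-hour month
revenue_per_hour = 5_000             # average revenue lost per hour of downtime
engineers = 30
loaded_cost_per_engineer_day = 400   # salary plus overhead, per engineer per day

monthly_downtime_cost = downtime_hours_per_month * revenue_per_hour
hackathon_cost = engineers * loaded_cost_per_engineer_day * 2

print(f"Downtime cost per month: ${monthly_downtime_cost:,.0f}")
print(f"Two-day hackathon cost:  ${hackathon_cost:,.0f}")
print(f"The hackathon pays for itself once it prevents "
      f"{hackathon_cost / revenue_per_hour:.1f} hours of downtime.")
```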

Step 3: Run the hackathon with minimal rules

Keep it simple. Only bugs. Pick from the backlog. Track on a board. Don't over-organize it — the whole point is removing friction. The energy and momentum will emerge naturally once engineers realize they actually have permission to fix things.

Step 4: Celebrate and measure

Make the results visible. Share the before/after numbers with the whole company. Let engineers talk about what they fixed and how long it took. The "I can't believe that was only a 20-minute fix" stories are the most powerful advocacy for making this a regular practice.

Step 5: Build sustaining practices

This is where most teams fail. The hackathon produces a burst of progress, everyone feels great, and then three months later the backlog is right back where it started. The sustaining practices — code review requirements, test mandates, recurring fix-it time, fast deployment pipelines — are what prevent regression. Without them, you're just doing a feel-good exercise that you'll need to repeat every quarter.

What goes wrong when you skip steps

Skip the assessment, and you'll waste the hackathon on complex bugs that can't be fixed in two hours — demoralizing instead of energizing. Skip the leadership buy-in, and the hackathon gets interrupted on day one by "urgent" feature requests. Skip the sustaining practices, and you'll be back to 200 bugs within six months. Every step matters. The hackathon itself is just the catalyst.

Capture what you learn, or lose it

A hackathon produces a burst of institutional knowledge. Over two days, engineers touch parts of the codebase they haven't looked at in months. They discover why things are broken, how the breakage manifests, and what the fix looks like. They find hidden dependencies, undocumented behaviors, and code that nobody remembers writing.

If you don't capture that knowledge, it evaporates. The bugs are fixed, but the understanding of why they existed and what they revealed about the system disappears back into individual engineers' heads, where it will be forgotten within weeks.

At the company, we required every engineer to write a brief note for each bug they fixed: what was the root cause, what was the fix, and what did it reveal about the system. These notes went into a shared knowledge base. Some of them turned out to be more valuable than the bug fix itself — "this module has no test coverage and a hardcoded dependency on the payment gateway" is a note that prevents future bugs, not just fixes past ones.
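One lightweight way to keep those notes uniform is a small helper that renders them into the knowledge base. A hypothetical sketch, with fields mirroring the three questions above (the file layout and ticket ID are illustrative):

```python
"""Append a structured bug-fix note to a shared knowledge base file.

Hypothetical helper; the fields mirror the three questions asked of each fix.
"""
from dataclasses import dataclass


@dataclass
class FixNote:
    ticket: str
    root_cause: str
    fix: str
    revealed: str  # what this bug taught us about the system

    def to_markdown(self) -> str:
        return (
            f"## {self.ticket}\n"
            f"- Root cause: {self.root_cause}\n"
            f"- Fix: {self.fix}\n"
            f"- What it revealed: {self.revealed}\n"
        )


def append_note(note: FixNote, path: str = "knowledge-base.md") -> None:
    with open(path, "a") as f:
        f.write(note.to_markdown() + "\n")


if __name__ == "__main__":
    append_note(FixNote(
        ticket="BUG-1187",  # illustrative ticket ID
        root_cause="Stale cache entry served after product price updates",
        fix="Invalidate the product cache key on price change events",
        revealed="The catalog module has no integration tests around caching",
    ))
```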

This is a pattern I've written about before: building a knowledge base that compounds your team's understanding of the system over time. A hackathon is a particularly concentrated source of that knowledge. Don't let it go to waste.

The real story

People remember the numbers: 70% of bugs squashed, uptime from 92% to 99%. Those numbers are real and they mattered to the business.

But the real story of that hackathon is simpler. A room full of engineers already knew what was broken. They already knew how to fix it. They'd been carrying that knowledge around for months, watching the bug count climb, feeling the slow erosion of pride in their work.

All they needed was someone to say: "Stop. Fix it. You have permission."

If your team has a growing bug backlog and declining quality metrics, ask yourself honestly: is the problem that the bugs are too hard to fix? Or is the problem that your organization has never given anyone explicit permission to fix them?

The answer is almost always the latter. And the fix is almost always simpler than you think.

Tech debt slowing your team down?

Sometimes the fix isn't technical — it's structural. I help teams diagnose and address the real bottlenecks.

Book a Discovery Call