FailHub – Issue #4
Escape from the Monolith, Green CI and Red Production, The Zombie Project
FailHub helps people avoid mistakes by learning from others who already made them.
If your experience can help someone else, share it here.
Every week, FailHub covers three real failures from three different people working in tech, each with their own experience and perspective.
Let’s start with the first one. Enjoy ☺️
Failure #1: The Playground That Quietly Replaced the Main Project
What happened
Our Android app had grown into a monster. Three years of active development, fifteen different engineers rotating through the team, a multi-layer architecture that made sense when we had five features but now felt like overkill for everything. Feature flags everywhere—some dating back eighteen months with nobody remembering what they controlled. Analytics calls on every screen transition. A/B experiments running on top of other experiments. And buried deep in the codebase, chunks of legacy code with comments like “TODO: refactor this” from 2019.
Just launching the app in debug mode took four minutes on my MacBook Pro. Not an exaggeration—I timed it multiple times. Getting to a specific screen deep in the navigation stack? Add another three minutes of clicking through. Want to test a change to the checkout flow? Better hope you remember the exact combination of feature flags and user states needed to even see that screen.
In the beginning, I did what everyone does—I experimented directly in the main codebase. Need to try out Jetpack Compose? I’d add a new composable behind a temporary feature flag, wire it into an existing screen, and see what happens. Evaluating a new networking library? Bolt it onto a real API call and hope it doesn’t break production monitoring. Testing a different navigation pattern? Modify three different navigation graphs, update two view models, and pray you don’t accidentally ship it.
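To make that concrete, a “temporary” Compose experiment wired in behind a flag tends to look something like the sketch below. The flag and composable names are invented for illustration, not our real code.

```kotlin
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable

// Invented flag holder for illustration; real apps usually read this from remote config.
object FeatureFlags {
    var composeCheckoutHeader = false // "temporary", flipped locally while experimenting
}

@Composable
fun CheckoutHeader(legacyHeader: @Composable () -> Unit) {
    if (FeatureFlags.composeCheckoutHeader) {
        Text("New Compose header (experiment)") // the one thing I actually wanted to try
    } else {
        legacyHeader() // everything else the screen drags along with it
    }
}
```

Harmless in isolation, but every experiment like this adds one more flag and state combination that someone has to remember later.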
Every single experiment felt like moving through concrete. Build times averaged twelve minutes for a clean build, seven for incremental. The feedback loop was agonizing. Make a change, wait for the build, launch the app, wait for it to start, navigate to your screen, test your change, find a bug, repeat. One iteration could easily take twenty minutes. And God forbid you made a mistake that crashed the app—now you’re debugging through layers of infrastructure code that has nothing to do with what you’re actually trying to learn.
The worst part? The constant fear. Every time I added experimental code, I worried about breaking something unrelated. Did I accidentally leave a debug flag on? Will this crash for users on older Android versions? Is this going to conflict with Sarah’s feature that’s supposed to ship next week? The mental overhead killed any joy in exploration.
Over six months, I noticed I was learning slower and slower. Not because I was getting dumber or because the Android APIs were harder—but because the environment had become hostile to curiosity. Testing a simple idea required deep knowledge of the entire app’s internals, constant context switching between different systems, and this perpetual anxiety about breaking things.
One Friday afternoon, after spending three hours just trying to get a sample project for a new animation library working inside our app, I snapped. I thought, “Why am I doing this?” I cloned the library’s GitHub repo, opened their sample app, and had it running in ninety seconds. No feature flags. No navigation setup. No build configuration nightmares. Just the code I cared about.
That moment changed everything.
I started creating tiny, disposable projects for every experiment. One project per idea. Want to understand Kotlin Coroutines better? New project. Evaluating a PDF rendering library? New project. Testing if ExoPlayer can handle our video requirements? New project. No architecture debates with the team. No backward compatibility concerns. No “but what about users on Android 8?” Just enough code to answer one specific question: does this work the way I think it does?
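A playground like that can be a single file. As an illustration (not one of my actual repos), here is roughly what a coroutines playground boils down to: one main(), one question.

```kotlin
import kotlinx.coroutines.cancelAndJoin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// The one question: does cancelling the parent job actually stop the child's work mid-loop?
fun main() = runBlocking {
    val job = launch {
        repeat(10) { i ->
            println("working... $i")
            delay(200) // suspension point, so cancellation can take effect here
        }
    }
    delay(500)
    job.cancelAndJoin() // expect the loop to stop after two or three iterations
    println("cancelled, done")
}
```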
Some of these projects lived for two hours. Others survived a weekend. When evaluating a new library, I’d clone the official sample, break it, modify it, abuse it until I understood its limits. I’d test APIs in complete isolation, without production constraints, without analytics, without any of the cruft that muddies the results. I could iterate in minutes instead of hours.
The surprising part? This became a permanent habit. These playground projects turned into my personal laboratory. I accumulated maybe forty of them over a year—small, focused repositories where I could safely explore. When something proved genuinely valuable and stable, only then would I carefully port it into the production app, armed with a real understanding of the trade-offs.
Even better—a few of these experiments evolved into actual side projects. One became a personal finance tracker that I use daily. Another turned into a small app for photographers that now has 5,000 downloads and makes $200/month in passive revenue. They started as throwaway code for learning and became real products.
My teammates noticed the change. “How do you know so much about this library already?” they’d ask when I’d propose something new. The answer was always the same: I’d already spent a weekend breaking it in a safe environment.
The lesson
Large, mature codebases are terrible environments for learning new things. The overhead is too high, the feedback loops too slow, the fear of breaking things too paralyzing.
When you’re exploring new ideas, fast feedback and isolation beat realism every single time. You learn faster in a simplified environment because you can iterate faster. You retain more because you’re not drowning in unrelated complexity.
Don’t fight your production codebase when you’re trying to learn. Create playgrounds. Make them disposable. Embrace their simplicity. Learn quickly, then apply carefully.
The production app will still be there when you’re ready. But you’ll approach it with actual knowledge instead of hopeful guessing.
Failure #2: The Release That Passed Every Check Except Reality
What happened
From the team’s perspective, the 4.2 release was picture-perfect. We’d spent three weeks on it. The code compiled without warnings. Android Studio’s lint checker showed zero issues—I’d personally gone through and fixed the last three warnings the day before. Our unit test suite hit 78% coverage and every test passed. The CI pipeline on GitHub Actions? All green checkmarks. Six different automated checks, all successful.
I’d tested the app thoroughly on my Pixel in debug mode. Scrolled through every screen. Tried the new share functionality we’d added. Verified the crash fixes from user reports. Everything worked beautifully. The pull request had approvals from two senior engineers. QA had signed off after their pass through the staging build.
On paper, we were absolutely ready to ship.
But I’d been burned before. Three years of releasing Android apps had taught me a painful truth: the worst production bugs don’t come from obvious mistakes that lint tools catch. They come from the weird edge cases. The differences between debug and release configurations. OS quirks that only manifest on certain devices. Assumptions that seem bulletproof until they meet real-world conditions.
Let me tell you about release 3.8, eight months prior. That release had also looked perfect internally. We shipped it to 50,000 users on a Friday afternoon (yeah, Friday afternoon). By Saturday morning, the crash reports started flooding in. Turns out ProGuard—the tool that shrinks and obfuscates release builds—had aggressively stripped out a reflection-based serialization library we were using. The debug build worked fine because ProGuard doesn’t run on debug builds. We’d tested the release build in the emulator, but we’d forgotten one critical step: testing on actual devices with the actual ProGuard configuration we’d ship.
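For what it’s worth, this class of failure usually comes down to missing keep rules. Something along these lines is what keeps the shrinker’s hands off reflection-visible classes; the package name is a placeholder and the exact rules depend on the serialization library you use.

```
# Keep metadata that reflection-based serializers rely on
-keepattributes Signature, *Annotation*, EnclosingMethod, InnerClasses

# Keep the model classes the serializer inspects via reflection
# (com.example.app.model is a placeholder package)
-keep class com.example.app.model.** { *; }
```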
We spent the entire weekend rolling back, fixing the ProGuard rules, and pushing a hotfix. I got maybe six hours of sleep across two days. Users were pissed.
Or the background service issue in release 3.5. On our test devices, background sync worked great. On real user devices, especially Samsung phones with aggressive battery optimization, our sync service kept getting killed. We’d tested on Pixels. Our users had Samsungs, Xiaomis, and Huaweis, each with its own battery management quirks.
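There is no complete fix for OEM task killers, but a minimal check like the one below at least tells you whether the app is exempt from the stock battery optimizations. It is a sketch using only standard Android APIs (available from API 23); Samsung, Xiaomi, and Huawei each layer their own battery management on top, which this cannot detect.

```kotlin
import android.content.Context
import android.os.PowerManager

// Returns true if the user has exempted this app from battery optimizations (API 23+).
// This covers stock Doze/App Standby only; OEM-specific task killers are a separate problem.
fun isExemptFromBatteryOptimizations(context: Context): Boolean {
    val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    return pm.isIgnoringBatteryOptimizations(context.packageName)
}
```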
After enough of these disasters, I adopted a non-negotiable rule for myself and pushed hard for the team to adopt it too.
Before any release goes out, the app must be tested manually in its exact release configuration. Not debug. Not a “release-like” build. The actual, signed APK that will be uploaded to the Play Store, with the same ProGuard rules, the same signing configuration, the same build optimizations.
It must be tested in at least three scenarios:
Minimum supported OS version on a physical device
Maximum/current OS version on a physical device
At least one mid-range device that isn’t a Pixel (usually a Samsung)
And here’s the critical part: I never built these release candidates on my own machine. I ran a separate CI job on a clean machine that:
Built the release APK from scratch
Ran lint and all automated tests
Generated a test report
This removed the “works on my machine” bias completely. I’ve seen too many cases where some local environment quirk made things pass that shouldn’t have.
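A sketch of what that job might look like on GitHub Actions, the CI we already used. The module name, Java version, and artifact paths are assumptions, not our exact workflow.

```yaml
name: release-check
on:
  workflow_dispatch: # triggered manually before a release

jobs:
  release-build:
    runs-on: ubuntu-latest # a clean machine every run, no local quirks
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '17'
      - name: Build the release APK from scratch
        run: ./gradlew clean assembleRelease
      - name: Run lint and all automated tests
        run: ./gradlew lintRelease testReleaseUnitTest
      - name: Publish the APK and reports for manual device testing
        uses: actions/upload-artifact@v4
        with:
          name: release-artifacts
          path: |
            app/build/outputs/apk/release/
            app/build/reports/
```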
For release 4.2, I followed this process religiously. Built the release APK, installed it on my old Samsung Galaxy. Launched it. Immediately noticed that our new splash screen animation stuttered badly—something about hardware acceleration not being enabled properly in the release configuration. This would have been our users’ first impression. The debug build? Silky smooth.
Dug deeper. Found that one of our dependencies had a release-specific initialization that was blocking the main thread for ~800ms. Only visible in release builds because of how the code was optimized.
Fixed it. Built again. Tested again. This time I found that on an older Android version, a share intent we’d added was crashing because we were calling an API introduced above our minimum supported SDK. Lint missed it because of how we’d wrapped the call: it looked safe to the tooling, but it broke on real devices.
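The durable fix, beyond that one call site, is to gate such calls on an SDK_INT check lint can actually follow, either inline or through a helper annotated for it. A sketch with an illustrative API level and function names:

```kotlin
import android.os.Build
import androidx.annotation.ChecksSdkIntAtLeast

// Lint's NewApi check understands direct SDK_INT comparisons and helpers annotated with
// @ChecksSdkIntAtLeast; bury the check inside an opaque wrapper and lint can no longer follow it.
@ChecksSdkIntAtLeast(api = Build.VERSION_CODES.TIRAMISU)
fun supportsNewShareApi(): Boolean =
    Build.VERSION.SDK_INT >= Build.VERSION_CODES.TIRAMISU

fun share() {
    if (supportsNewShareApi()) {
        // call the API that needs the newer SDK here
    } else {
        // fall back to the older code path
    }
}
```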
Fixed that too. Third build. Third test pass. Finally clean.
This process probably added six hours to our release timeline. The team sometimes got impatient. “We’ve already tested this,” they’d say. “CI is green. Let’s just ship it.”
But the cost of skipping this process was always higher. Way higher. Emergency weekend hotfixes. User complaints. Reputation damage. App store rating drops. The six hours of manual testing was insurance.
The lesson
Automated tooling increases confidence, and you should absolutely use it. Lint checkers, static analyzers, unit tests, CI pipelines—they’re all valuable. They catch many bugs.
But they don’t catch everything. They can’t simulate the chaos of real devices, real OS versions, real user behaviors, real network conditions.
Tooling tells you what’s probably fine. Reality decides what actually works.
If you don’t test the exact artifact your users will install, in the exact conditions they’ll experience, you’re not testing—you’re guessing. And guessing is how you spend weekends firefighting production issues instead of enjoying your life.
Test the real thing, on real devices, in real conditions. Every single time. Make it non-negotiable.
Failure #3: The Project That Survived Because No One Killed It
What happened
The project kicked off with genuine energy. It was February 2023, and we’d identified a real problem: our customer onboarding flow had a 43% drop-off rate. Users would start creating an account, hit our identity verification step, and just... vanish. We were losing nearly half of our potential customers at the gate.
The initial pitch made sense. We’d redesign the verification flow, maybe integrate a third-party KYC service, A/B test different approaches, and improve that metric. The exec team greenlit an exploration phase: “Spend six weeks understanding the problem deeply, then come back with a concrete proposal.”
Perfect. Reasonable. Smart, even.
The team was solid—me as tech lead, two senior engineers, a designer, and a product manager who’d successfully shipped three major features in the past year. We had momentum. We had data. We had a clear problem to solve.
Week one, we dug into analytics. Interviewed customers who’d abandoned the flow. Researched competitor approaches. The findings were interesting but not conclusive. Some users complained about the number of steps. Others were confused about why we needed certain documents. Some just didn’t trust uploading their ID.
Week four, Michael, our product manager, suggested we needed more data before committing to a direction. “Let’s talk to ten more users. Let’s see what our enterprise customers think—they might have different needs.” Made sense. We extended the exploration to eight weeks.
Week eight rolled around. We presented three possible approaches to the stakeholders: streamline the current flow, integrate a third-party service, or build a progressive verification system where we verify in stages. The room nodded along. Lots of good questions. But then: “This is great research, but we need to validate the cost-benefit of each approach. Can you put together detailed estimates and project the impact on that 43% drop-off rate?”
Another month of work. We built financial models. We estimated engineering effort. We projected conversion improvements—which was honestly guesswork dressed up in spreadsheets, but everyone wanted numbers.
By week sixteen, we’d pivoted three times. First it was about the verification flow. Then it became about the entire onboarding experience. Then someone from marketing joined our meetings and suddenly we were also talking about brand trust and messaging strategy. The scope had quietly tripled.
Here’s the insidious part: at no point did anyone say, “Stop. This is taking too long.” Everyone had good intentions. Michael was being thorough. The stakeholders were being prudent. Leadership was being strategic. But nobody owned the final decision. Not really.
We had a steering committee with seven people. Every meeting, someone would raise a new concern: “But what about mobile users?” “Have we considered accessibility?” “What’s the plan for international users?” All valid questions. All requiring more research.
Month five. We were still in “exploration.” The team’s energy had shifted from excited to exhausted. We’d hold meetings where we’d discuss the same trade-offs we’d discussed four weeks earlier, just with slightly different data. Our Slack channel went from active brainstorming to passive updates.
The CEO asked about progress in an all-hands. Michael gave a polished update: “We’re being rigorous in our approach. We want to make sure we get this right.” More nods. More time.
Month seven. I started noticing something. The project had become unkillable. Not because it was succeeding, but because stopping it would require someone to stand up and say, “We’re pulling the plug.” That’s an uncomfortable conversation. It means admitting the time was wasted. It means hard questions about why we spent seven months without a decision.
So the project just... continued. It survived by inertia.
Month eight, leadership finally pushed for a decision. We had to ship something. We’d burned too much time to walk away empty-handed. We picked the “safest” option: a moderate redesign of the existing flow with some UX improvements. Not revolutionary. Not the third-party integration we’d explored. Not the progressive verification system that might have been genuinely innovative.
We shipped it in month ten. It worked. It was fine. The drop-off rate improved from 43% to 39%—a measurable improvement, but nothing like what we’d hoped for originally. More importantly, it wasn’t what the project had initially promised. We’d started with ambitious goals and shipped a compromise.
Here’s what haunts me: the failure wasn’t technical. The code was good. The team was capable. The execution was solid. The design was thoughtful.
The real failure happened in month two, when we didn’t lock down scope. And in month four, when we didn’t assign a single decision-maker with real authority. And in month six, when we didn’t force a go/no-go decision.
By the time we realized we’d been drifting, we’d invested too much to justify a reset. Sunk cost fallacy in action.
Looking back, if we’d been forced to decide in week eight—really decide, with clear ownership and commitment—we’d either have shipped something great by month four, or we’d have killed the project and reallocated the team to something more valuable. Instead, we spent ten months delivering mediocrity because no one had the authority or courage to force clarity.
The lesson
Ambiguity is a slow poison for projects. It doesn’t kill them quickly—it keeps them alive in this zombie state where they’re not dead but they’re not really alive either.
Every project needs three things locked down early, and I mean week two or three early:
Concrete scope: Not “improve onboarding” but “reduce drop-off at the verification step by 15% by implementing X, Y, and Z.” Specific enough that you can tell if you’re on track or off track.
Single decision-maker: Not a committee. One person who can say “yes, we’re doing this” or “no, we’re stopping.” Committees can advise. One person decides.
Forced decision point: A date on the calendar when the decision must be made. Not “when we have enough data” because you’ll never feel like you have enough data. A specific date. And when that date comes, you decide—even if it feels premature.
If you don’t force these decisions early, the project will drift. It’ll survive through ambiguity and momentum. Activity will substitute for progress. Meetings will multiply to fill the vacuum left by missing clarity.
And you’ll burn months or years on something that either should have been killed early or should have shipped fast.
The project won’t decide for you. But if you don’t decide, inertia will, and the outcome won’t be what you hoped for.
If this issue helped you think differently, I’d appreciate a like ❤️
FailHub grows through shared experience, so if you have a story to contribute, you’re always welcome.
Please feel free to use the comments to discuss this issue.
What would you have done differently?


