The Clean Data Trap: Why Your AI Strategy is Hiding Your Business Failures
This article was originally published on LinkedIn on January 24, 2026. Read it there.
When we say "data quality," what problem are we actually trying to solve?
A January 21, 2026 Fortune article quoted a Google Cloud exec calling this moment a "Great Reset"—deterministic software giving way to probabilistic systems. My takeaway is simple: as AI gets less deterministic, teams are trying to compensate by making the data feel more deterministic. That's where the clean-data obsession starts.
Data Cleaning can feel like Inventory Fraud
Companies are trying to move fast on AI and discovering that the data they have doesn't match what their roadmaps assume. Gartner has flagged the same pattern: projects get abandoned when "AI-ready" data never shows up.
The typical response is a "cleanup program": scrub outliers, impute missing values, standardize everything into a single "trusted" table. That creates a failure mode of its own, where the dashboard looks clean while the business stays broken. Where I've seen it most is late-arriving price and cost updates that get "fixed" downstream instead of being traced back to the source job.
I think about it like a warehouse stockout. If a shelf is empty, a good operation doesn't pretend it isn't. A stockout is a signal: a supplier didn't ship, receiving didn't post, replenishment logic is wrong, or a system stopped emitting events. You don't "fix" a stockout by placing an empty box on the shelf so the report looks complete.
In data, we do the equivalent all the time.
A missing price, a null cost, or a bizarre margin outlier isn't just "dirty data." It's often a symptom of an operational defect. When you impute the missing price so the model trains, or winsorize the outlier so the curve looks well-behaved, you aren't improving reality. You're deleting the most diagnostic evidence you have.
If the null disappears and nothing gets paged, triaged, or fixed upstream, you didn't improve governance. You suppressed the signal.
Inventory fraud (definition): any transformation that turns a missing business signal into a plausible value without preserving a traceable defect record (and an owner upstream).
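To make the definition concrete, here is a minimal Python sketch of the opposite behavior: a missing price still gets a containment value, but the original null, an imputed flag, and a defect record with an upstream owner all survive. The `DefectRecord` type, the field names, and the `pricing_feed` / `pricing-team` labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DefectRecord:
    # Hypothetical structure: one traceable entry per missing business signal.
    record_id: str
    field: str
    source_job: str  # the job that should have supplied the value
    owner: str       # who is accountable for the upstream fix


def impute_with_trace(row: dict, field: str, fallback: float,
                      source_job: str, owner: str):
    """Fill a missing value without erasing the evidence that it was missing.

    Returns (patched_row, defect), where defect is None if nothing was missing.
    """
    if row.get(field) is not None:
        return {**row, f"{field}_imputed": False}, None

    patched = {
        **row,
        f"{field}_original": None,  # the stockout stays visible
        field: fallback,            # containment value so downstream jobs run
        f"{field}_imputed": True,
    }
    defect: Optional[DefectRecord] = DefectRecord(
        record_id=row["id"], field=field, source_job=source_job, owner=owner
    )
    return patched, defect


row, defect = impute_with_trace(
    {"id": "sku-1", "price": None}, "price", fallback=9.99,
    source_job="pricing_feed", owner="pricing-team",
)
print(row)
print(defect)
```

The point of returning the defect alongside the patched row is that the caller has to do something with it: log it, page someone, or open a ticket. Under this sketch, an imputation that produces no defect record would count as inventory fraud.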
Does Masking Work?
The problem isn't just operational; it's technical too. Niloofar Mireshghallah has shown that surface-level sanitization (masking names or redacting identifiers) is often an illusion. AI models are inference engines; they can see through the "mask" to the underlying signal.
The same applies to business data. "Cleaning" the mess doesn't remove the risk; it just masks the signal. As Mireshghallah noted on The Information Bottleneck podcast, models "roll with" the information they are given. If we give them sanitized, "perfect" data that doesn't reflect our messy reality, we are training our business to lie to itself.
Here's the shift I'm arguing for:
Refinery mindset: "If it's messy, delete it." → clean dashboards, hidden defects, brittle models.
Diagnostic lab mindset: "If it's messy, trace it." → visible failure modes, accountable owners, better tradeoffs.
AI as a diagnostic tool
Stop treating AI as a sanitizer that turns bad inputs into good outputs. Use it like a Diagnostic Lab that tells you where the process is failing.
Pick your "Tier-1" Fields: Identify decision drivers (price, cost, identity keys). Define a Defect Budget. If the share of missing values in a Tier-1 field crosses 0.5%, it's not a modeling issue; it's an operational incident with a named owner.
Stop the Rework: If downstream cleaning consumes >40% of engineering hours, freeze the AI roadmap and force upstream fixes. Cleaning should be a temporary containment step, not a default operating model.
Value the Stockout: Don't delete the "mess." Keep the imputed value if you must, but store the original null and an imputed=true flag.
Trace Provenance: If you can't say where a record came from and what touched it, quarantine it from automation. "AI-ready" must mean "traceable," not "sterile."
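The thresholds in the checklist above can be sketched as a few small checks. This is an illustration under stated assumptions, not a prescribed implementation: the 0.5% and 40% figures come from the text, while the field-to-owner mapping, the `source` and `lineage` keys, and all function names are hypothetical.

```python
# Hypothetical Tier-1 fields mapped to the owner who gets the incident.
TIER1_OWNERS = {"price": "pricing-team", "cost": "finance-ops"}
DEFECT_BUDGET = 0.005  # 0.5% missing values per Tier-1 field
REWORK_LIMIT = 0.40    # 40% of engineering hours spent on downstream cleaning


def missing_rate(rows, field):
    """Share of rows where the field is absent or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is None) / len(rows)


def budget_breaches(rows):
    """Return (field, owner, rate) for every Tier-1 field over its defect budget."""
    breaches = []
    for field, owner in TIER1_OWNERS.items():
        rate = missing_rate(rows, field)
        if rate > DEFECT_BUDGET:
            breaches.append((field, owner, rate))
    return breaches


def should_freeze_roadmap(cleaning_hours, total_hours):
    """True when downstream cleaning exceeds the rework limit."""
    return total_hours > 0 and cleaning_hours / total_hours > REWORK_LIMIT


def automation_eligible(row):
    """Quarantine any record that cannot say where it came from or what touched it."""
    return bool(row.get("source")) and bool(row.get("lineage"))


# 1,000 rows: 10 missing prices (1.0%) and 2 missing costs (0.2%).
rows = [{"price": 1.0, "cost": 0.5, "source": "erp", "lineage": ["load"]}
        for _ in range(1000)]
for r in rows[:10]:
    r["price"] = None
for r in rows[10:12]:
    r["cost"] = None

for field, owner, rate in budget_breaches(rows):
    print(f"INCIDENT: {field} missing rate {rate:.2%} over budget, owner={owner}")
```

In this sample only the price field breaches the budget, so only its owner is named in an incident; the cost field stays under 0.5% and is left alone.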
If "AI-ready" means "no nulls," you'll ship models trained on fiction. If it means "traceable defects with accountable owners," you'll ship systems that improve reality.
To be continued...