Most teams do not need a “dark factory.” They need a better way to prove that AI-generated changes are safe to ship.
That is the core argument of this piece:
- Code generation is improving fast.
- Verification is still the bottleneck.
- The winning teams will invest in evidence systems, not just better prompting.
If this “dark factory” conversation is new to you, start with these two posts first:
- Dan Shapiro on the five levels from spicy autocomplete to software factory: danshapiro.com
- Simon Willison’s analysis of StrongDM’s approach and where verification becomes the real story: simonwillison.net
The recent “software factory” discussions making their way across the many AI and engineering blogs is useful because it makes this bottleneck visible. Dan Shapiro gives a clear maturity model for AI-assisted development, from “spicy autocomplete” to full autonomy. Simon Willison’s analysis of StrongDM’s work highlights the harder question behind the hype: not “can agents write code?” but “how do we know the output is correct?”
Use levels as diagnostics, not destiny
Shapiro’s levels are helpful when treated as a diagnostic tool. They give teams language for the shift from “AI helps me type faster” to “my job is now orchestration, validation, and judgment.”
They are less helpful when treated as a maturity ladder every team must climb. Different products, risk profiles, and compliance requirements should lead to different stopping points.
For engineering leaders, the practical use of the model is this:
- Identify your current operating mode honestly.
- Identify the current bottleneck (usually review and validation latency).
- Invest in the bottleneck before pushing for more autonomy.
What the factory metaphor gets right
The factory metaphor is directionally right in three ways.
- Specs matter more than before. Ambiguous intent gets amplified by agents.
- The role of engineers shifts toward system design, evaluation design, and risk management.
- Small teams can gain disproportionate leverage when verification is strong.
StrongDM’s published framing also reinforces a useful idea: autonomous code generation is only credible when connected to strong scenario-based validation.
Source: StrongDM Factory
Where the metaphor breaks
The metaphor becomes risky when it implies that humans are no longer needed in software delivery.
In practice, “no human review” is only defensible if review is replaced with something stronger than traditional PR review for the risk you are taking. As described by Simon Willison, StrongDM emphasizes scenario testing, holdout-style checks, and probabilistic “satisfaction” rather than only line-by-line human inspection.
Source: Simon Willison
That is not a no-human system. It is a human-repositioned system:
- less effort in hand-writing implementation
- more effort in defining constraints
- more effort in evaluation harnesses
- more effort in operating rollback and incident paths
The operating model that scales
For most engineering organizations, the scalable model is:
- Generation pipeline: agents propose diffs from tasks/specs.
- Verification pipeline: independent checks accept or reject diffs.
- Governance pipeline: humans set policy, thresholds, and escalation rules.
This model is more robust than “agent writes code, reviewer glances at PR” because it makes quality gates explicit and repeatable.
If you want autonomy, you have to pay for evidence.
Evidence usually means:
- scenario and regression harnesses
- specs and documentation quality
- observability with release-level attribution
- sandboxes and safe rollback mechanisms
- boundaries for where full autonomy is allowed
What to do in the next 90 days
For engineers:
- Turn recurring production bug patterns into automated end-to-end test scenarios.
- Build at least one holdout-style eval set agents cannot see while generating.
- Measure first-pass acceptance and post-release rollback rates, split by change source (human vs agent-assisted).
For engineering leaders:
- Define autonomy tiers by system criticality.
- Require explicit quality gates before increasing autonomy in any tier.
- Measure review latency, escaped defects, and rollback rate as first-class metrics.
Where human review should remain mandatory
Even with strong automation, keep mandatory human review for:
- security-sensitive changes
- data integrity or migration logic
- complex concurrency behavior
- compliance and policy-critical workflows
- incident-response and rollback playbooks
Autonomy is not a binary switch. It is a scoped capability that should expand only when evidence quality improves.
Bottom line
The goal is not to eliminate engineers. The goal is to move engineers to the highest-leverage work: defining outcomes, designing boundaries, and deciding what to trust.
Call it a factory if you want. Operationally, it is an evidence engine.