The cycle
① Audit → ② Orient → ③ Build → ④ Stay Sharp → ↻ repeat at every model release
Before you build
① Audit
go/no-go test — part A (kill)
- Could this be replaced by one multimodal prompt + 1–2 tool calls or an MCP?
- Is this Software 1.0 plumbing — an orchestration layer, API wrapper, or UI shell — around something the model now does natively?
// if any answer is YES →
Stop or pivot. You are building plumbing the next release will absorb.
go/no-go test — part B (build)
- Does this only exist because of Software 3.0, i.e. was it genuinely impossible before LLMs?
- Does it embed genuine domain expertise or proprietary context the model can't replicate alone?
- Does it solve something the model alone cannot?
- Will it still pass both tests after the next model release?
// if all answers are YES →
Proceed. This is worth building.
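The two-part test above can be sketched as a small decision function. This is an illustrative sketch only: the class name, field names, and the "undecided" outcome are assumptions added here. The checklist itself only defines the "any YES in part A" and "all YES in part B" cases.

```python
from dataclasses import dataclass


@dataclass
class AuditAnswers:
    # Part A (kill): True means "YES" to a kill question.
    replaceable_by_prompt_and_tools: bool
    is_plumbing_around_native_capability: bool
    # Part B (build): True means "YES" to a build question.
    only_possible_with_llms: bool
    embeds_domain_expertise: bool
    solves_what_model_alone_cannot: bool
    survives_next_release: bool


def go_no_go(a: AuditAnswers) -> str:
    """Return 'stop-or-pivot', 'proceed', or 'undecided' (the last is an assumption)."""
    kill = [a.replaceable_by_prompt_and_tools, a.is_plumbing_around_native_capability]
    build = [
        a.only_possible_with_llms,
        a.embeds_domain_expertise,
        a.solves_what_model_alone_cannot,
        a.survives_next_release,
    ]
    if any(kill):  # part A: any YES stops the project
        return "stop-or-pivot"
    if all(build):  # part B: every answer must be YES
        return "proceed"
    return "undecided"  # not covered by the checklist; needs more thought
```

Checking part A first mirrors the order of the checklist: a single kill answer ends the audit regardless of how part B looks.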
② Orient
Finding your verifiable niche
- Listed the domains where I have genuine deep expertise
- Confirmed outputs are objectively verifiable: correct/incorrect, better/worse, faster/slower
- Checked: are frontier labs specifically applying RL to this niche? (If no → window is open)
- Mapped the RL environment: defined reward, penalty, and what "better" means concretely
- Validated niche with a well-prompted agent before committing to fine-tuning
- Written a clear thesis: what I'm building, why it passes the S3.0 test, what makes it defensible
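To make "defined reward, penalty, and what better means" concrete, here is a minimal sketch of a verifiable reward function. Everything in it is a hypothetical illustration: the `Attempt` type, the exact-match correctness check, the penalty of -1.0, and the speed bonus are assumptions, not values taken from this checklist. A real niche would substitute its own objective check.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    output: str      # what the agent produced
    expected: str    # the objectively correct answer
    seconds: float   # how long the attempt took


def reward(attempt: Attempt, time_budget: float = 10.0, penalty: float = -1.0) -> float:
    """Score one attempt: wrong answers are penalised, correct ones earn a speed bonus."""
    # Correct/incorrect: exact match after trimming whitespace (assumed check).
    if attempt.output.strip() != attempt.expected.strip():
        return penalty
    # Faster/slower: up to 0.5 extra for finishing well inside the time budget.
    speed_bonus = 0.5 * max(0.0, 1.0 - attempt.seconds / time_budget)
    return 1.0 + speed_bonus
```

If a function like this cannot be written for the niche, that is a sign its outputs are not objectively verifiable in the sense the checklist asks for.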
Knowledge brain set up — runs across all phases
- Created a dedicated folder for strategy/knowledge documents
- Written an initial context prompt covering company, product, domain, and goals
- Generated foundational docs (market positioning, technical direction, opportunity map) as markdown
- Habit established: new ideas go through the brain before committing
During the build — Phase ③
③ Build
Before handing off to an agent
- Spec and plan written — "done" is defined before the agent starts
- Task is scoped tightly — one agent, one contained objective
- Pass/fail criteria defined — tests, expected outputs, observable behaviour
- Asked: what is the text I copy-paste to the agent? (Not a script for a human to run)
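One way to answer the copy-paste question is to keep the spec as structured data and render it into the exact text handed to the agent. The sketch below assumes a hypothetical `TaskSpec` type; the field names and prompt layout are illustrative, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    objective: str                 # one contained objective for one agent
    done_when: list[str]           # pass/fail criteria, written before the agent starts
    out_of_scope: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the spec as the text to paste to the agent."""
        lines = [f"Objective: {self.objective}", "", "Done when:"]
        lines += [f"- {criterion}" for criterion in self.done_when]
        if self.out_of_scope:
            lines += ["", "Out of scope:"]
            lines += [f"- {item}" for item in self.out_of_scope]
        return "\n".join(lines)


# Hypothetical example task.
spec = TaskSpec(
    objective="Add input validation to the signup form handler",
    done_when=["All existing unit tests pass", "Invalid emails return HTTP 400"],
    out_of_scope=["Changing the database schema"],
)
prompt = spec.to_prompt()
```

Requiring `done_when` to be non-empty before handing off is a cheap way to enforce the first item in the list above.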
While agents are running
- Context window is clean — stale or bloated context actively removed
- Running at most 3–4 parallel agents (the limit is human review bandwidth, not a technical one)
- Checking in regularly — not letting any agent run open-endedly
- Knowledge brain provided as context where domain relevance matters
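The parallelism cap and the "no open-ended runs" rule can be sketched with a semaphore and a timeout. `run_agent` is a hypothetical stand-in for a real agent call, and the limit of 3 and the 5-second timeout are assumed values for illustration.

```python
import asyncio

MAX_PARALLEL_AGENTS = 3  # assumed value; bounded by review bandwidth, not by the machine


async def run_agent(task: str) -> str:
    """Hypothetical stand-in for a real agent invocation."""
    await asyncio.sleep(0.01)
    return f"draft for: {task}"


async def run_all(tasks: list[str], limit: int = MAX_PARALLEL_AGENTS,
                  timeout_s: float = 5.0) -> list[str]:
    """Run every task, with at most `limit` agents active at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(task: str) -> str:
        async with semaphore:
            # The timeout keeps any single agent from running open-endedly.
            return await asyncio.wait_for(run_agent(task), timeout=timeout_s)

    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*(bounded(t) for t in tasks))
```

Calling `asyncio.run(run_all(["task a", "task b", "task c", "task d"]))` returns four drafts, with no more than three agents in flight at any moment.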
After — reviewing agent output
- Reviewing output as a draft, not a deployment
- Checked for: bloat, excessive copy-paste, brittle abstractions, silent logic errors
- Unit tests and smoke tests pass
- CI blockers in place — bad code cannot reach production automatically
Agent-first infrastructure
③ Build
External-facing product
- llms.txt added to root: tells agents what the product does, how the API works, how to trust it
- Docs are agent-readable: clear structure, no assumed context, direct instruction format
- Core functionality exposed via MCP, clean API, or agent-compatible tool calls
- Asked: can an agent understand and use this within a single context window, without human translation?
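As a rough illustration of the llms.txt item, the sketch below renders a small file in the layout used by the llms.txt proposal as commonly described: an H1 title, a blockquote summary, then H2 sections of annotated links. The product name, URLs, and section titles are hypothetical placeholders.

```python
def render_llms_txt(name: str, summary: str,
                    sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Build llms.txt content from a title, a summary, and sections of (label, url, note)."""
    lines = [f"# {name}", "", f"> {summary}"]
    for title, links in sections.items():
        lines += ["", f"## {title}", ""]
        lines += [f"- [{label}]({url}): {note}" for label, url, note in links]
    return "\n".join(lines) + "\n"


# Hypothetical product and URLs, for illustration only.
llms_txt = render_llms_txt(
    name="ExampleProduct",
    summary="Converts scanned invoices into structured JSON.",
    sections={
        "API": [("Reference", "https://example.com/docs/api.md",
                 "endpoints, auth, rate limits")],
        "Trust": [("Security", "https://example.com/docs/security.md",
                   "data handling and retention")],
    },
)
```

The three sections of intent named in the checklist (what it does, how the API works, how to trust it) map onto the summary, the API section, and the Trust section respectively.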
Internal tooling
- Human UI stripped where an agent, not a person, is the primary consumer
- Data structures are LLM-legible: descriptive field names, explicit schemas, no visual-layout dependency
- Internal docs written as agent briefs, not employee onboarding
- Agent-native benchmark passed: agent can complete primary workflow end-to-end, no human mediation
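A small sketch of what "LLM-legible" data might look like in practice: descriptive field names with units in the name, plus a schema an agent can read alongside the data. The `InvoiceRecord` type and its fields are hypothetical examples, not part of this checklist.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class InvoiceRecord:
    # Descriptive names carry units and formats, so no visual layout is needed.
    invoice_id: str
    customer_name: str
    amount_due_usd: float
    due_date_iso8601: str


def describe_schema(cls) -> dict[str, str]:
    """Return an explicit field-name -> type-name mapping for a dataclass."""
    return {f.name: getattr(f.type, "__name__", str(f.type)) for f in fields(cls)}


def to_agent_json(record: InvoiceRecord) -> str:
    """Serialise a record together with its schema, so it is self-describing."""
    return json.dumps({"schema": describe_schema(type(record)), "data": asdict(record)})


record = InvoiceRecord("INV-001", "Acme Ltd", 120.5, "2025-01-31")
payload = to_agent_json(record)
```

Compare this with a record keyed as `id`, `n`, `amt`, `d`: the data is identical, but an agent would need outside context to use it.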
Staying sharp — Phase ④
④ Stay Sharp
Understanding practice
- Knowledge brain is being actively fed — articles read → agent updates relevant sections
- Regularly asking the brain: "What do I know about X? What are the gaps? What should I read next?"
- Capable of evaluating agent output critically — know what "good" looks like in this domain
- Building for where capabilities will be in 3–6 months, not only today's baseline
- Phase ① audit scheduled for next major model release
Key tests at a glance
| Test | Question to ask |
| --- | --- |
| Kill test | Can one prompt + tool calls replace this? |
| Build test | Does this only exist because of LLMs? |
| Agent-native test | Can an agent deploy and use this without human translation? |
| Niche test | Domain expertise + verifiable outputs + no lab RL coverage? |
| Understanding test | Am I outsourcing execution or understanding? |
| Setup test | What's the text I copy-paste to my agent? |
| Direction test | Can I evaluate what the agent produced, and do I know what good looks like? |
Core principles — reference
P1 · New paradigm
Don't ask: "how do I use AI to do this faster?" Ask: "what computing paradigm does this problem belong to?"
P2 · Verifiability is the moat
Domains with objectively measurable outputs support RL and fine-tuning. That's where durable advantages are built.
P3 · Everything is built for humans
Agents need sensors and actuators. Design clean data access and clean action paths — not human-facing UIs.
P4 · Outsource execution, not understanding
You can't be a good director without genuine domain understanding. That's the scarcest resource now.
P5 · Vibe coding has a ceiling
Production work needs agentic engineering: spec-first, structured review, tests, CI. Cap parallel agents at 3–4, the most you can review properly.
P6 · Ghosts, not animals
LLMs are statistical circuits, not intelligences. Expect jagged capability — highly capable in some areas, hard edges in others.