Agentic Bug Fixing: /undefect

I've got a skill that drastically improves code quality and architecture, and it's powered by defects.

I'm sure you've had your agent fix the same bug three or four times. That's not just a defect, it's a systemic failure.

Sure, we want to fix the bug -- but it's even better to fix the problem that caused the problem: to learn what went wrong and what conditions made us bug-prone. Fixing the bug isn't as powerful as fixing the conditions that created it.

That's what /undefect is for: it investigates a defect deeply and makes it easy to detect -- without fixing it yet. (If you'd rather see it work before reading the design, skip to what a run looks like.)

You can grab the skill from my skillz repo on GitHub, or read the full text right here:

---
name: undefect
description: >-
  Investigate a software defect without fixing it. Use when the user wants a
  complete bug investigation, root-cause explanation, early detection measures
  through types/tests/logging/user feedback, verification that each detection
  measure catches the defect, architectural recommendations, and a scan for
  similar defects from similar causes, while explicitly preserving the defect
  for a later fix.
---

# Undefect

Use this skill to understand a defect deeply and make it easy to detect, without fixing the defect yet.

## Prime Directive

Do not fix the defect as the outcome of this skill.

If investigation requires temporarily fixing the defect to prove causality, do it only as an experiment. Before finishing, revert the temporary fix and leave only detection, observability, documentation, or user-feedback changes that do not correct the defect itself.

Our main goal is to either eliminate the potential for this class of defect, or failing that, make detection simpler and faster.

## Workflow

1. Establish the defect precisely.
   - Reproduce it from the user-visible behavior, a failing command, a failing test, logs, or a minimal code path.
   - Record the expected behavior, actual behavior, trigger conditions, affected surface area, and any non-reproduction cases discovered.
   - Prefer local deterministic reproduction before production reproduction. Avoid production mutation unless the user explicitly allows it.

2. Explain why the bug exists.
   - Trace the code path from input/event to wrong outcome.
   - Identify the missing, incorrect, or duplicated responsibility that permits the defect.
   - Before calling it a local bug, ask whether the failing behavior is caused by two sources of truth, duplicated policy, split ownership, or drift between surfaces that should share one contract.
   - If the defect involves ordering, prioritization, authorization, validation, naming, state transitions, derived status, or other policy, explicitly identify the canonical owner that should decide that policy. If no clear owner exists, call that out as part of the root cause.
   - Distinguish facts from inferences.
   - If useful, prove causality with a temporary experiment, then revert that experiment.

3. Add the fastest local detection mechanism that can reveal the defect without fixing it.
   - Prefer compile-time/type-checking when realistic.
   - Otherwise prefer a focused unit/integration test.
   - Use end-to-end tests only when lower-level tests cannot express the failure.
   - Prefer detectors at the contract boundary or canonical policy owner over detectors in only one consumer. For drift bugs, add a detector that compares sibling surfaces or forces them through the same shared contract.
   - Keep the detector narrowly scoped to the defect.
   - Run the detector and confirm it fails or signals because the defect still exists.

4. Add the fastest production/system feedback mechanism that can reveal the defect without fixing it.
   - Prefer structured logging for invisible backend failures or missing side effects.
   - Add user feedback when the user can recover, retry, or needs to know work did not happen.
   - Avoid redundant detectors. If a type check makes a unit test unnecessary, or a test makes extra logging unnecessary, skip the redundant measure.
   - After each measure, trigger the defect and confirm the measure fires.

5. Preserve silent-failure prevention.
   - The final state should not allow the defect to remain completely silent for both developers and users.
   - Aim for one fast local signal and one fast production/user-facing signal.

6. Look for enabling architecture problems.
   - Identify refactorings that would make this class of defect harder to create.
   - Favor structural improvements: single source of truth, typed contracts, centralized side-effect helpers, parameterized APIs, exhaustive switches, shared adapters, or stronger invariants.
   - When there are multiple implementations of the same business policy, recommend combining them behind one owner instead of patching each caller independently.
   - When behavior differs by surface (UI, CLI, API, worker, integration), decide whether the difference is intentional. If it is not intentional, recommend a shared contract and list every surface that should consume it.
   - Present these as recommendations unless the user asks to implement them.

7. Look for similar defects from similar causes.
   - Search for nearby code paths, sibling features, duplicate implementations, adapter variants, route variants, and test gaps that share the same missing, incorrect, or duplicated responsibility.
   - Use the root cause as the search key, not just the symptom text.
   - For each plausible similar defect, report whether it is confirmed, ruled out, or still uncertain.
   - Add detection for confirmed similar defects only when it fits the user's requested scope and does not fix the defect.
   - Do not broaden into unrelated cleanup; keep the pass focused on defects enabled by the same cause.

## Verification Rules

After every detection or feedback measure:

- Trigger the defect again.
- Report the command, test, log line, or UI feedback that proves the detector worked.
- If the detector does not fire, improve it or explain why that detection route is not viable.

Before finishing:

- Re-run the fastest detector.
- Confirm the defect is still unfixed unless the user explicitly changed the goal.
- Confirm any temporary fix experiments were reverted.

## Final Response Shape

Lead with the root cause and the detection measures added.

Include:

- Why the defect exists.
- What detection/feedback was added.
- How each detector was verified.
- Whether the defect remains intentionally unfixed.
- Any architectural refactorings recommended for a later change.
- Similar defects from similar causes that were confirmed, ruled out, or left uncertain.
- Any tests or commands that were not run.

What a run looks like

Here's a (lightly fictionalized) run on a drift bug -- the case /undefect is most interesting on -- as it actually scrolls by in the terminal:

> /undefect customers see "free shipping unlocked" at a $50 cart subtotal,
  but checkout still charges them $6.99 shipping

● I'll investigate without fixing it: reproduce, find the root cause, and
  leave behind detection that fails *because* the bug is still live.

● Search(pattern: "free.?shipping|shippingCost|FREE_SHIPPING")
  ⎿  6 matches across 4 files

● Read(cart/summary.js)
  ⎿  read 24 lines
● Read(checkout/shipping.js)
  ⎿  read 18 lines

● Reproduced. The threshold lives in two places and they disagree at the
  boundary:

      cart/summary.js       subtotal >= 50   -> banner says "free"
      checkout/shipping.js   subtotal > 50    -> charges $6.99

  At exactly $50 the cart says free and checkout charges. This isn't a local
  bug -- it's two sources of truth for one pricing policy, with no canonical
  owner. Both surfaces re-derive the rule, so they're free to drift.

● Adding a local detector at the contract seam -- it forces both surfaces to
  agree on the same input instead of testing just one consumer.

● Write(shipping-policy.contract.test.js)
  ⎿  wrote 6 lines

● Bash(npx vitest run shipping-policy.contract.test.js)
  ⎿   FAIL  shipping-policy.contract.test.js > banner and checkout agree at 50
          expected false to be true
          - Expected:  true   (banner: free)
          + Received:  false  (checkout: charged $6.99)
        Tests  1 failed | 5 passed (6)

● Good -- red, and only at 50. The reproduction is now nailed to a test.

● Adding a production signal so the silent version of this stays loud: a
  structured log when the two surfaces disagree at request time.

● Update(checkout/shipping.js)
  ⎿  +3 lines (detection only -- does NOT change the charge)

● Bash(curl -s localhost:3000/checkout -d subtotal=50 >/dev/null; tail -1 dev.log)
  ⎿  WARN shipping_policy_drift {"subtotal":50,"banner":true}

● Fires. The $50 cart is still charged $6.99 -- I have not fixed the bug.

● Sweeping for the same *cause* ("threshold re-derived per surface"), not the
  symptom text:
  ⎿  email/receipt.js             CONFIRMED  hardcodes `subtotal > 50`
     promo/eligibility-worker.js  CONFIRMED  uses `>= 50` AND a stale `45` flag
     mobile/CartScreen.tsx        UNCERTAIN  calls /cart/summary -- needs a trace
     admin/order-export.csv       RULED OUT  reads persisted shipping_charged

● Done. The defect is still live, and now it's cornered:

  Root cause   Two sources of truth for the free-shipping threshold, no
               canonical owner. `>=` vs `>` drift breaks the $50 boundary.
  Detection    1 failing contract test (red at $50) + 1 drift log line.
  Verified     Watched the test fail for the right reason; saw the log fire.
  Still broken Yes -- $50 carts are still charged. Fix intentionally deferred.
  Recommend    One owner: shippingPolicy.qualifiesForFree(subtotal), called by
               both surfaces. The contract test then guards the seam forever.
  Similar      2 confirmed, 1 uncertain, 1 ruled out (above).

One $50 bug, cornered by a test I watched go red -- plus two more places where the same policy is confirmed duplicated and a third still to trace, all surfaced by the sweep.
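
For a sense of what a six-line contract test like the one in the run might contain, here's a minimal sketch in plain JavaScript. The two functions are stand-ins I made up for the comparisons in cart/summary.js and checkout/shipping.js (the real files aren't shown), and a small helper replaces vitest so the snippet runs on its own.

```javascript
// Hypothetical stand-ins for the two surfaces in the run above.
// Only the comparisons (>= 50 and > 50) are taken from the transcript.
function bannerSaysFree(subtotal) {
  return subtotal >= 50; // cart/summary.js
}

function checkoutShippingCharge(subtotal) {
  return subtotal > 50 ? 0 : 6.99; // checkout/shipping.js
}

// Contract check: for the same subtotal, "banner says free" must match
// "checkout charges nothing". Returns every subtotal where they disagree.
function findDisagreements(subtotals) {
  return subtotals.filter(
    (s) => bannerSaysFree(s) !== (checkoutShippingCharge(s) === 0)
  );
}

const disagreements = findDisagreements([0, 49.99, 50, 50.01, 75, 100]);
console.log(disagreements); // [ 50 ]
```

The point of the shape is that neither surface is treated as correct. The check only asserts that they agree, which is why it sits at the seam rather than inside one consumer.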

Here's the thinking behind the design.

Why forbid the fix?

Like I said, this skill purposely leaves the defect unfixed. That's so you can check whether the tests or other measures actually find the defect. I can't tell you how many times I've seen people write tests after the fix, and they get coverage but don't actually ensure the defect can't happen again. You have to write your tests first and see them fail in order to "test your tests".

The moment a bug is fixed, the agent loses its reproduction. There's nothing left to point a new test at, nothing to confirm a log line fires for, nothing to prove an architectural change actually closes the hole. The defect is the only ground truth you have, and a fix destroys it.

So /undefect keeps the bug alive on purpose, and uses it as a target to verify against. Every detection measure it adds gets run against the live defect to prove it actually fires. By the time the skill finishes, you have a failing test (or a type error, or a log line) that you've watched fail for the right reason.

Root cause, not symptom

The symptom above was "checkout charged $6.99 at a $50 cart." The cause was "we handle pricing differently in two different places instead of having a single clear source of truth."

Once you've got the root cause, you've identified a pattern that allows you to ask:

- Are there other possible defects caused by this pattern?
- Do we have any safety measures in place to help eliminate/prevent/detect this pattern in the future?

You want to inoculate in two directions: across the codebase and into the future. Every defect is an opportunity to improve the system that the agent is working in.
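
For the free-shipping example, the structural answer is the single owner the run recommended. Here's a rough sketch of what that later change could look like. It is not something /undefect itself would write, since it fixes the bug. The name qualifiesForFree comes from the run; the constant names and the choice of an inclusive boundary are my assumptions.

```javascript
// Sketch of the "one owner" recommendation (a later fix, not /undefect output).
// Constant names and the inclusive >= boundary are assumptions.
const FREE_SHIPPING_THRESHOLD = 50;
const STANDARD_SHIPPING = 6.99;

const shippingPolicy = {
  // The only place that decides whether an order ships free.
  qualifiesForFree(subtotal) {
    return subtotal >= FREE_SHIPPING_THRESHOLD;
  },
};

// Each surface asks the owner instead of re-deriving the rule.
function cartBanner(subtotal) {
  return shippingPolicy.qualifiesForFree(subtotal) ? "free shipping unlocked" : "";
}

function checkoutShipping(subtotal) {
  return shippingPolicy.qualifiesForFree(subtotal) ? 0 : STANDARD_SHIPPING;
}

console.log(cartBanner(50), checkoutShipping(50)); // free shipping unlocked 0
```

With one owner, `>=` versus `>` stops being something two files can disagree about. It becomes a single decision made in a single place.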

One fast signal locally, one in production

Steps 3 and 4 are a deliberate pair, and step 5 is the rule that ties them together. A test that catches the bug on your laptop does nothing for the version of this bug that happens silently in production six weeks from now. A log line does nothing for the version a teammate reintroduces before they ever deploy. So the skill aims for both: one fast local detector and one fast production/user-facing signal, and explicitly tries to avoid the defect ever being completely silent again.

It also tries not to pile on redundant detectors. If a type makes the bug impossible to express, you don't also need a unit test for it. The cheapest, fastest signal that actually fires wins.
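
The production half can be very small. Here's a sketch of a detection-only check along the lines of the "+3 lines" in the run. The function names and the injectable logger are assumptions of mine; the event name and payload mirror the log line from the transcript.

```javascript
// Hypothetical checkout code with a detection-only drift check.
// Function names and the logger parameter are assumptions for this sketch.
function bannerSaysFree(subtotal) {
  return subtotal >= 50;
}

function shippingCharge(subtotal, log = console.warn) {
  const charge = subtotal > 50 ? 0 : 6.99; // the defect, deliberately left in place

  // Detection only: report when the banner promised free shipping but we charge.
  if (bannerSaysFree(subtotal) && charge > 0) {
    log("shipping_policy_drift " + JSON.stringify({ subtotal, banner: true }));
  }

  return charge; // the check above never changes what the customer pays
}

// Capture log lines in an array so the signal can be inspected.
const lines = [];
const charged = shippingCharge(50, (line) => lines.push(line));
console.log(charged, lines); // still charged, plus one drift line
```

The charge is computed before the check and returned untouched after it, which keeps this on the right side of the "do not fix" rule.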

Verify everything fires

I don't trust a detector I haven't watched fail. Neither should the agent. The verification rules force it to trigger the defect again after each measure and report the exact command, test, or log line that proved the detector worked. This is the same instinct as TDD's red step; a test you've never seen fail is a test you can't trust.

LLMs are especially prone to writing a confident-looking test that passes for the wrong reason. Making the defect prove the detector closes that gap.
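
Here's a small sketch of that idea in code. Everything in it is hypothetical: two versions of the charge function (the live defect, and a temporary experimental fix of the kind the skill reverts before finishing) and two detectors, one of which looks reasonable but never touches the boundary.

```javascript
// Hypothetical implementations: the live defect, and a temporary
// experimental fix that would be reverted after the check.
const defective = (subtotal) => (subtotal > 50 ? 0 : 6.99);
const experimentalFix = (subtotal) => (subtotal >= 50 ? 0 : 6.99);

// A detector returns true when it considers the behavior correct.
// This one looks plausible but never probes the boundary.
const weakDetector = (charge) => charge(75) === 0;
// This one probes exactly where the report says things break.
const boundaryDetector = (charge) => charge(50) === 0;

// A detector earns trust only if it goes red on the live defect
// and green once the defect is temporarily removed.
function firesOnDefect(detector) {
  return detector(defective) === false && detector(experimentalFix) === true;
}

console.log(firesOnDefect(weakDetector));     // false
console.log(firesOnDefect(boundaryDetector)); // true
```

The weak detector is green on both versions, so it would happily ship alongside the bug. Only running it against the live defect exposes that.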

The similar-defects sweep

Step 7 is the payoff for doing root-cause analysis properly: the same missing validation probably exists in the sibling route, and the drift the UI just exposed probably exists on the CLI too. So the skill goes looking, and reports each candidate as confirmed, ruled out, or uncertain.

This is where one bug investigation quietly prevents three more.
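
To make "search by cause, not symptom" concrete, here's a toy sketch of the triage. The file names come from the run above, but the source snippets and both regular expressions are invented for illustration. A real sweep reads the code; a pattern match only nominates candidates.

```javascript
// Illustrative source snippets; file names come from the run above,
// but the contents are invented for this sketch.
const candidates = [
  { file: "email/receipt.js", source: "const free = subtotal > 50;" },
  { file: "mobile/CartScreen.tsx", source: "const summary = await fetch('/cart/summary');" },
  { file: "admin/order-export.csv", source: "row.push(order.shipping_charged);" },
];

// Search key is the cause: a surface comparing subtotal to a literal on its own.
const REDERIVES_THRESHOLD = /subtotal\s*(>=|>)\s*\d+/;
// Positive evidence that a surface only reads an already-decided value.
const READS_PERSISTED = /shipping_charged/;

function classify(source) {
  if (REDERIVES_THRESHOLD.test(source)) return "CONFIRMED";
  if (READS_PERSISTED.test(source)) return "RULED OUT";
  return "UNCERTAIN"; // no evidence either way: needs a manual trace
}

const report = candidates.map((c) => `${c.file}  ${classify(c.source)}`);
console.log(report.join("\n"));
```

Note the default: a candidate is only ruled out on positive evidence. Anything the sweep can't prove either way stays uncertain instead of quietly dropping off the list.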

Why this is a separate skill from fixing

You could imagine folding all of this into a normal "fix the bug" flow. I keep it separate on purpose. Fixing is goal-directed and wants to converge fast; investigation wants to stay open and suspicious. Mixing them means the convergence pressure of "make it green" wins, and the investigation gets skipped. By making /undefect a thing you run before you fix -- and one that refuses to fix -- the investigation actually happens.

Better Taste == Better Autonomy

Concretely, this is what changes for me day to day: I can hand the agent a list of 20 defects, tell it to run /undefect on each, implement the recommendations, and fix the bugs -- and trust that what comes back is investigated, not just patched.

/undefect is really just my taste in debugging, written down -- the same way /defrag is my taste in refactoring. I doubt that bigger or better models alone will get us to sufficient agent autonomy. We need to inject our human taste into the recipe or be ever-present babysitters. Every skill like this is one more thing you no longer have to be in the room for.

Tags: ai, agentic, debugging
