3.5 Months Training One AI Assistant Across Shell, Python/Rust, Rust, and Terraform/Ansible

Why I did this

Most advice about AI coding assistants comes from people who ran them on one demo repo in one language for two weeks. I spent the last 3.5 months doing the opposite: building a single, portable assistant configuration — in both Claude Code and Cursor — and forcing it to survive four genuinely different codebases: a shell tooling repo, a Python/Rust mixed project, a pure Rust codebase, and a Terraform/Ansible infrastructure repo.

I did it because I spent 25 years running production systems, and the way teams were adopting these tools looked exactly like every unguarded production rollout I've ever been called in to clean up after: no cost visibility, no review standards, and agents one keystroke away from infrastructure they had no business touching.

This is what I learned. The config itself isn't public — but the lessons are, and they're the part most teams are missing.

The shape of the work, by the numbers: ~3,000 commits across the config and training repos over the period, 34 reusable skills, 13 per-language standards documents, 14 hooks wired into the session lifecycle. Daily-driver use — multiple sessions a day, every day, across all four stacks — not a side experiment.

The shape of the thing

The architecture that emerged, after a lot of rework:

.claude/
├── CLAUDE.md            # global identity, behavior, instruction priority
├── settings.json        # model pin, env, permissions, hook wiring
├── standards/           # 13 per-language reference docs (rust, python, shell, terraform, ansible, …)
├── skills/              # 34 reusable workflows (pr-review, tdd, security-review, bug-scan, …)
├── hooks/               # 14 lifecycle scripts (PostToolUse, SessionStart, Notification)
├── memory/              # persistent cross-session knowledge corpus
├── scripts/             # utility scripts (prefill profiler, memory query, DoD logger, …)
├── cost-log/            # per-session token spend tracking
├── dod-log/             # definition-of-done machine evidence (SHA-bound)
├── retrospectives/      # post-incident and post-session learning records
├── loop-log/            # subagent iteration telemetry
├── aggregator-log/      # learnings aggregator output
├── mutation-log/        # mutation testing records
├── postmortems/         # incident writeups
├── review-cycle/        # PR review gate logs
├── triage-log/          # bug triage records
├── renovate-log/        # dependency update handling
└── renovate-triage/     # dep update decision records
.cursor/
└── …                    # equivalent surface for the second tool

The split that mattered most: global config vs. per-repo config. Global holds identity, safety posture, and cost controls; per-repo holds language norms, project conventions, and the gates appropriate to that repo's blast radius. Getting this boundary wrong was the source of most early pain.

An early version kept language-specific learnings — "use this clippy lint", "prefer this Python test runner", "the cargo toolchain on this machine is nightly" — as global memory entries. Two failure modes followed me into every repo. First, the shell repo started getting unsolicited Rust advice on every session start, because the model treated a global memory as universally relevant. Second, a stale "nightly toolchain" entry caused a pure-Rust crate to fail CI for a week before I traced it back to the wrong global note. The fix was structural: language learnings get routed to standards/<lang>.md files that load per repo via CLAUDE.md @-includes, and global memory only holds things that are genuinely identity-level — the user's role, the user's preferences, durable workflow rules. The hygiene rule that came out of it: if a memory entry has a language name in it, it does not belong in global.

What survived contact with all four languages

These are the pieces that proved genuinely portable — the load-bearing walls.

1. A structured, phased review chain

Free-form "review this" prompts produced confident slop. What worked was decomposing review into a chain of discrete skills — each with a single mandate, run in order, with a HOLD at any step blocking everything downstream.

Step 1  — pr-review            detect scope, load standards, six-phase review, PASS/HOLD verdict
Step 2  — perf-regression      Criterion benchmark delta vs. master baseline (math/etch-cli only)
Step 3  — security-review      secrets, injection, privilege, dependency provenance
Step 4  — dependency-review    license (GPL/AGPL = HOLD), maintenance health, slopsquatting
Step 5  — test-quality-review  trivial assertions, mock-return assertions, missing error paths
Step 6  — diff-scope-review    scope creep, unintended file inclusion, blast radius
Step 7  — bug-scan             behavioral correctness gate — logic errors, silent failures
Step 8  — docs                 README, CLAUDE.md, examples synced with the change
Step 9  — learnings            patterns and gaps captured to memory and standards
Step 10 — finish               merge or PR creation; CI monitoring until green

Within pr-review (Step 1) there are six internal phases — security, TDD coverage, logic and correctness, code quality, documentation, and IaC safety — each a separate checklist so the model can't blur them. The chain grew from a single six-phase skill to ten discrete skills over the period; the major additions were the dependency provenance check (after a slopsquatting near-miss), the behavioral correctness gate (after a class of logic bug that passed all tests), and the diff-scope review (after a git add -A swept in unintended files).

2. PostToolUse hooks as the enforcement layer

The single highest-leverage idea: stop asking the model to behave and start making the environment enforce it. Lint, format, and safety checks fire after every edit, mechanically. The model's compliance becomes irrelevant.

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit|MultiEdit",
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/format-on-edit.sh" }
        ]
      }
    ]
  }
}

The script behind it is twenty lines: dispatch on file extension, run rustfmt on .rs, shellcheck + shfmt on .sh, ruff --fix on .py, terraform fmt on .tf. The model never has to remember to format. The format is just always correct, because the environment enforces it on the way out.

3. Cost governance as config, not policy

The expensive failure modes are config-level and findable. The one that bit me: Claude Code silently promoting sessions to the premium 1M-context model variant. On a Max plan, Sonnet's 1M variant bills extra — always. The fix is one line:

{ "env": { "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1" } }

The 1M variant carries roughly a 2× input-token rate above the 200K threshold — every long session was silently billing at premium rates. Disabling auto-promotion eliminated the surcharge across all sessions on the affected machine. I followed it up with a SessionStart hook that watches for a [1m] suffix in the model identifier — explicit pins in project-level settings would otherwise bypass the global env var. The hook fires a push notification and injects a warning into the session context, so the next misconfiguration shows up the moment a session starts, not on the next billing cycle.

The broader pattern: context is billed every turn, so Context Engineering is cost engineering. Bloated always-on CLAUDE.md files, unused MCP servers injecting tool definitions, undisciplined session sprawl — each is a silent tax. Context Engineering — the discipline of dynamically assembling exactly what the agent needs, nothing more — is not a UX concern; it is a spend control.

After the standards extraction pass — moving 13 language-specific rule sets out of memory and into @-included files that load only when relevant — the always-on global CLAUDE.md surface dropped by more than half. The savings compound: every turn re-bills the system prompt, so a one-time trim pays out for the entire lifetime of every future session.

4. Notification-driven long-running work

The fourth load-bearing pattern was treating the model as an async worker. Long-running tasks — a PR review after gh pr create, a stale-branch sweep at session end, a coverage run, a multi-crate test loop — are all wired to push notifications via a single notification hook. A session can kick off a 20-minute job, I close the laptop, and the phone tells me when it lands. The corollary: hook chains run unattended, so they must be idempotent and must fail closed. A hook that silently exits 0 on error is worse than no hook — it costs you the assumption that the gate ran.

What didn't survive (the graveyard)

The section nobody writes because theorists don't have one. Things I built, believed in, and deleted:

Trusting line-coverage parity across OSes

In the Python sub-projects, a single coverage invocation was the floor of the test gate and worked the same on macOS and Linux. I assumed Rust would behave the same way once cargo-tarpaulin was wired up. It does not. Linux ptrace counts more lines as coverable than the macOS tarpaulin backend — fn main() body lines, every separate argument on a multi-line macro, and explicit break; statements all register as instrumented probes that never fire under test. The same crate passed at 95% line coverage locally and failed at 87% on CI. The fix wasn't writing more tests; the fix was understanding the instrumentation: exclude fn main() with #[cfg(not(tarpaulin_include))], collapse multi-line macros to single lines, rewrite explicit break; into a loop-condition predicate. That whole catalog now lives in standards/rust.md and is loaded into the session before any Rust review runs. The lesson generalizes: coverage tools have OS-specific instrumentation surfaces, and the model will not figure that out on its own.

Defaulting the cheap model

I switched the default model to the cheap tier for a day, on the theory that most edits don't need the flagship's reasoning. Output quality on the multi-file refactor work fell off a cliff — not subtly, immediately. Plans that previously came back ordered and complete came back partial and incoherent. Reverted within hours. The honest lesson: the cost-controls win is not "pick a smaller model by default", it's "kill the silent upgrade to a more expensive variant of the model you already chose". One of those is a quality compromise. The other is free money.

One config to rule both tools

The original goal was a single config that drove Claude Code and Cursor identically. That dream survived about four weeks. The two tools have different hook lifecycles, different skill/rule semantics, different ideas about what counts as a session, and Cursor auto-generates SDK stub files into the working tree that have to be .gitignored out of the way. The portable artifacts are the standards documents and the language conventions — those are pure markdown and read identically by both. The tool-specific surface (hooks, settings, skill metadata) split into parallel directories and stayed split. Round goes to Claude Code on hook expressiveness and to Cursor on inline review UX; neither is a strict superset.

The language deltas: why one config can't rule them all

The most transferable finding: the assistant is portable, the guardrails are not. Each stack needed different gates because each stack fails differently.

Shell — shellcheck on every write, no exceptions, plus a hook that refuses to run rm -rf against an unquoted variable expansion. Shell is the language where a confident, plausible-looking one-liner does the most damage with the least friction. The defense is mechanical: lint on write, refuse on a known-bad pattern, and forbid set -e outside git hooks because its interaction with conditionals and pipes silently swallows the failures you most need to see.

Python/Rust (mixed) — Per-directory toolchains: Python lives at the top of each sub-project, Rust lives in a <name>-rs/ sibling. Hooks dispatch on file path, not extension alone, because the same repo has two coverage tools, two test runners, two formatters. On the Python side, the toolchain consolidated over the period: ruff replaced both black and isort (lint, format, and import sorting in one tool), pytest replaced python3 -m unittest (discovers existing TestCase subclasses without rewriting them), and pyright plus mypy both run — pyright for fast IDE feedback, mypy as the strict CI gate. mutmut and hypothesis were added for mutation testing and property tests respectively. The boundary cases the model got wrong, repeatedly, until the gates were in place: applying Python idioms inside the Rust crate (catching exceptions that don't exist), and reaching for ProcessPoolExecutor patterns on macOS where the spawn context leaks worker processes if you do not gate it behind an input-size threshold.

Rust — Pure Rust gets the cheapest verification oracle of the four: the compiler. cargo clippy -- -D warnings in a PostToolUse hook plus cargo nextest as the test runner means most errors surface in seconds — the model edits, the gate runs, the model sees the diagnostic, and the loop converges. cargo machete was added to catch unused Cargo.toml dependencies that clippy misses. Coverage via cargo tarpaulin runs in CI only — not in the pre-push hook — because tarpaulin and nextest don't mix (tarpaulin still calls cargo test directly). The expensive lesson here was upstream of the assistant: a cargo fmt exit code was being swallowed by the wrapper script, so the format gate looked clean even when files were dirty. Propagating the exit code through the wrapper was a six-line fix that closed a hole the model would never have noticed on its own. Borrow-checker fights are rare once standards/rust.md documents the common clippy patterns; the assistant's failure mode is more often producing technically-compiling code that violates lints the codebase already enforces.

Terraform/Ansible — the highest-stakes repo, and the reason "safety gates" is in my positioning rather than a buzzword. Code-repo mistakes cost a revert; IaC mistakes cost an outage. The gates here are categorically stricter: plan-before-apply enforcement, protected-resource patterns, and a hard rule that the agent proposes but never applies.

The clearest single illustration: a baseline sweep across the Ansible side of the infrastructure repo, 33 roles touched in one change. The standard linter — ansible-lint — reported clean. The gate I'd added on top of it was a community-of-practice rule set: a second pass that enforces conventions the basic linter ignores. It caught a category of finding that would have been invisible otherwise. One role responsible for managing several user accounts had been written with task-level loops where each task hard-coded my own username as the target. The role had been "working" for months because the failure mode was silent — the convergence happened, the playbook reported success, and the wrong account got modified on every run. The CoP gate flagged it as a variable-substitution violation. The same sweep also caught: a config template with the wrong comment-style argument for a parser that is format-sensitive about leading characters and would have rejected the deploy at runtime, a data directory with permissions tight enough to block the container that owned it from writing, and a package-source GPG key URL that fell back to an old, since-rotated key because the intended variable was undefined.

None of those would have shown up in a code-only review. The point of the IaC gate isn't to catch syntax — the language servers already do that. It's to catch the failures that look like success: silent variable substitution, wrong-mode permissions, undefined fall-backs to deprecated state. Across the period, the IaC gates fired against more than 30 distinct role/module changes — the count isn't the point, the category is: every catch was a thing that would have looked green and broken something in production.

What I'd do differently starting today

If I were starting a team at month zero instead of month 3.5:

Wire enforcement hooks on day one. Lint, format, and safety checks belong in the environment, not in the prompt. The model's compliance is not your safety story.
Turn off silent model upgrades before the first real session. The 1M-context auto-promotion is the obvious one, but the general rule is: audit every config knob that can move the model to a more expensive variant without telling you.
Scale gate strictness to blast radius, and start with IaC strictest. The instinct is to harden the code repos first because that's where the most edits happen; the right call is to harden the infrastructure repo first because that's where the most damage happens.
Treat per-language standards as the only acceptable place for language conventions. Global memory is for identity and durable workflow rules — nothing else. Anything language-specific in global memory will leak into the wrong repo eventually.

The meta-lesson sits above all four: AI coding is now a platform-engineering problem, not a prompting problem. The teams getting compounding value aren't the ones with clever prompts — they're the ones who built the boring infrastructure around the model: enforcement, review gates, cost visibility, and blast-radius-appropriate permissions. Which is to say: the same discipline production systems have always needed.

I help engineering teams adopt AI coding agents safely — with the cost controls, IaC safety gates, and CI discipline of someone who ran production systems for 25 years. If your team is somewhere in month zero of this and would rather start at month 3.5: brucejacksonconsulting.com.