Making AI write code like I do
I'm reviewing a PR from a colleague who used Cursor. The code works... the tests pass... but half my review comments are things we agreed on months ago - use structlog not print, pytest not unittest, type hints on public functions. The AI wrote perfectly functional code that completely ignores how we do things.
So someone suggests an AI code reviewer. Except the bot doesn't know how we do things either, so now we get fifteen comments per PR about "consider adding docstrings" and "this variable name could be more descriptive" - generic noise that has nothing to do with our codebase. Junior devs panic. Senior devs ignore the bot entirely. And the people who copy-paste ChatGPT's suggestions into their review comments? Worse, cos there's a human face on the generic advice.
The AI review that was supposed to save time is creating MORE work, which is NEVER my aim.
The Actual Problem
It's not AI reviewing code. It's AI reviewing code generically.
Every team has their own version of "good." We use structlog because have you tried searching unstructured logs in production at 3am? We use pytest because unittest's MagicMock drove us mad with its side_effect and patch nonsense. We skip type checking on private functions cos mypy in CI handles that. None of this is in any AI's training data.
Standards live in the worst possible places. In people's heads. In a notion page nobody's updated since 2022. In that one Slack message from man-who-left-years-ago about why we don't use class-based views. New devs learn through correction - write it wrong, get told, fix it. That works at human pace. Doesn't work when AI is generating code in every PR.
So I tried treating our standards like a product instead of a document.
What I Built
Three layers.
A standards repo. Markdown files organised by language. Python standards, TypeScript standards, an AI behaviour policy. Versioned, PR-reviewed, owned like a codebase. Want to change a standard? Open a PR. Not a wiki someone wrote years ago and forgot about.
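A standards file can stay short and declarative. This is an invented fragment in the shape ours take, not our actual python.md:

```markdown
# Python standards

- Logging: use `structlog`, never `print`. Structured logs are searchable in production.
- Testing: `pytest` for new test files; follow existing conventions when adding to old ones.
- Types: type hints required on public functions; private functions are covered by mypy in CI.
```

Short, opinionated, and with the reason attached - the "why" is what stops the rule getting deleted by the next person who disagrees with it.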
An MCP server. Serves the standards to whatever AI tool your devs use. Our team uses everything - Claude Code, Cursor, Windsurf, a couple of people on more niche stuff like Aider. The MCP server doesn't care. Same standards, every tool. When someone asks their AI to write something, it already knows how we do things. "We use structlog" stops being a review comment and becomes context the AI has before it writes a line.
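The serving side is simple at heart: map a language to its standards files and return the text. Here's a minimal stdlib sketch of that lookup - the actual MCP plumbing (registering this as a tool via the SDK) is omitted, and `STANDARDS_DIR` and `get_standards` are hypothetical names, not our real code:

```python
from pathlib import Path

# Illustrative only: the real server wraps a function like this as an MCP tool
# so any client (Claude Code, Cursor, Windsurf, Aider) can call it.
STANDARDS_DIR = Path("standards")

def get_standards(language: str, standards_dir: Path = STANDARDS_DIR) -> str:
    """Return the team standards for a language, plus the AI behaviour policy."""
    sections = []
    # Language-specific file first, then the policy that applies to every tool.
    for name in (f"{language.lower()}.md", "ai-policy.md"):
        path = standards_dir / name
        if path.exists():
            sections.append(path.read_text())
    if not sections:
        return "No standards found for this language."
    return "\n\n".join(sections)
```

Because the server just reads the markdown files, a merged PR to the standards repo is immediately what every assistant sees.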
An automated reviewer. A Claude-powered GitHub Action that reviews every PR against those same standards. Detects the project language, posts one focused comment. Not fifteen comments about docstrings - one comment about things that actually matter to us.
Every review opens with a risk assessment - low, medium, high - based on blast radius, reversibility, whether it touches auth or migrations. That alone has been worth it because it gives the human reviewer something to triage on immediately. And it flags when docs need updating - your README, your CLAUDE.md - which is exactly the thing that falls through the cracks in human reviews. We check the code. We forget to check whether the docs still match.
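In our setup Claude makes that risk call, but the triage logic it's prompted with boils down to something you could sketch in a few lines. The path patterns here are assumptions for illustration, not our actual config:

```python
# Illustrative heuristic only - the real assessment is an LLM judgement,
# prompted with criteria roughly like these.
HIGH_RISK = ("auth", "migrations", "payments")

def assess_risk(changed_files: list[str]) -> str:
    """Rough low/medium/high triage from sensitive areas and blast radius."""
    if any(part in path for path in changed_files for part in HIGH_RISK):
        return "high"    # touches auth, migrations, or payments - hard to reverse
    if len(changed_files) > 10:
        return "medium"  # wide blast radius even if each change looks small
    return "low"
```

The point isn't precision - it's that the human reviewer opens the PR already knowing whether this is a skim or a sit-down.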
Repos can drop a .github/review-standards.md to customise - extra rules, sections to skip cos CI already covers them, high-risk areas to flag. The shared standards give you the baseline, the overrides make it relevant to that repo.
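An override file might look like this - an invented example, not a repo's real one:

```markdown
<!-- .github/review-standards.md - per-repo overrides -->
## Skip
- Type-hint checks: mypy already runs in this repo's CI.

## Extra rules
- All SQL goes through the query builder, never raw strings.

## High-risk areas
- `billing/` - anything touching this directory is automatically a high-risk review.
```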
The bit I'm most pleased with: repos call a shared reusable workflow from the standards repo, so the review logic and the standards live together. Change the standards, every future review across every repo picks it up. No copy-pasting prompts into twenty repos. No drift.
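The caller side stays tiny. A hypothetical consuming repo's workflow, assuming the reusable workflow lives at `.github/workflows/review.yml` in a standards repo called `our-org/standards`:

```yaml
# .github/workflows/ai-review.yml in a consuming repo (names illustrative)
name: AI review
on:
  pull_request:

jobs:
  review:
    # The review logic lives next to the standards, so a merge to the
    # standards repo updates every caller on their next PR.
    uses: our-org/standards/.github/workflows/review.yml@main
    secrets: inherit
```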
Does It Work?
PR comments became relevant - people actually read them. The bot went from "thing we dismiss" to "thing we read." Human reviewers stopped writing "use structlog not print" for the hundredth time and started looking at architecture and logic - the stuff I actually want human input on.
Writing the standards down also forced conversations we'd been avoiding. "Do we actually require type hints on private functions?" became a real discussion with a real outcome committed to a real file. Worth doing regardless of whether an AI ever reads it.
It's not magic though. The reviewer still occasionally says something daft. Standards go stale. And writing good standards is hard - vague standards produce vague reviews.
The Feedback Loop
Here's the interesting bit.
Early versions of our standards said we preferred pytest over unittest. Reasonable, right? Except the reviewer took that and ran with it - flagging every PR that touched a unittest file. "Consider migrating to pytest." On a two-line bug fix. In a file with 400 lines of unittest that nobody was rewriting. Devs were getting nagged about a migration that wasn't happening. So we changed the standard - pytest for new test files, follow existing conventions when adding to existing ones, and the reviewer should never flag it. The reviewer taught us our own standard was too blunt.
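The refined wording ended up as something like this - paraphrased, not the verbatim file:

```markdown
## Testing
- New test files use `pytest`.
- When adding to an existing `unittest` file, follow its conventions.
- Reviewer: do not suggest pytest migrations in unrelated PRs.
```

That last line matters: the standard now tells the reviewer what *not* to say, which turned out to be just as important as what to enforce.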
That's the pattern. The bot flags something too aggressively, that's a signal the standard needs refining. Teams skip a section consistently, maybe it's not worth enforcing. The system that enforces the standards is also generating the signal about whether they're any good.
Close the loop properly and it gets really interesting. The reviewer catches issues, the patterns inform the standards, better standards make the reviewer better, which means the AI assistants writing the code get better context too... and suddenly you don't have a document someone maintains. You have a living thing that learns what "good" means for your team.