Code Review Prompts That Catch Real Bugs (Not Style Nits)
Most AI code review surfaces style issues humans can ignore and misses real bugs. Six prompts I use that flag what humans actually want flagged — race conditions, auth holes, schema migrations, the dangerous stuff.
Default AI code review is bad. It surfaces "consider adding a comment here" while missing the off-by-one that takes prod down. The problem is the prompts are too broad.
These six prompts are narrow. Each one looks for a specific class of bug. Run them in parallel and you get a useful review.
1. The race condition scanner
```
Scan this diff for race conditions and concurrency bugs.

Diff: {DIFF}

Specifically look for:
- Shared state mutated without locks (in any language)
- Async functions that read-then-write the same data
- Database operations that should be in transactions but aren't
- Cache invalidation that races with cache reads
- Multiple processes/workers operating on the same row without locking

For each issue found, return:
- File and line range
- The specific concurrency pattern that's problematic
- The smallest fix
- Severity (1-3)

Ignore: style, naming, comments, formatting. Only flag concurrency.
```
The "only flag concurrency" instruction is the entire trick. Without it, the AI surfaces everything and the real concurrency issues get lost.
2. The auth/authz scanner
```
Scan this diff for authentication and authorization bugs.

Diff: {DIFF}

Specifically look for:
- API routes added without auth middleware
- Permission checks that reference the wrong user (current user vs target user)
- Role checks that allow privilege escalation
- Hardcoded credentials, API keys, or secrets
- JWT or session handling that trusts client input
- Row-level access that doesn't check ownership
- File access without permission verification

For each issue found, return:
- File and line range
- The specific auth/authz mistake
- An exploit description if applicable
- Severity (1-3, with 3 being "could be exploited today")

Ignore: style, naming, comments, formatting.
```
The exploit description is the most useful field. It forces the AI to articulate why this matters in concrete terms.
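Here's a hedged sketch of the "wrong user" mistake from the second bullet, in plain Python (the User type and delete helper are stand-ins I made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class User:
    id: int
    is_admin: bool

def delete_user_account(current_user: User, target_user_id: int) -> None:
    # BUG: the ownership check compares current_user to itself, so it is
    # always true and any logged-in user can delete any other account.
    if current_user.id == current_user.id or current_user.is_admin:
        db_delete_user(target_user_id)
    # Correct: if current_user.id == target_user_id or current_user.is_admin

def db_delete_user(user_id: int) -> None:
    """Stand-in for the real database delete."""
    print(f"deleted user {user_id}")
```

The exploit description for this one writes itself: any authenticated user can delete any account by passing a different target_user_id.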
3. The migration safety scanner
```
Scan this diff for database migration risks.

Diff: {DIFF}
Database type: {POSTGRES/MYSQL/SQLITE}

Specifically look for:
- ALTER TABLE on large tables without IF NOT EXISTS or zero-downtime patterns
- Columns being dropped that may have active code references
- Index additions that will lock the table on a large dataset
- Foreign key changes that may break referential integrity
- Type changes that might fail on existing data
- ENUMs being altered in unsafe ways
- Indexes being created without CONCURRENTLY
- Default values added that will require a table rewrite

For each issue found, return:
- File and line range
- The specific migration risk
- The downtime estimate or table-lock risk
- The safer rewrite

Ignore: style, naming, comments.
```
Migration safety is the lesson junior engineers learn the hard way. This prompt catches those mistakes before they reach deploy.
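For the index items specifically, the "safer rewrite" the prompt asks for usually looks like this on Postgres (shown as raw SQL strings in Python; the table and index names are made up):

```python
# Unsafe: a plain CREATE INDEX blocks writes to the table for the whole
# build, which on a large orders table can mean minutes of queued writes.
UNSAFE = "CREATE INDEX idx_orders_customer_id ON orders (customer_id);"

# Safer (Postgres): CONCURRENTLY builds the index without blocking writes.
# Caveat: it can't run inside a transaction block, so the migration tool
# has to be told not to wrap it in one.
SAFE = "CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);"
```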
4. The error handling scanner
```
Scan this diff for error handling problems.

Diff: {DIFF}

Specifically look for:
- Try/catch blocks that swallow errors silently
- Async functions without error handling
- Errors logged but not surfaced
- Generic catch blocks that catch too broadly
- Errors thrown without context
- Failure paths that should retry but don't
- Failure paths that retry when they shouldn't (non-idempotent operations)

For each issue found, return:
- File and line range
- The specific error handling problem
- The cost if this happens in production
- The recommended fix

Ignore: style, naming, comments, formatting.
```
The "cost in production" framing makes this useful. Without it the AI just lists every try/catch.
5. The data leak scanner
```
Scan this diff for data leak risks.

Diff: {DIFF}

Specifically look for:
- Logging of sensitive data (passwords, tokens, PII, financial info)
- Error responses that include stack traces or internal state
- Email/Slack/SMS sends that include unredacted sensitive info
- Returns from API that include more data than the caller should see
- Caching of sensitive responses
- Test fixtures or seed data that include real PII

For each issue found, return:
- File and line range
- The specific leak risk
- What sensitive data is at risk
- The minimal fix

Severity 3 if real PII or financial data is involved. Severity 2 if internal state could be exfiltrated. Severity 1 otherwise.
```
Data leaks are the bugs that turn into legal calls. Worth a dedicated prompt.
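To illustrate the first bullet, sensitive values in logs, here's a minimal Python sketch (the field names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def record_login_bad(user_id: int, password: str, session_token: str) -> None:
    # Leak: the password and token now live in log storage, which is
    # usually far more widely readable than the user table.
    logger.info("login user=%s password=%s token=%s",
                user_id, password, session_token)

def record_login_better(user_id: int, session_token: str) -> None:
    # Log an identifier plus a redacted token prefix, never the secret.
    logger.info("login user=%s token=%s...", user_id, session_token[:4])
```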
6. The performance regression scanner
```
Scan this diff for performance regressions.

Diff: {DIFF}

Specifically look for:
- N+1 queries (loop containing a database call)
- Missing indexes for new query patterns
- Unbounded loops over external API calls
- Memory-loading-all-rows patterns instead of streaming/pagination
- Synchronous work that could be async
- New dependencies that add significant bundle size to client code
- Removed indexes or caches that may have been load-bearing

For each issue found, return:
- File and line range
- The specific performance issue
- The expected impact at scale
- The fix

Don't surface micro-optimizations or "this could be slightly faster" unless the impact is significant at expected scale.
```
The "don't surface micro" instruction is essential. Otherwise you get 40 suggestions to use array.map instead of for loops.
My setup
I run all 6 prompts in parallel on every PR via a GitHub Action. Results aggregate into a single comment.
Each prompt uses Claude Sonnet 4.6. Cost per PR: about $0.30. Time: 8-15 seconds per prompt; run in parallel, that's roughly 15 seconds of total wall time.
The PR comment has 6 sections, one per scanner. If a scanner found nothing, the section says "no issues found." If a scanner found issues, they're listed sorted by severity.
Reviewers read the comment first, then the diff. The job shifts from "human reads everything carefully" to "human verifies the AI's flags, then reads the diff for anything else."
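The GitHub Action itself is mostly glue. A sketch of how the fan-out could be wired with the Anthropic Python SDK's async client follows; the scanner dict, model id, token limit, and comment layout are stand-ins, not the exact setup:

```python
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def run_scanner(name: str, template: str, diff: str) -> tuple[str, str]:
    # Each template is one of the six prompts above, containing {DIFF}.
    response = await client.messages.create(
        model="claude-sonnet-4-5",  # stand-in: use whichever Sonnet release you run
        max_tokens=2000,
        messages=[{"role": "user", "content": template.replace("{DIFF}", diff)}],
    )
    return name, response.content[0].text

async def review(diff: str, scanners: dict[str, str]) -> str:
    # All six scanners run concurrently, so wall time is the slowest one,
    # not the sum.
    results = await asyncio.gather(
        *(run_scanner(name, template, diff) for name, template in scanners.items())
    )
    # One PR comment, one section per scanner.
    return "\n\n".join(f"### {name}\n\n{body}" for name, body in results)
```

In the Action, the string this returns is what gets posted (or updated) as the single PR comment via the GitHub API.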
What doesn't work
Running all 6 as one big prompt. The output becomes unfocused. The AI tries to be balanced across categories and surfaces 3 from each. With 6 prompts in parallel, each one is focused and aggressive on its own category.
Running them with no severity scoring. Without severity, every issue looks equal. With severity, reviewers can scan to the 3s first.
Running them without the "ignore style" lines. The AI defaults to noticing style. The negative instruction is what keeps it on the real bugs.
What this isn't
This isn't a replacement for human code review. Humans still need to read the diff for architecture, taste, business logic correctness, and the things AI doesn't see.
This IS a safety net that catches the bugs humans miss. Race conditions, auth holes, migration risks — these are the bugs that take down prod. Catching them at PR time is worth the $0.30.
What I'd build first
The auth/authz scanner. Highest-risk category. If you're shipping authenticated software, this is the first prompt to wire up.
Migration safety second. Performance regressions third. The others as you grow.
The full set takes a weekend to wire up. It pays for itself the first time it catches an auth hole.