AI coding tools are now standard, but choosing among them — and proving they help — requires honest evaluation. The public benchmarks and your team’s actual experience often differ. This post is the working playbook.
Public benchmarks
| Benchmark | What it measures |
|---|---|
| SWE-bench / SWE-bench Verified | Solving GitHub issues from real OSS repos |
| HumanEval / MBPP | Function-level coding (basic) |
| LiveCodeBench | Competitive programming style |
| TerminalBench | Terminal task completion |
| AiderBench | Multi-file edits |
SWE-bench is closest to “real engineering tasks” but still narrow. Use it as a relative ranking of tools, not as a measure of absolute productivity.
Why public benchmarks mislead
- Tasks are well-specified: real engineering is “build me X for our system.”
- No tribal knowledge: real codebases have undocumented conventions.
- Single-shot: real workflow is iterative.
- Public test data: tasks may have leaked into training sets, which inflates scores.
A tool that scores 70% on SWE-bench might succeed on only 30% of tasks in your gnarly internal codebase.
Internal benchmarks
Build one. Sample 30-50 real tasks from your tracker:
- Bug fixes (with the actual issue text).
- Small features.
- Refactors.
- Test additions.
Run candidate tools. Score:
- Pass / fail: did the PR pass tests?
- Quality: would you accept the PR?
- Effort: human time spent prompting / reviewing.
Have your own team do the scoring. Repeat quarterly as tools evolve.
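The three scores above can be rolled up per tool with a few lines of Python. This is a minimal sketch, not a prescribed format: the `TaskResult` fields, the 1-5 quality scale, and the sample rows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    tool: str
    task_id: str
    tests_passed: bool     # pass / fail
    quality: int           # assumed reviewer scale: 1 (reject) to 5 (merge as-is)
    effort_minutes: float  # human time spent prompting and reviewing


def summarize(results: list[TaskResult]) -> dict[str, dict[str, float]]:
    """Group results by tool and average each of the three scores."""
    by_tool: dict[str, list[TaskResult]] = {}
    for r in results:
        by_tool.setdefault(r.tool, []).append(r)
    return {
        tool: {
            "pass_rate": mean(1.0 if r.tests_passed else 0.0 for r in rows),
            "avg_quality": mean(r.quality for r in rows),
            "avg_effort_minutes": mean(r.effort_minutes for r in rows),
        }
        for tool, rows in by_tool.items()
    }


# Made-up sample rows, only to show the shape of the data.
results = [
    TaskResult("tool_a", "BUG-1", True, 4, 12.0),
    TaskResult("tool_a", "FEAT-2", False, 2, 30.0),
    TaskResult("tool_b", "BUG-1", True, 3, 20.0),
    TaskResult("tool_b", "FEAT-2", True, 4, 25.0),
]
summary = summarize(results)
```

Keeping the three numbers separate, rather than folding them into one composite score, makes it easier to see when a tool passes tests but costs a lot of review time.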
Productivity metrics
What to track:
- Time-to-PR (issue created → PR opened).
- PR cycle time (opened → merged).
- PR throughput per dev per week.
- Developer-reported satisfaction (survey quarterly).
- Bug introduction rate (regressions per merged PR).
What NOT to track:
- Lines of code added — gameable; AI tools produce more lines, not better code.
- Commits per day — same.
- % AI-generated code — meaningless; quality matters, not provenance.
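Time-to-PR and PR cycle time are both just differences between timestamps. A rough sketch, assuming you can export one record per merged PR with ISO-8601 timestamps; the field names and sample data here are hypothetical, and using the median instead of the mean is my choice so one stuck PR doesn't dominate.

```python
from datetime import datetime
from statistics import median

# Hypothetical export: one dict per merged PR.
prs = [
    {"issue_created": "2026-01-05T09:00:00",
     "pr_opened": "2026-01-06T09:00:00",
     "pr_merged": "2026-01-06T15:00:00"},
    {"issue_created": "2026-01-07T10:00:00",
     "pr_opened": "2026-01-07T14:00:00",
     "pr_merged": "2026-01-08T14:00:00"},
    {"issue_created": "2026-01-08T08:00:00",
     "pr_opened": "2026-01-08T20:00:00",
     "pr_merged": "2026-01-09T06:00:00"},
]


def hours_between(start: str, end: str) -> float:
    """Elapsed hours between two ISO-8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


# Time-to-PR: issue created -> PR opened.
time_to_pr = [hours_between(p["issue_created"], p["pr_opened"]) for p in prs]
# PR cycle time: PR opened -> PR merged.
cycle_time = [hours_between(p["pr_opened"], p["pr_merged"]) for p in prs]

median_time_to_pr = median(time_to_pr)
median_cycle_time = median(cycle_time)
```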
A/B with cohorts
Cohort A (10 devs): Cursor + Claude Code.
Cohort B (10 devs): control (existing tooling).
Track productivity metrics over 3 months.
Survey both cohorts.
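Three months because of the Hawthorne effect: people work differently when they know they are being observed. Once the novelty wears off, the real signal emerges.
In practice, most teams simply adopt a tool rather than run an A/B test. For high-stakes decisions, run the A/B.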
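With only 10 devs per cohort, a difference in averages can easily be noise. One way to check is a permutation test, sketched below. This is one reasonable choice of test, not the only one, and the per-dev cycle times are invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    Repeatedly shuffles the pooled values into two groups of the
    original sizes and counts how often the shuffled difference is
    at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_resamples


# Invented data: median PR cycle time in hours, one value per dev.
cohort_a = [18, 22, 15, 30, 19, 24, 17, 21, 26, 20]  # with AI tools
cohort_b = [25, 31, 22, 35, 28, 27, 24, 33, 29, 26]  # control
p_value = permutation_p_value(cohort_a, cohort_b)
```

A small `p_value` suggests the gap between cohorts is unlikely to come from random assignment alone. It says nothing about whether the cohorts were comparable to begin with, so assign devs randomly where you can.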
What seniors actually use
- Claude Code for repo-spanning refactors.
- Cursor for IDE-integrated single-file work.
- Aider / open-source alternatives for cost-control or self-hosted.
- Copilot for autocomplete (still strong here).
By 2026 most senior engineers use 2-3 in combination.
See Agentic Coding 2026.
Quality signals
- Match repo conventions: imports, naming, idioms.
- Realistic tests: not just happy path.
- Error handling: failure paths are covered, not just the success path.
- No suspicious dependencies added: agents sometimes add packages unnecessarily.
- Diff readability: small, focused changes vs sprawling.
Code review remains essential. AI tools shift the bottleneck from typing to reviewing.
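The dependency check from the list above is easy to automate. A minimal sketch, assuming a Python project that pins dependencies in a `requirements.txt`; the parsing is deliberately rough and the helper names are my own.

```python
import re


def package_names(requirements_text: str) -> set[str]:
    """Extract package names from requirements.txt-style text."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The name is the leading run of name characters, which ends
        # at the first version specifier, extra, or marker.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            # Rough normalization: lowercase, underscores as hyphens.
            names.add(match.group(0).lower().replace("_", "-"))
    return names


def added_dependencies(before: str, after: str) -> set[str]:
    """Packages present after the change but not before it."""
    return package_names(after) - package_names(before)


before = "requests==2.31.0\npydantic>=2\n"
after = "requests==2.32.0\npydantic>=2\nleft-pad-py==0.1  # added by the agent\n"
new_packages = added_dependencies(before, after)  # {'left-pad-py'}
```

Version bumps are ignored on purpose. The point is to surface brand-new packages so a reviewer can ask whether each one is needed.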
Cost-effectiveness
Tool cost: $50/dev/month.
Senior dev cost: $200/hour.
Time saved required: 15 minutes per dev per month.
Trivial to clear if the tool is decent. Worry about quality, not cost.
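The break-even arithmetic behind those numbers, as a tiny function you can rerun with your own figures. The function name is mine; the inputs are the ones quoted above.

```python
def break_even_minutes(tool_cost_per_month: float, dev_cost_per_hour: float) -> float:
    """Minutes a dev must save per month for the tool to pay for itself."""
    return tool_cost_per_month / dev_cost_per_hour * 60


minutes = break_even_minutes(tool_cost_per_month=50, dev_cost_per_hour=200)
print(minutes)  # 15.0
```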
Failure modes to watch
- Confident hallucination: code looks right but doesn’t work, and the reviewer trusts it anyway.
- Pattern flattening: agent uses generic patterns; ignores your codebase’s idioms.
- Test gaming: tests “pass” by mocking the very behavior they should test.
- Subtle bugs: race conditions, off-by-one errors, null handling. AI tools get these right less reliably.
- Refactoring gone wrong: large diffs that touch unrelated areas.
Mitigations: stricter PR review, focused prompts, specific scope per task.
Domain-heavy areas
AI tools struggle most with:
- Concurrency / async — subtle bugs.
- Distributed systems — global reasoning.
- Performance-sensitive code — optimization decisions.
- Domain logic with nuance — accounting, security, payments.
Don’t fully delegate these. AI tools assist; senior engineers own.
Eval framework
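```python
from dataclasses import dataclass


@dataclass
class CodingTask:
    description: str
    repo_state: str              # commit SHA to check out before the attempt
    success_criteria: list[str]  # tests, lint, manual review
    expected_files: list[str]    # rough scope hint


# tool.attempt, run_tests, run_lint, files_touched, and human_score are
# placeholders to implement for your own setup; files_touched must return a set.
async def eval_tool(tool, task: CodingTask) -> dict:
    branch = await tool.attempt(task)  # the tool works the task, returns a branch
    return {
        "tests_pass": await run_tests(branch),
        "lint_pass": await run_lint(branch),
        # Subset check: the tool stayed within the expected files.
        "files_in_scope": files_touched(branch) <= set(task.expected_files),
        "review_quality": await human_score(branch),
    }
```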
Run this on each candidate tool. Quarterly.
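Running the eval across several tools is a small loop on top. Here is a sketch of one way to drive it with `asyncio`. `run_matrix` and `fake_evaluate` are hypothetical names, and `fake_evaluate` is only a stub standing in for a real evaluation function.

```python
import asyncio


async def run_matrix(tools, tasks, evaluate):
    """Evaluate every (tool, task) pair and return {tool: [result, ...]}.

    Tools run concurrently; tasks run sequentially within each tool.
    """
    async def run_tool(tool):
        return [await evaluate(tool, task) for task in tasks]

    per_tool = await asyncio.gather(*(run_tool(tool) for tool in tools))
    return dict(zip(tools, per_tool))


async def fake_evaluate(tool, task):
    # Stub for illustration: pretends that only tool_a passes its tests.
    await asyncio.sleep(0)
    return {"task": task, "tests_pass": tool == "tool_a"}


results = asyncio.run(
    run_matrix(["tool_a", "tool_b"], ["BUG-1", "FEAT-2"], fake_evaluate)
)
```

Tools are keyed by name here so the result is a plain dict. If your tool objects are not hashable, pass names and look the objects up inside the evaluation function.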
Vendor risk
By 2026 the AI coding space includes:
- Anthropic (Claude Code).
- Cursor.
- GitHub (Copilot, Workspace).
- OpenAI (Codex).
- Continue.dev (open-source).
- Many others.
Don’t lock in deep. Standardize on patterns (CLAUDE.md, prompts, eval set) that work across tools.
Common mistakes
1. Evaluating once, never again
Tools improve monthly. Evaluate quarterly.
2. Prompt-engineered eval
You wrote prompts that favor one tool. Use real tasks instead.
3. Single benchmark obsession
“Highest SWE-bench wins.” Different tools shine in different real tasks.
4. No PR review changes
Keeping the same review process as before ignores that AI-generated volume changes the workload. Adjust the review process.
5. Productivity claims without data
“Claude Code makes us 2x faster.” Maybe. Measure.
What I’d ship today
For evaluating AI coding tools:
- Internal benchmark of 30-50 real tasks.
- Quarterly tool eval using it.
- Productivity tracking (cycle time, satisfaction).
- CLAUDE.md for codebase conventions.
- Standard PR review checklist.
- Boundaries: what AI tools don’t touch.
Read this next
If you want my AI coding eval framework + CLAUDE.md template, it’s at rajpoot.dev.