METR recently released a paper, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity”. It was a randomized controlled trial in which developers were assigned some tasks on which they were allowed to use AI, and others on which they were not. The surprising headline result was that developers using AI took on average 19% longer to complete their tasks! (N = 246 tasks, 95% confidence interval ≈ [-40%, -2%])
I was one of the developers participating in this study, using jsdom as the project in question. This essay gives some more detail on my experience, which might be helpful for those hoping for insight into the results.
What I worked on
The jsdom project is an attempt at writing most of a web browser engine in JavaScript. It has some significant limitations and lots of gaps, but we get pretty far. Many people use it for automated testing and web scraping. It has just over 1 million lines of code in the main repository, with some other supporting repositories. A large part of jsdom development is trying to reproduce web specifications in code, and pass the corresponding web platform tests.
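For those unfamiliar, typical usage looks something like this (a minimal sketch using jsdom’s main entry point; the HTML and names here are just for illustration):

```js
const { JSDOM } = require("jsdom");

// Parse some HTML into a simulated browser environment...
const dom = new JSDOM(`<!DOCTYPE html><p id="greeting">Hello, world!</p>`);

// ...then use standard DOM APIs against it, as you would for testing or scraping.
const { document } = dom.window;
console.log(document.getElementById("greeting").textContent); // "Hello, world!"
```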
I inherited the project in 2012, and these days I am the sole active maintainer. My main goal in recent years has been to respond to pull requests from community contributors. The METR study gave me an opportunity to put those aside and write my own code to tackle the backlog of bug reports, feature requests, infrastructure issues, and test coverage deficits.
I was asked to assemble possible work items ahead of time for the study, of estimated size ≤2 hours. I ended up with 19 such work items. Each of them generated at least one pull request, as well as an “implementation report” where I wrote up what it was like working on that task, with a special focus on what it was like working with AI or not being allowed to use AI.
The full list of issues, pull requests, and implementation reports:

| Issue | Task description | PR | AI? | Report |
|---|---|---|---|---|
| Issue | Update our URL parser for recent changes to the Unicode UTS46 standard and its URL Standard integration | PR 1, PR 2, PR 3 | ✗ | Report |
| Issue | Small URL parser change to follow the latest spec changes | PR | ✗ | Report |
| Issue | Another small URL parser change to follow the latest spec changes | PR | ✓ | Report |
| Issue | Get code coverage of our URL parser to 100% | PR 1, PR 2, PR 3 | ✓ | Report |
| Issue | Push a previous maintainer's draft PR for some basic SVG element support over the finish line (split into two chunks) | PR | ✓ ✗ | Report 1, Report 2 |
| Issue | Investigate why our test suite was sometimes taking >70 seconds for a single test on CI | PR 1, PR 2, PR 3, PR 4, PR 5 | ✓ | Report |
| Issue | Add linting to our locally-written new web platform tests | PR | ✓ | Report |
| Issue | Allow writing failing new web platform tests, to capture bugs we should fix in the future | PR | ✗ | Report |
| 24 issues | Add test coverage for known bugs related to CSS selectors (some of which had been fixed, some of which were fixed by a new selector engine later) | PR | ✓ | Report |
| 11 issues | Add test coverage for other known bugs, unrelated to CSS selectors (most of which had been fixed in the past or were fixed soon after the test appeared) | PR 1, PR 2 | ✓ | Report |
| Issue | Add an option to disable the processing of CSS, for speed | PR | ✗ | Report |
| 8 issues | Implement certain event classes or properties, even if the related spec was not fully supported | PR | ✓ | Report |
| Issue | Implement indexed access on form elements, like `formElement[0]` giving the 0th form control in that form | PR | ✗ | Report |
| Issue | Replace our dependency on the form-data npm package with our own implementation | PR | ✗ | Report |
| Issue | Fix ElementInternals accessibility getters/setters being totally broken | PR | ✗ | Report |
| Issue | Overhaul our system for reporting errors to the developer | PR | ✓ | Report |
| Issue | Use the HTML Standard's user agent stylesheet instead of an old copy of Chromium's | PR | ✗ | Report |
| Issue | Fix an edge case using `Object.defineProperty()` on `HTMLSelectElement` instances | PR 1, PR 2 | ✗ | Report |
The issues are listed here in the order I worked on them. Total: 9 AI-allowed, 10 no-AI-allowed.
I did this work over the course of about a month, from 2025-03-15 through 2025-04-20, on weekends. The total time spent, measured by screen recordings (for both types of tasks), was 31.25 hours. I was compensated at $150/hour for my participation.
The screen recordings are worth calling out. Because of them, I was guaranteed to be “on” while working on these issues: I didn’t tab away or get distracted easily, because someone was always watching what I was doing.
How was the slowdown measured?
It’s important to note that randomized controlled trials aren’t magic. Just as we can’t test a drug and a placebo on the same patient, this study didn’t somehow have me working on the exact same tasks with vs. without AI. Instead, we try to average over a large-enough number of tasks so that, under reasonable assumptions about the underlying mechanisms, we can derive estimates and error bounds for the effect of the treatment.
Appendix D of the paper goes into more detail. They use a log-linear model, which is a reasonable model for task completion time and justified by the log-normal distribution of task times observed in the study (and elsewhere). The model is given as input the initial, pre-work time estimate we provided as a measure of task difficulty, as well as the treatment flag (0 for no AI, 1 for AI-allowed) and a random noise term. Various checks against the actual data confirm that this model makes sense: e.g., the model errors were not skewed systematically in any direction, and specializing the model to be different per-developer does not change the outcome much. The end result is that, with enough data, they are able to produce estimates for the slowdown, as well as the 95% confidence intervals.
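In rough notation (mine, not the paper’s exact specification), the model is a regression of the form

log(Tᵢ) = α + β · AIᵢ + γ · log(Êᵢ) + εᵢ

where Tᵢ is the actual time spent on task i, Êᵢ is my pre-work estimate, AIᵢ is the 0/1 treatment flag, and εᵢ is the noise term. The headline slowdown figure then falls out as roughly e^β − 1.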
My personal experience made me wonder: did they just get unlucky? For example, from the AI-allowed bucket, my performance optimization task ended up taking 4 hours 7 minutes, instead of my estimated 30 minutes; my write lots of tests task took 4 hours 20 minutes, instead of my estimated 1 hour. Maybe those tasks would have taken even longer without AI!
But this isn’t really the right way of thinking about it. There were many misestimates, in both directions, for both categories of task: e.g. from the no-AI-allowed bucket, this bugfix task took 6 minutes instead of the estimated 20; this infrastructure task took 90 minutes instead of the estimated 60. I think it’s better to trust the law of large numbers, and the power of well-structured statistical analysis, than to second-guess what might have happened in a different randomization setup. This is part of why the study’s authors emphasize that there is only good statistical power when you look at the results in aggregate.
My prior AI-coding experience
Prior to this study, I had not had significant experience with agentic coding workflows like Cursor’s agent mode.
A large part of this is due to my position on the Chrome team at Google, which means I am prohibited by policy from using most cutting-edge AI coding tools in my day job. Google employees are required to only ever use internally-developed Gemini-based tooling, not anything external like Cursor, Claude Code, or even GitHub Copilot. And the internal tooling that Google manages to develop always targets the private “google3” codebase first, not the Chromium open-source codebase where I work.
(With the release of Gemini CLI in late June 2025, we finally had something usable. But I gave it a try for a solid week and kept running into basic problems that other tools have already solved, like out of memory errors due to inefficient file-searching, or a file-patching tool that couldn’t handle whitespace.)
So prior to the METR study, I had only been able to spend weekend side-project time on AI-assisted coding. And during that time, I mainly used GitHub Copilot’s tab completion, plus the web interfaces for ChatGPT and Claude when I wanted to generate new files or functions completely from scratch.
That said, I’m skeptical of those who claim that this lack of experience was a major contributor to the slowdown. Agent mode is just not that hard to learn; the short training that METR provided, plus some pre-reading, felt like plenty to me. If you find that AI speeds you up instead of slowing you down, I think the difference more likely comes from the other factors the study’s authors highlighted: small new codebases vs. large existing codebases; less-experienced developers vs. project owners and experts; and low AI reliability. The writeup below should give you more of a flavor of why I believe this.
My experience with AI during the study
It’s worth remembering the state of AI tooling in March 2025. Claude Code had just come out in research preview on 2025-02-24. (General release wasn’t until 2025-05-22, after the study, and first-class Windows support didn’t appear until a couple of days ago.) Cursor’s agent mode only became the default on 2025-02-19. Delegation-centric tools like OpenAI Codex or Google Jules had not been released yet. Going forward, I think the best hope for efficiency gains will come from commanding an army of agents in parallel, but the METR study was not set up to measure such workflows: we worked on one task at a time.
The majority of the time I worked on AI-allowed tasks, it was with Cursor’s agent mode, with the model set to one of “auto” (Claude Sonnet 3.5, I believe?), Claude Sonnet 3.7 (thinking mode), or gemini-2.5-pro-exp-03-25. I never had the patience to use the “MAX” modes. I made one attempt to use the Claude Code preview, but gave up after wasting a decent amount of time because the Windows Subsystem for Linux networking bridge was preventing it from reaching my integration test server. I also went back to web chat interfaces a few times, e.g. to learn about the current state of Node.js profiling tools, or to ask o3-mini-high to microoptimize some specific string manipulation code.
I was most surprised at how bad the Cursor agent was at fitting into the existing codebase’s style. This was very evident when asking it to churn out tons of tests. Despite many examples in sibling directories, the models did not pick up on simple things like: include a link to the fixed issue in the test header; don’t duplicate the test name in the `<title>` and the `test()` function; reproduce the user’s reported bug exactly instead of imagining novel things to test; etc. And of course, the stupid excessive comments. On a greenfield or throwaway project, these things don’t matter much. We can just let the models’ preferences rule the day. But when fitting into a 1m+ LOC codebase, consistency is important. This meant that I had to continually check their work, and refine my prompt so that the next attempt would avoid the same pitfalls.
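For concreteness, here is a hypothetical sketch of the kind of testharness.js-style test I was trying to coax out of the agent. The bug and the assertions are invented for illustration; the real tests link to the actual fixed issue in the header comment:

```html
<!DOCTYPE html>
<meta charset="utf-8">
<title>HTMLSelectElement: selectedIndex after removing the selected option</title>
<!-- The header comment links to the fixed issue, instead of restating it below. -->
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
<script>
"use strict";

test(() => {
  // Reproduce the reporter's exact scenario, rather than inventing new ones.
  const select = document.createElement("select");
  select.append(new Option("a"), new Option("b"));
  select.selectedIndex = 1;
  select.remove(1);
  assert_equals(select.selectedIndex, 0);
}, "removing the selected option falls back to the first remaining option");
</script>
```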
Eventually, for some of these repetitive test-writing tasks, I refined my prompt enough to get into a good flow, where they produced three or four tests in a row with no changes needed. (Even then, they kept failing to use Git for some reason, so I had to interrupt to commit each change.) But they would inevitably go off the rails, maybe due to context length overflow, usually in quite bizarre ways. In such cases restarting the session and copying my carefully-crafted prompt back in would get us back on track, but it wasted time.
My second biggest surprise was how bad the models are at implementing web specifications. This was most on display when I was implementing various event classes. Web specifications are basically computer code, written in a strange formal dialect of English. Translating them into actual programming languages should be trivial for language models. But the few times I tried to prompt the model to implement just by reading the specification did not go well. I can list a couple of contributing factors here:
- The tool use was still sub-par. For example, web specifications are written as HTML, so simply pasting in a link like this one is not enough to get the relevant part of the specification into the context window, in a format like Markdown which the models are good at understanding. (This seems solvable if I code up my own tool; see the sketch after this list.)
- The models have strong, but outdated or wrong, priors for how web specifications are supposed to work. That is, old versions of these specifications were already in their training data, and then got lossily compressed into the weights. So instead of implementing properly, by reading the specification text and then translating it into code, they seem to want to write the code off-the-cuff based on their existing priors.
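To illustrate the first point, here is a rough sketch of the kind of tool I have in mind: fetch a spec section by its fragment ID and hand the model Markdown instead of a raw link. This is hypothetical and untested; it leans on jsdom itself plus the turndown npm package for HTML-to-Markdown conversion, and real specs nest their sections in messier ways than this assumes:

```js
const { JSDOM } = require("jsdom");
const TurndownService = require("turndown");

// Hypothetical "give the model this spec section as Markdown" helper.
async function specSectionAsMarkdown(url) {
  const { hash } = new URL(url);
  const dom = await JSDOM.fromURL(url);
  const heading = dom.window.document.getElementById(hash.slice(1));
  if (!heading) {
    throw new Error(`No element with ID "${hash.slice(1)}" at ${url}`);
  }

  // Collect the heading plus its following siblings, stopping at the next
  // heading of the same or higher level. (A simplification; some specs put
  // IDs on wrapper elements or <dfn>s instead of headings.)
  const pieces = [heading.outerHTML];
  for (let el = heading.nextElementSibling; el; el = el.nextElementSibling) {
    if (/^H[1-6]$/.test(el.tagName) && el.tagName <= heading.tagName) {
      break;
    }
    pieces.push(el.outerHTML);
  }

  return new TurndownService().turndown(pieces.join("\n"));
}
```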
The second tendency, the outdated priors, was most hilariously on display when I got into an argument with Gemini 2.5 Pro Preview about how it should not make up a new constant `CSSRule.LAYER_STATEMENT_RULE`. Old CSS rules, like `@charset`, got such named constants (see the spec for `CSSRule.CHARSET_RULE`). New rules, like `@layer`, do not, since such numeric constants are a holdover from when people were designing web APIs as if they were Java APIs. But Gemini really, really wanted to follow the pattern it knew from its training data, and refused to implement CSS layers without also adding a `CSSRule.LAYER_STATEMENT_RULE` constant with the totally-hallucinated value of `16`. I recommend reading its polite-but-firm sophistry about how, even if the spec didn’t contain these constants, there’s some other “combined, effective standard” that includes this constant.
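(You can verify this in any browser console: the legacy rule types have numeric constants, while newer ones deliberately do not.)

```js
// In a browser console:
CSSRule.CHARSET_RULE;                // 2 (a legacy constant that really is in the spec)
"LAYER_STATEMENT_RULE" in CSSRule;   // false (the constant, and its value 16, were hallucinated)
```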
My feelings on AI-assisted productivity
In retrospect, it’s not too surprising that AI was a drag on velocity, while subjectively feeling like a speedup. When I go through the implementation reports, and notice all the stumbling and missteps, that’s a lot of wasted time. Whereas, for the no-AI-allowed tasks, I just sat down, started the screen recording, and coded with no distractions on a codebase I knew well, on tasks I’d pre-judged to be relatively small.
Sometimes, tasks with AI felt more engaging than they would have been otherwise. This was especially the case for repetitive ones like writing lots of similar tests or classes. Making the tasks into an interactive game, where I try to get the agent to do all the work with minimal manual intervention, was more fun than churning out very similar code over and over. But I don’t think it was faster.
A big productivity drag was that these agents were still not smart enough, at least out of the box. I mentioned some of the specific pain points above, but others come up over and over in my implementation reports. They weren’t able to coordinate across multiple repositories. They needed careful review to avoid inelegant code. They got stuck in loops doing simple things like fixing linter errors or lexicographically sorting filenames. They couldn’t traverse directories to find a relevant-looking file in any reasonable amount of time.
These sorts of things are all fixable, with enough scaffolding. And I am eager for the companies working on them to drill into such problem cases and build out the necessary tools. But until then, I suspect attempts to pair-program with the AI in a large project like this will continue to need constant handholding and continuous awareness of the models’ limitations.
It’s also likely that by investing more time upfront, individual developers can wrangle today’s tools into more productive forms. I did not commit any Cursor Rules for jsdom, or write any custom MCP servers. My intuition was that paying the automation tax would not be worth the time. And I think that judgment was likely accurate, for these nine AI-allowed issues that I was trying to fit into a few weekends. But if I were able to use AI agents effectively in my day job, I think the balance would flip, and I could become one of those people from X obsessed with finding the best rules and tools.
The more promising approach, though, is abandoning the pair-programming mode in favor of the parallel-agents mode. These days, if I were shooting for maximum productivity on these sorts of issues, I would spend a lot of up-front time writing detailed issue descriptions, including specific implementation suggestions. Then I would run all nine of the AI-allowed tasks in parallel, using something like Claude Code or OpenAI Codex. If one of the agents got stuck in a loop on linter errors, or took thirty minutes to traverse the directory tree to find the right tests to enable, it wouldn’t matter, because I’d be busy reviewing the other agents’ code, cycling through the process of helping them all along until everything was done.
I still think large, existing open-source codebases with established patterns face a distinct challenge, compared to creating something from scratch where you can focus entirely on the quality of the end product. (Indeed, since the study, I’ve had a couple of occasions to create such mini-projects.) But through a combination of base model upgrades, improved scaffolding, and better training data, we’ll get there. Human beings’ time writing code is limited.