Figure 1. Copilot Arena is a VS Code extension that collects human preferences of code directly from developers.
As model capabilities improve, large language models (LLMs) are increasingly integrated into user environments and workflows. In particular, software developers code with LLM-powered tools in integrated development environments such as VS Code, IntelliJ, or Eclipse. While these tools are increasingly used in practice, current LLM evaluations struggle to capture how users interact with them in real environments: they are often limited to short user studies, only consider simple programming tasks as opposed to real-world systems, or rely on web-based platforms removed from development environments.
To address these limitations, we introduce Copilot Arena, an app designed to evaluate LLMs in real-world settings by collecting preferences directly in a developer's actual workflow. Copilot Arena is a Visual Studio Code extension that provides developers with code completions, akin to the type of assistance offered by GitHub Copilot. To date, over 11,000 users have downloaded Copilot Arena, the tool has served over 100K completions, and we have gathered over 25,000 code completion battles. The battles form a live leaderboard on the LMArena website. Since its launch, Copilot Arena has also been used to evaluate two new code completion models prior to their release: a new Codestral model from Mistral AI and Mercury Coder from InceptionAI.
In this blog post, we discuss how we designed and deployed Copilot Arena. We also highlight how Copilot Arena provides new insights into developer code preferences.
Copilot Arena System Design
To collect user preferences, Copilot Arena presents a novel interface that shows users paired code completions from two different LLMs, which are determined by a sampling strategy that mitigates latency while preserving coverage across model comparisons. Additionally, we devise a prompting scheme that enables a diverse set of models to perform code completions with high fidelity. Figure 1 overviews this workflow. We review each component below.
User interface: Copilot Arena allows users to select between pairs of code completions from different LLMs. User choices allow us to better understand developer preferences between LLMs. To avoid interrupting user workflows, voting is designed to be seamless: users rely on keyboard shortcuts to quickly accept code completions.
Sampling model pairs: We explore a sampling strategy to minimize experienced latency. Since our interface shows two code completions together, the slowest completion determines the latency. We model each model's latency as a log-normal distribution and tune a temperature parameter to interpolate between a latency-optimized distribution and a uniform distribution, which decreases median experienced latency by 33% (from 1.61 to 1.07 seconds) compared to uniform sampling.
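To make the interpolation concrete, here is a minimal sketch of latency-aware pair sampling. The latency numbers, model names, and the choice to weight a pair by its slower model's fitted median latency are illustrative assumptions, not the deployed implementation; lower temperatures favor fast pairs and higher temperatures approach uniform sampling.

```python
import itertools
import math

import numpy as np

# Hypothetical logged latencies in seconds; in deployment these come from served completions.
latency_logs = {
    "model_a": [0.8, 1.1, 0.9, 1.3],
    "model_b": [2.0, 2.4, 1.8, 2.2],
    "model_c": [1.2, 1.0, 1.5, 1.1],
}

def lognormal_median(samples):
    """Median of a fitted log-normal: exp of the mean log-latency."""
    return math.exp(np.mean(np.log(samples)))

medians = {name: lognormal_median(obs) for name, obs in latency_logs.items()}

def sample_pair(temperature, rng=np.random.default_rng()):
    """Sample a model pair; a pair's latency cost is driven by its slower model."""
    pairs = list(itertools.combinations(medians, 2))
    costs = np.array([max(medians[a], medians[b]) for a, b in pairs])
    weights = np.exp(-costs / temperature)  # low T: latency-optimized; high T: ~uniform
    idx = rng.choice(len(pairs), p=weights / weights.sum())
    return pairs[idx]

print(sample_pair(temperature=0.5))   # favors fast pairs
print(sample_pair(temperature=50.0))  # close to uniform coverage
```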

Prompting for code completions: During development, models need to "fill in the middle" (FiM), where code must be generated based on both the current prefix and suffix. While some models, such as DeepSeek and Codestral, are designed to fill in the middle, many chat models are not and require additional prompting. To accomplish this, we let the model generate code snippets, which is a more natural format, and then post-process them into a FiM completion. Our approach is as follows: in addition to the same prompt templates above, the models are instructed to begin by re-outputting a portion of the prefix and, similarly, to end with a portion of the suffix. We then match portions of the output code against the input and delete the repeated code. This simple prompting trick allows chat models to perform code completions with a high success rate (Figure 2).
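The post-processing step boils down to trimming the overlap between the model's snippet and the surrounding file. Below is a minimal sketch of that idea with a hypothetical `trim_overlap` helper; the deployed matching is likely more robust to partial or fuzzy overlaps.

```python
def trim_overlap(snippet: str, prefix: str, suffix: str) -> str:
    """Drop any re-output prefix/suffix from a chat model's snippet, keeping only the middle."""
    # Longest suffix of `prefix` that the snippet starts with.
    start = 0
    for k in range(min(len(prefix), len(snippet)), 0, -1):
        if snippet.startswith(prefix[-k:]):
            start = k
            break
    # Longest prefix of `suffix` that the snippet ends with.
    end = len(snippet)
    for k in range(min(len(suffix), len(snippet) - start), 0, -1):
        if snippet.endswith(suffix[:k]):
            end = len(snippet) - k
            break
    return snippet[start:end]

# The model re-outputs part of the surrounding code around its completion.
prefix = "def add(a, b):\n    "
suffix = "\n    return total"
snippet = "def add(a, b):\n    total = a + b\n    return total"
print(repr(trim_overlap(snippet, prefix, suffix)))  # 'total = a + b'
```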
Deployment

We deploy Copilot Arena as a free extension available on the VS Code extension store. During deployment, we log user judgments and latency for model responses, along with the user's input and completion. Given the sensitive nature of programming, users can restrict our access to their data. Depending on privacy settings, we also collect the user's code context and model responses.
As is standard in other work on pairwise preference evaluation (e.g., Chatbot Arena), we apply a Bradley-Terry (BT) model to estimate the relative strengths of each model. We bootstrap the battles in the BT calculation to construct a 95% confidence interval for the rankings, which we use to create a leaderboard that ranks all models: each model's rank is determined by which other models' lower bounds fall below its upper bound. We host a live leaderboard of model rankings at lmarena.ai (Figure 3).
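For readers who want to reproduce the ranking step, here is a minimal sketch of a Bradley-Terry fit with bootstrapped confidence intervals. The battle records and the logistic-regression formulation are illustrative assumptions rather than our exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical battle records: (winner, loser) per user vote.
battles = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "B"), ("A", "B"), ("B", "C")]
models = sorted({m for pair in battles for m in pair})
col = {m: i for i, m in enumerate(models)}

def bt_scores(records):
    """Bradley-Terry via logistic regression on winner-minus-loser indicator features."""
    X = np.zeros((len(records), len(models)))
    for row, (winner, loser) in enumerate(records):
        X[row, col[winner]], X[row, col[loser]] = 1.0, -1.0
    y = np.ones(len(records))
    # Mirror each battle so both outcome classes are present for the fit.
    X, y = np.vstack([X, -X]), np.concatenate([y, np.zeros_like(y)])
    coef = LogisticRegression(fit_intercept=False, C=1.0).fit(X, y).coef_[0]
    return dict(zip(models, coef))

# Bootstrap the battles to get a 95% interval on each model's BT strength.
rng = np.random.default_rng(0)
boot = [bt_scores([battles[i] for i in rng.integers(len(battles), size=len(battles))])
        for _ in range(200)]
for m in models:
    vals = np.array([b[m] for b in boot])
    lo, hi = np.percentile(vals, [2.5, 97.5])
    print(f"{m}: median={np.median(vals):+.2f}, 95% CI=({lo:+.2f}, {hi:+.2f})")
```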
Findings

Comparison to prior datasets
We compare our leaderboard to existing evaluations, which include both live preference leaderboards with human feedback and static benchmarks (Figure 4). The static benchmarks we compare against are LiveBench, BigCodeBench, and LiveCodeBench, which evaluate models' code generation abilities on a variety of Python tasks and continue to be maintained with new model releases. We also compare to Chatbot Arena and its coding-specific subset, which collect human preferences over chat responses through a web platform.
We find a low correlation (r ≤ 0.1) with most static benchmarks, but a relatively higher correlation (Spearman's rank correlation r = 0.62) with Chatbot Arena (coding) and a similar correlation (r = 0.48) with Chatbot Arena (general). The stronger correlation with human preference evaluations than with static benchmarks likely indicates that human feedback captures aspects of model performance that static benchmarks fail to measure. We notice that smaller models tend to overperform (e.g., GPT-4o mini and Qwen-2.5-Coder 32B), particularly on static benchmarks. We attribute these differences to the unique distribution of data and tasks that Copilot Arena evaluates over, which we explore in more detail next.
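As a concrete illustration of how such numbers are computed, the sketch below ranks two hypothetical leaderboards and reports Spearman's rank correlation; the model names and scores are made up for illustration only.

```python
from scipy.stats import spearmanr

# Hypothetical scores for the same set of models on two different evaluations.
copilot_arena = {"model_a": 1250, "model_b": 1180, "model_c": 1100, "model_d": 1050}
other_eval    = {"model_a": 88.0, "model_b": 71.5, "model_c": 74.0, "model_d": 60.2}

models = sorted(copilot_arena)
rho, p_value = spearmanr([copilot_arena[m] for m in models],
                         [other_eval[m] for m in models])
print(f"Spearman's rank correlation: {rho:.2f} (p={p_value:.2f})")
```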

Compared to prior approaches, evaluating models in real user workflows leads to a diverse data distribution in terms of programming and natural languages, tasks, and code structures (Figure 5):
- Programming and natural language: While the plurality of Copilot Arena users write in English (36%) and Python (49%), we also identify 24 different natural languages and 103 programming languages, which is comparable to Chatbot Arena (general) and to benchmarks focused on multilingual generation. In contrast, static benchmarks tend to focus on questions written only in Python and English.
- Downstream tasks: Existing benchmarks tend to source problems from coding competitions, handwritten programming challenges, or a curated set of GitHub repositories. In contrast, Copilot Arena users are working on a diverse set of realistic tasks, including but not limited to frontend components, backend logic, and ML pipelines.
- Code structures and context lengths: Most coding benchmarks follow specific structures, which means that most benchmarks have relatively short context lengths. Similarly, Chatbot Arena focuses on natural language input collected from chat conversations, with many prompts not including any code context (e.g., 40% of Chatbot Arena's coding tasks contain code context and only 2.6% focus on infilling). Unlike any existing evaluation, Copilot Arena is structurally diverse with significantly longer inputs.
Insights into user preferences
- Downstream tasks significantly affect win rate, while programming languages have little effect: Changing the task type significantly impacts relative model performance, which may indicate that certain models are overexposed to competition-style algorithmic coding problems. On the other hand, the effect of the programming language on win rates was remarkably small, meaning that models that perform well on Python will likely perform well on another language. We hypothesize that this is because of the inherent similarities between programming languages, where learning one improves performance in another, aligning with trends reported in prior work.
- Smaller models may overfit to data similar to static benchmarks, while the performance of larger models is mixed: Existing benchmarks (e.g., those in Figure 4) primarily evaluate models on Python algorithmic problems with short context. However, we find that Qwen-2.5 Coder performs noticeably worse on frontend/backend tasks, longer contexts, and non-Python settings. We observe similar trends for the two other small models (Gemini Flash and GPT-4o mini). We hypothesize that overexposure may be particularly problematic for smaller models. On the other hand, performance among larger models is mixed.
Conclusion
While Copilot Arena represents a shift in the right direction for LLM evaluation, providing more grounded and realistic evaluations, there is still significant work to be done to fully represent all developer workflows. For example, this includes extending Copilot Arena to account for interface differences from production tools like GitHub Copilot and tackling privacy concerns that limit data sharing. Despite these constraints, our platform shows that evaluating coding LLMs in realistic environments yields rankings significantly different from static benchmarks or chat-based evaluations, and it highlights the importance of testing AI assistants with real users on real tasks. We have open-sourced Copilot Arena to encourage the open-source community to add more nuanced feedback mechanisms, code trajectory metrics, and additional interaction modes.
If you think this blog post is useful for your work, please consider citing it.
@misc{chi2025copilotarenaplatformcode,
title={Copilot Arena: A Platform for Code LLM Evaluation in the Wild},
author={Wayne Chi and Valerie Chen and Anastasios Nikolas Angelopoulos and Wei-Lin Chiang and Aditya Mittal and Naman Jain and Tianjun Zhang and Ion Stoica and Chris Donahue and Ameet Talwalkar},
year={2025},
eprint={2502.09328},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2502.09328},
}