View all platform updates and version history
This means the benchmark will now immediately grade agent submissions and provide feedback about any errors discovered.
You’re awesome!
The main part of the competition has ended. Teams from around the world built agents to solve complex corporate automation problems using AI:
The benchmark is now publicly available, providing prompts and immediately evaluating agents.
This benchmark is far tougher than real-world conditions for deploying AI agents. In reality, there’s an opportunity to tweak prompts, add shortcuts, and fine-tune agents. Here, however, teams had advance access only to the corporate API, and the company behind that API opened up only at competition time, with its own data, corporate knowledge base, and various CRM/ERP systems. Moreover, each task involved generating a completely new small universe from scratch. Oh, and there were also security rules, along with attempts to slightly bypass them.
Clearly, not all the tasks I designed were perfect, but they’re sufficient for assessing the current State-of-the-Art in building agents.
Teams whose agents blindly achieved more than 40% accuracy under these conditions could confidently exceed 90% accuracy in real corporate scenarios, and they’re already demonstrating this.
Special shout-out to the teams that employed unconventional architectures, local models, or even deployed their own hardware: you’re especially impressive!
Thank you all! Now let’s check out the best results (see architecture descriptions in the leaderboard) and push the State-of-the-Art even further! :hugging:
PS: The platform remains available; feel free to keep using it. People are already trying to beat that 71.8% score.
Here is an aggregated error map for the provided submissions in ERC3-PROD.
Each vertical column is one submission (sorted by score, descending), and each row is a single task in ERC3-PROD. A cell is green if the agent answered correctly and red if it made a mistake.
Even though teams used different architectures and LLM combinations, there are still clear patterns in how agents make mistakes.

ERC3 API and corresponding Python SDK now support session creation flags:
compete_accuracy
compete_budget
compete_speed
compete_local

ERC3 Prize competition on December 6th will be an accuracy competition. To submit a session for the prize leaderboard, create a session with the compete_accuracy flag and submit it before the deadline.
You can submit multiple sessions, but only the last one (that finished before the deadline) counts. Here is a sample snippet that relies on SDK 1.2.0:
from erc3 import ERC3
core = ERC3()
# Start session with metadata
res = core.start_session(
    benchmark="erc3-prod",
    workspace="my",
    name="Simple SGR Agent",
    architecture="NextStep SGR Agent with OpenAI",
    flags=["compete_accuracy"],
)
Note that the erc3-prod benchmark is not live yet. It will go live on December 9th. In the meantime, you can test your AI agent with the erc3-test benchmark.
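For example, assuming the same start_session signature as in the snippet above, a test run could be started like this (workspace, name, and architecture are placeholder values, and we assume flags can be left empty for non-competition runs):

from erc3 import ERC3

core = ERC3()

# Same call as above, pointed at the erc3-test benchmark while erc3-prod is offline.
res = core.start_session(
    benchmark="erc3-test",
    workspace="my",
    name="Simple SGR Agent",
    architecture="NextStep SGR Agent with OpenAI",
    flags=[],  # assumption: competition flags are not needed for test sessions
)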
To support better telemetry for the ERC challenge, we have changed the API and the corresponding SDK method.
The new LLM logging API now has typed fields for prompt_tokens, completion_tokens, and an optional cached_prompt_tokens instead of the previous untyped usage field.
When sending telemetry, you also need to pass the raw LLM response as completion. It will be used to validate responses.
api.log_llm(
    task_id=task.task_id,
    model=model,  # should match name on OpenRouter
    completion=completion.choices[0].message.content,
    duration_sec=time.time() - started,
    prompt_tokens=usage.prompt_tokens,
    completion_tokens=usage.completion_tokens,
    cached_prompt_tokens=usage.prompt_tokens_details.cached_tokens,
)
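For context, here is a minimal, hypothetical sketch of where model, started, completion, and usage might come from when using an OpenAI-compatible client routed through OpenRouter. The model name, prompt, and environment variable are placeholders, and prompt_tokens_details (used for the optional cached_prompt_tokens) may be absent for some providers.

import os
import time

from openai import OpenAI  # assumed OpenAI-compatible client, routed via OpenRouter

llm = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # placeholder env var
)
model = "openai/gpt-4o-mini"  # placeholder; use the exact model name from OpenRouter

started = time.time()
completion = llm.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "..."}],
)
usage = completion.usage  # prompt_tokens, completion_tokens, prompt_tokens_details

# model, started, completion and usage now plug into the api.log_llm() call above.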
All sample AI Agents have been updated to the latest SDK.
ERC3 Challenge will feature additional non-prize leaderboards, where solutions are ranked by speed, budget, and locality. To submit a session for one of these leaderboards, create it with the corresponding flag (compete_budget, compete_speed, or compete_local) and log your LLM calls via the log_llm methods (see the sketch below).

Leaderboards now fold repeat submissions with the same score. This allows more submissions to surface.
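As an illustration of the non-prize submission above, such a session can presumably be created the same way as the prize session, just with a different flag; everything other than the flag names is taken from the earlier snippet as placeholder values.

from erc3 import ERC3

core = ERC3()

# Same start_session call as the prize example, but targeting the speed
# leaderboard; swap in compete_budget or compete_local as needed.
res = core.start_session(
    benchmark="erc3-prod",
    workspace="my",
    name="Simple SGR Agent",
    architecture="NextStep SGR Agent with OpenAI",
    flags=["compete_speed"],
)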

New version of the Python SDK: 1.1.4. You can now start anonymous tasks without spinning up a session:
import textwrap
from erc3 import ERC3
core = ERC3()
task = core.start_new_task("erc3-test", "name_a_project")
# do something with the task
client = core.get_erc_client(task)
print(client.who_am_i().wiki_sha1)
done = core.complete_task(task)
if done.eval:
    logs = textwrap.indent(done.eval.logs, " ")
    print(f"Score: {done.eval.score}\n{logs}")
Fixed the /projects/search API to return the proper next offset in ERC3-DEV/TEST benchmarks. Thanks to @nick for the bug report and @AigizK for the feedback.
Fixed tasks add_time_entry_lead_v2 and add_time_entry_lead_v3 (no longer expect events) and nordic_followup_customers (more precise task requirement). Thanks to @xsirni for reporting these.
Fixed the broken_system test sometimes returning answers. Thanks to @mr_pro for reporting it.