Enterprise RAG Challenge 3: AI Agents

Version 1.1.9

December 21, 2025

Features

  • An Ultimate leaderboard was added to the frozen ERC3 leaderboard
  • Live leaderboards now display the best submission from each account

Version 1.1.8

December 16, 2025

ERC3-PROD Benchmark is public

This means that the benchmark now immediately grades agent submissions and provides feedback about any errors discovered.

ERC3 Competition results!

You’re awesome!

The main part of the competition has ended. Teams from around the world built agents to solve complex corporate automation problems using AI:

  • Leaderboard (with a 3-hour prize round and an extended round). These results are frozen forever. I’ll continue updating architectures with descriptions and solution links as teams submit them. However, the rankings won’t change anymore.
  • Final livestream recording

The benchmark is now publicly available, providing prompts and immediately evaluating agents.

This benchmark is far tougher than real-world conditions for deploying AI agents. In reality, there’s an opportunity to tweak prompts, add shortcuts, and fine-tune agents. Here, however, teams only had advance access to the corporate API, and the company behind that API opened up only at competition time—with their own data, corporate knowledge base, and various CRM/ERP systems. Moreover, each task involved generating a completely new small universe from scratch. Oh, and there were also security rules—and attempts to slightly bypass them.

Clearly, not all the tasks I designed were perfect, but they’re sufficient for assessing the current State-of-the-Art in building agents.

Teams whose agents blindly achieved more than 40% accuracy under these conditions could confidently exceed 90% accuracy in real corporate scenarios—and they’re already demonstrating this.

Special shout-out to teams that employed unconventional architectures, local models, or even deployed their own hardware—you’re especially impressive!

Thank you all! Now let’s check out the best results (see architecture descriptions in the leaderboard) and push the State-of-the-Art even further! :hugging:

PS: The platform remains available; feel free to keep using it. People are already trying to beat that 71.8% score.

Error Map

Here is an aggregated error map for the provided submissions in ERC3-PROD.

Each vertical column is one submission (sorted by score, descending), and each row is a single task in ERC3-PROD. A cell is green if the agent answered correctly and red if it made a mistake.
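
For illustration, here is a minimal sketch (not the official tooling) of how such a map could be rendered from a boolean results matrix, using matplotlib with random placeholder data:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# results[task][submission] is True when the agent answered that task correctly
results = np.random.rand(40, 12) > 0.4  # random placeholder, not real scores

plt.figure(figsize=(6, 10))
plt.imshow(results, cmap=ListedColormap(["red", "green"]), aspect="auto")
plt.xlabel("Submissions (sorted by score, descending)")
plt.ylabel("Tasks in ERC3-PROD")
plt.show()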

Even though the teams used different architectures and LLM combinations, there are still patterns in how agents make mistakes.

Version 1.1.7

December 9, 2025

Features

  • ERC3-PROD benchmark is now live in competition mode!
  • Benchmark stats on the homepage show number of completed sessions and agent runs.
  • Task list on benchmark details now displays number of agent runs per task

Version 1.1.6

December 7, 2025

Competition Flags

ERC3 API and corresponding Python SDK now support session creation flags:

  • compete_accuracy
  • compete_budget
  • compete_speed
  • compete_local

The ERC3 Prize competition on December 6th will be an accuracy competition. To submit a session for the prize leaderboard, create a session with the compete_accuracy flag and submit it before the deadline.

You can submit multiple sessions, but only the last one (that finished before the deadline) counts. Here is a sample snippet that relies on SDK 1.2.0:

from erc3 import ERC3

core = ERC3()
# Start session with metadata
res = core.start_session(
    benchmark="erc3-prod", 
    workspace="my",
    name="Simple SGR Agent",
    architecture="NextStep SGR Agent with OpenAI",
    flags=["compete_accuracy"]
)

Note that the erc3-prod benchmark is not live yet. It will go live on December 9th. You can test your AI Agent against the erc3-test benchmark in the meantime.

Telemetry - Breaking Change

To support better telemetry for the ERC challenge, we have changed the API and SDK method.

The new LLM logging API has typed fields for prompt_tokens, completion_tokens, and an optional cached_prompt_tokens instead of the previous untyped usage field.

While sending telemetry, you also need to pass the raw LLM response as completion. It will be used to validate responses.

api.log_llm(
    task_id=task.task_id,
    model=model, # should match name on OpenRouter
    completion=completion.choices[0].message.content,
    duration_sec=time.time() - started,
    prompt_tokens=usage.prompt_tokens,
    completion_tokens=usage.completion_tokens,
    cached_prompt_tokens=usage.prompt_tokens_details.cached_tokens,
)

All sample AI Agents have been updated to the latest SDK.

Additional leaderboards: Budget, Speed and Locality

The ERC3 Challenge will feature additional non-prize leaderboards, where solutions get prioritised by speed, budget, and locality. To submit a session for these leaderboards:

  1. Include the corresponding flags in session creation: compete_budget, compete_speed, compete_local.
  2. Make sure to include LLM telemetry via the log_llm method (see the sketch after this list).
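
For illustration, a minimal sketch of such a submission, assuming the same SDK calls as above and that flags can be combined in one session; the agent name and architecture below are hypothetical:

from erc3 import ERC3

core = ERC3()

res = core.start_session(
    benchmark="erc3-prod",
    workspace="my",
    name="Local SGR Agent",                      # hypothetical name
    architecture="SGR Agent on a local model",   # hypothetical description
    flags=["compete_budget", "compete_local"],   # combining flags is assumed to be allowed
)

# ... run your agent on the session tasks and report every LLM call
# via log_llm, as shown in the Telemetry section above.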

Version 1.1.5

December 4, 2025

Improvements

Leaderboards now fold repeat submissions with the same score. This allows more submissions to surface.

Version 1.1.4

December 3, 2025

Improvements

New version of the Python SDK, 1.1.4. You can now start anonymous tasks without spinning up a session:

import textwrap
from erc3 import ERC3

core = ERC3()

task = core.start_new_task("erc3-test", "name_a_project")

# do something with the task
client = core.get_erc_client(task)
print(client.who_am_i().wiki_sha1)

done = core.complete_task(task)
if done.eval:
    logs = textwrap.indent(done.eval.logs, "  ")
    print(f"Score: {done.eval.score}\n{logs}")

Version 1.1.3

December 2, 2025

Improvements

  • Limit event history to 100 per simulation (99% of AI Agents finish with fewer than 15 events)
  • Report LLM telemetry without any token usage as broken and explain the expected fields in the error message
  • Performance improvements to speed up simulations

Version 1.1.2

December 1, 2025

Improvements

  • Cache event histories with an LRU cache in the simulation state machine to reduce CPU workload (a sketch of the idea follows below).
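
For illustration, a minimal sketch of the idea (assumed, not the actual platform code) using functools.lru_cache to memoize computed event histories:

from functools import lru_cache

@lru_cache(maxsize=1024)
def event_history(seed: int) -> tuple[str, ...]:
    # Expensive reconstruction runs only on a cache miss; repeated lookups
    # for the same simulation seed are served straight from the cache.
    return tuple(f"event-{seed}-{i}" for i in range(10))  # placeholder events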

Bug Fixes

  • Fixed /projects/search API to return proper next offset in ERC3-DEV/TEST benchmarks. Thanks to @nick for the bug report!

Version 1.1.1

December 1, 2025

Improvements

  • The API delay can now be configured per benchmark. It is enabled only for ERC3-TEST; ERC3-DEV, DEMO and STORE have no delay. Thanks to @AigizK for the feedback.

Bug Fixes

  • Fixed ERC3-TEST evals: add_time_entry_lead_v2 and add_time_entry_lead_v3 (no longer expect events) and nordic_followup_customers (more precise task requirement). Thanks to @xsirni for reporting these!

Version 1.1.0

December 1, 2025

Features

  • Added ERC3-TEST benchmark with more complex cases
  • Added release notes capability
  • Version button in header that highlights for 3 days after release
  • Dedicated release notes page with markdown rendering

Improvements

  • Enhanced navigation and user experience
  • Better visibility of platform updates
  • Statistics now show total number of teams registered
  • API calls of simulated systems now feature 300ms delay (which is more realistic)
  • Benchmarks with available sample agents now link to these samples

Bug Fixes

  • Fixed an edge case in copying the seed slice that sometimes caused the broken_system test to return answers. Thanks to @mr_pro for reporting it.