How Good Are AI Agents at Real Research? Inside the Deep Research Bench Report

June 3, 2025

As large language models (LLMs) evolve rapidly, so does their promise as powerful research assistants. Increasingly, they do more than answer simple factual questions: they tackle "deep research" tasks that involve multi-step reasoning, weighing conflicting information, sourcing data from across the web, and synthesizing it all into coherent output.

This new capability is now being marketed under different brand names by the major labs. OpenAI calls it "Deep Research," Anthropic calls it "Extended Thinking," Google's Gemini offers "Search + Pro" features, and Perplexity labels its offering "Pro Search" or "Deep Research." But how effective are these products in practice? A new report from FutureSearch, Deep Research Bench (DRB): Evaluating Web Research Agents, offers the most rigorous evaluation to date, and its results reveal both impressive capabilities and important shortcomings.

What Is Deep Research Bench?

Created by the team at FutureSearch, Deep Research Bench is a meticulously constructed benchmark designed to assess how well AI agents perform multi-step, web-based research tasks. These are not simple questions with easy answers; they reflect the messy, open-ended challenges faced by analysts, policymakers, and researchers in real-world settings.

The benchmark includes 89 distinct tasks across eight categories, such as:

  • Find Number. Example: "How many FDA Class II medical device recalls have there been?"
  • Validate Claim. Example: "Is ChatGPT 10x more energy-intensive than Google Search?"
  • Compile Dataset. Example: "Job trends for US software developers from 2019 to 2023"

Each task type comes with carefully constructed, human-validated answers and is evaluated against a frozen dataset of scraped web pages known as RetroSearch. This ensures consistency across model evaluations and avoids the variability of the live web.
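
To make that structure concrete, here is a minimal sketch of how a single benchmark task could be represented in code. The class name, field names, and placeholder values are illustrative assumptions, not FutureSearch's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchTask:
    """Hypothetical task record; the fields are assumptions, not the real DRB format."""
    category: str                      # one of the eight task types, e.g. "validate_claim"
    prompt: str                        # the research question posed to the agent
    reference_answer: str              # human-validated answer used for scoring
    frozen_pages: list[str] = field(default_factory=list)  # archived pages the agent may consult

# A hypothetical record mirroring one of the examples above.
example = ResearchTask(
    category="validate_claim",
    prompt="Is ChatGPT 10x more energy-intensive than Google Search?",
    reference_answer="<answer validated by human annotators>",
    frozen_pages=["<archived page id 1>", "<archived page id 2>"],
)
```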

Agent Architecture: ReAct and RetroSearch

At the heart of Deep Research Bench is the ReAct architecture, short for "Reason + Act." The method mimics how a human researcher tackles a problem: think through the task, perform an action such as a web search, observe the results, then decide whether to iterate or conclude.

While earlier models follow this loop explicitly, newer "thinking" models often streamline the process, embedding reasoning more fluidly into their actions. To ensure consistency across evaluations, DRB introduces RetroSearch, a custom-built, frozen version of the web. Rather than relying on the ever-changing live internet, agents query a curated archive of web pages scraped with tools such as Serper, Playwright, and ScraperAPI. The scale is impressive: for high-complexity tasks such as "Gather Evidence," RetroSearch provides access to over 189,000 pages, all frozen in time, ensuring a fair and replicable testing environment.
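
As a rough illustration of how the pieces fit together, the sketch below runs a ReAct-style loop against a stand-in for a frozen RetroSearch-like archive. The `react_agent` and `retro_search` names, the prompt format, and the `call_llm` interface are all assumptions made for this example, not the benchmark's real implementation.

```python
from typing import Callable

# Stand-in for a frozen archive: search queries map to page text scraped once and never updated.
FROZEN_ARCHIVE: dict[str, list[str]] = {
    "fda class ii recalls": ["<archived page text 1>", "<archived page text 2>"],
}

def retro_search(query: str) -> list[str]:
    """Look the query up in the frozen archive instead of hitting the live web."""
    return FROZEN_ARCHIVE.get(query.lower(), [])

def react_agent(task_prompt: str,
                call_llm: Callable[[str], str],
                max_steps: int = 10) -> str:
    """Reason -> Act -> Observe loop that stops once the model commits to an answer."""
    transcript = f"Task: {task_prompt}"
    for _ in range(max_steps):
        # Reason: ask the model what to do next, given everything observed so far.
        decision = call_llm(transcript + "\nReply with SEARCH:<query> or ANSWER:<final answer>.")
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()
        if decision.startswith("SEARCH:"):
            # Act + Observe: run the query against the frozen archive and record what came back.
            query = decision.removeprefix("SEARCH:").strip()
            pages = retro_search(query) or ["no archived pages found"]
            transcript += f"\nObservation for '{query}': " + " | ".join(pages)
    return "No answer produced within the step budget."

# Scripted stand-in for the model, so the sketch runs end to end without an API key.
scripted = iter([
    "SEARCH: FDA Class II recalls",
    "ANSWER: <number derived from the archived pages>",
])
print(react_agent("How many FDA Class II medical device recalls have there been?",
                  call_llm=lambda _prompt: next(scripted)))
```

The property the benchmark depends on sits in `retro_search`: every query resolves against the same frozen snapshot, so repeated evaluation runs see identical evidence no matter how the live web changes.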

Which AI Agents Perform Best?

Of all the contenders, OpenAI's o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on Deep Research Bench. That may sound modest, but it is important to understand the benchmark's difficulty: because of ambiguity in task definitions and scoring, even a perfect agent would likely top out around what the researchers call the "noise ceiling" of roughly 0.8. In other words, even today's best models still fall short of a well-informed, methodical human researcher.

Still, the leaderboard offers clear insights. Not only did o3 lead the pack, it did so with speed and consistency, showing strong performance across nearly every task type. Anthropic's Claude 3.7 Sonnet followed closely, demonstrating versatility in both its "thinking" and "non-thinking" modes. Gemini 2.5 Pro, Google's flagship model, stood out for its ability to handle tasks requiring structured planning and step-by-step reasoning. Meanwhile, the open-weight DeepSeek-R1 kept pace with GPT-4 Turbo, a welcome surprise that narrows the performance gap between open and closed models.

A clear pattern emerged throughout: newer "thinking-enabled" models consistently outperformed their earlier counterparts, and closed models maintained a noticeable edge over open-weight alternatives.

Where Do Agents Struggle?

Reading the failure patterns highlighted in the Deep Research Bench report felt surprisingly familiar. One of the most frustrating issues I have personally encountered is when an AI agent simply forgets what we are doing, especially during long research and content-creation sessions. As the context window grows, the model often starts to lose the thread: key details fade, goals get muddled, and suddenly the responses feel disjointed or aimless. At some point, I have learned it is better to cut my losses and start from scratch, even if that means throwing away everything generated so far.

That kind of forgetfulness is not merely anecdotal; it was the single strongest predictor of failure in the Deep Research Bench evaluation. And it is not the only recurring problem. The report also highlights how some models fall into repetitive tool use, running the same search over and over as if stuck in a loop. Others craft queries poorly, pattern-matching keywords rather than thinking critically about how to search effectively. And far too often, agents fall victim to premature conclusions, delivering half-formed answers that technically check the box but stop short of real insight.

Even among the top models, the differences are stark. GPT-4 Turbo, for example, showed a notable tendency to forget earlier steps, while DeepSeek-R1 was more likely to hallucinate, inventing plausible-sounding but incorrect information. Across the board, models frequently failed to cross-check sources or validate findings before finalizing their output. For anyone who relies on AI for serious work, these issues will feel all too familiar, and they underscore how far we still have to go in building agents that can truly think and research like humans.

What About Memory-Based Performance?

Interestingly, Deep Research Bench also evaluated what it calls "toolless" agents, which operate without access to external tools such as web search or document retrieval. These agents rely entirely on their internal training data and memory, generating answers based solely on what they learned during training. In practice, that means they cannot look anything up or verify information; they guess based on what they "remember."
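
In terms of the sketch above, a toolless run collapses the whole loop into a single model call with no archive access. The helper below reuses the same hypothetical `call_llm` interface and is only meant to show the contrast in setup.

```python
def toolless_agent(task_prompt: str, call_llm) -> str:
    """One-shot answer from parametric memory: no search loop, no frozen archive."""
    return call_llm(
        "You have no web access or documents; answer only from what you already know.\n"
        f"Task: {task_prompt}"
    )
```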

Surprisingly, these toolless agents performed nearly as well as full research agents on certain tasks. On the Validate Claim task, for example, where the goal is to assess the plausibility of a statement, they scored 0.61, almost matching the 0.62 average of tool-enabled agents. This suggests that models such as o3 and Claude often carry strong internal priors and can recognize the truthfulness of common claims without needing to search the web.

On the more demanding tasks, however, such as Derive Number, where multiple values must be pieced together from different sources, or Gather Evidence, which depends on finding and assessing diverse facts in context, they fell apart completely. Without fresh information or real-time search, they simply lacked the means to produce accurate or comprehensive answers.

This contrast highlights an important nuance: while today's LLMs may "know" a great deal, deep research depends on reasoning over up-to-date, verifiable information, not just recall.

Final Thoughts

The DRB report makes one thing clear: today's AI agents can outperform the average person on narrowly defined tasks, but they still lag behind skilled generalist researchers, especially when it comes to strategic planning, adapting mid-process, and nuanced reasoning.

This gap becomes especially evident during long or complex sessions. I have experienced it firsthand: an agent gradually loses track of the task objective, leading to a frustrating breakdown in coherence and usefulness.

What makes Deep Research Bench so valuable is that it tests more than surface-level knowledge; it probes the intersection of tool use, memory, reasoning, and adaptation, mirroring actual research far more closely than benchmarks such as MMLU or GSM8K.

As LLMs continue to be woven into serious knowledge work, FutureSearch tools such as DRB will be essential for assessing not just what these systems know, but how well they actually work.
