Friday, October 3, 2025
Earth-News

AI agent benchmarks are misleading, study warns

July 6, 2024
in Technology

July 6, 2024 9:37 AM

Image: AI agents (credit: VentureBeat with DALL-E 3)

AI agents are becoming a promising new research direction with potential applications in the real world. These agents use foundation models such as large language models (LLMs) and vision language models (VLMs) to take natural language instructions and pursue complex goals autonomously or semi-autonomously. AI agents can use various tools such as browsers, search engines and code compilers to verify their actions and reason about their goals. 

However, a recent analysis by researchers at Princeton University has revealed several shortcomings in current agent benchmarks and evaluation practices that hinder their usefulness in real-world applications.

Their findings highlight that agent benchmarking comes with distinct challenges, and we can’t evaluate agents in the same way that we benchmark foundation models.

Cost vs accuracy trade-off

One major issue the researchers highlight in their study is the lack of cost control in agent evaluations. AI agents can be much more expensive to run than a single model call, as they often rely on stochastic language models that can produce different results when given the same query multiple times. 

To increase accuracy, some agentic systems generate several responses and use mechanisms like voting or external verification tools to choose the best answer. Sometimes sampling hundreds or thousands of responses can increase the agent’s accuracy. While this approach can improve performance, it comes at a significant computational cost. Inference costs are not always a problem in research settings, where the goal is to maximize accuracy.
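
The sample-and-vote pattern described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular paper: `make_noisy_model` is a hypothetical stand-in for a stochastic LLM, and cost is counted simply as the number of model calls.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def sample_and_vote(model, query, n_samples):
    """Call the model n_samples times and vote; cost is counted in model calls."""
    answers = [model(query) for _ in range(n_samples)]
    return majority_vote(answers), n_samples


def make_noisy_model(correct, wrong, p_correct, seed=0):
    """Hypothetical stand-in for a stochastic LLM: right with probability p_correct."""
    rng = random.Random(seed)
    return lambda query: correct if rng.random() < p_correct else wrong


model = make_noisy_model(correct="Paris", wrong="Lyon", p_correct=0.6)
answer, calls = sample_and_vote(model, "Capital of France?", n_samples=25)
print(answer, calls)
```

The point of the sketch is the second return value: every extra sample is another model call, so the cost of voting grows linearly with the number of samples, whatever it does to accuracy.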

However, in practical applications, there is a limit to the budget available for each query, making it crucial for agent evaluations to be cost-controlled. Failing to do so may encourage researchers to develop extremely costly agents simply to top the leaderboard. The Princeton researchers propose visualizing evaluation results as a Pareto curve of accuracy and inference cost and using techniques that jointly optimize the agent for these two metrics.
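
One way to read "a Pareto curve of accuracy and inference cost" is as the set of agents that no other agent beats on both axes at once. The sketch below computes that set; the agent names and numbers are hypothetical and chosen only for illustration, not taken from the study.

```python
from typing import NamedTuple


class AgentResult(NamedTuple):
    name: str
    cost_usd: float   # average inference cost per task
    accuracy: float   # fraction of tasks solved


def pareto_frontier(results):
    """Keep agents that are strictly more accurate than every cheaper-or-equal agent."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by ascending cost; among equal costs, highest accuracy first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.accuracy)):
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


# Hypothetical numbers for illustration only.
results = [
    AgentResult("single-call", 0.01, 0.70),
    AgentResult("retry-x5", 0.05, 0.78),
    AgentResult("debate", 0.90, 0.77),
    AgentResult("vote-x100", 1.00, 0.80),
]
for r in pareto_frontier(results):
    print(f"{r.name}: ${r.cost_usd:.2f}, {r.accuracy:.0%}")
```

In this made-up data the "debate" agent drops off the frontier because "retry-x5" is both cheaper and more accurate, which is exactly the kind of comparison an accuracy-only leaderboard hides.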

The researchers evaluated accuracy-cost tradeoffs of different prompting techniques and agentic patterns introduced in different papers.

“For substantially similar accuracy, the cost can differ by almost two orders of magnitude,” the researchers write. “Yet, the cost of running these agents isn’t a top-line metric reported in any of these papers.”

The researchers argue that optimizing for both metrics can lead to “agents that cost less while maintaining accuracy.” Joint optimization can also enable researchers and developers to trade off the fixed and variable costs of running an agent. For example, they can spend more on optimizing the agent’s design but reduce the variable cost by using fewer in-context learning examples in the agent’s prompt.
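
The fixed-versus-variable trade-off is simple arithmetic, sketched below under assumed numbers. Design A stands for an agent with no up-front optimization but a long prompt full of in-context examples; design B stands for an agent that pays a one-time optimization cost to run with a shorter prompt. All dollar figures are hypothetical.

```python
def total_cost(fixed_usd, variable_usd_per_query, n_queries):
    """One-time design cost plus per-query inference cost."""
    return fixed_usd + variable_usd_per_query * n_queries


def break_even_queries(fixed_a, var_a, fixed_b, var_b):
    """Query count at which B's total cost equals A's.

    Returns None when B's per-query cost is not lower than A's,
    since B then never catches up through cheaper queries.
    """
    if var_a <= var_b:
        return None
    return (fixed_b - fixed_a) / (var_a - var_b)


# Hypothetical costs: A = long few-shot prompt, B = optimized short prompt.
fixed_a, var_a = 0.0, 0.02
fixed_b, var_b = 50.0, 0.005

n = break_even_queries(fixed_a, var_a, fixed_b, var_b)
print(f"B becomes cheaper after about {n:.0f} queries")
print(total_cost(fixed_a, var_a, 10_000), total_cost(fixed_b, var_b, 10_000))
```

Which design is preferable therefore depends on expected query volume, a quantity that a research benchmark does not capture but a deployment plan must.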

The researchers tested joint optimization on HotpotQA, a popular question-answering benchmark. Their results show that the joint optimization formulation provides a way to strike an optimal balance between accuracy and inference costs.

“Useful agent evaluations must control for cost—even if we ultimately don’t care about cost and only about identifying innovative agent designs,” the researchers write. “Accuracy alone cannot identify progress because it can be improved by scientifically meaningless methods such as retrying.”
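
A back-of-the-envelope calculation shows why retrying can inflate accuracy without any change to the agent. The sketch assumes that attempts are independent and that some check (for example, a test suite) recognizes a correct answer when one appears; neither assumption comes from the article, and the 30% single-attempt success rate is hypothetical.

```python
def success_within_k_attempts(p_single, k):
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** k


# Hypothetical agent that solves a task 30% of the time on a single attempt.
for k in (1, 5, 20):
    rate = success_within_k_attempts(0.3, k)
    print(f"{k:>2} attempts: {rate:.3f} success rate, {k}x the inference cost")
```

Under these assumptions the reported success rate climbs toward 100% as attempts are added, while the cost grows in proportion to the number of attempts, so the gain says nothing about the quality of the underlying design.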

Model development vs downstream applications

Another issue the researchers highlight is the difference between evaluating models for research purposes and developing downstream applications. In research, accuracy is often the primary focus, with inference costs being largely ignored. However, when developing real-world applications built on AI agents, inference costs play a crucial role in deciding which model and technique to use.

Evaluating inference costs for AI agents is challenging. For example, different model providers can charge different amounts for the same model. Meanwhile, the costs of API calls are regularly changing and might vary based on developers’ decisions. For example, on some platforms, bulk API calls are charged differently. 

The researchers created a website that adjusts model comparisons based on token pricing to address this issue. 
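
The idea of adjusting comparisons to token pricing can be illustrated by recording token counts once and recomputing dollar costs whenever prices change. This is a sketch of the general idea, not of the researchers' website; the `Pricing` class, the provider labels, and every price below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    usd_per_million_input: float
    usd_per_million_output: float


def run_cost(input_tokens, output_tokens, pricing):
    """Dollar cost of a run given its token counts and a price sheet."""
    return (
        input_tokens * pricing.usd_per_million_input
        + output_tokens * pricing.usd_per_million_output
    ) / 1_000_000


# Token counts measured once for a benchmark run (hypothetical values).
input_tokens, output_tokens = 2_000_000, 500_000

# The same model offered at different hypothetical prices.
price_sheets = {
    "provider-A": Pricing(1.0, 2.0),
    "provider-B": Pricing(0.5, 1.5),
}
for name, pricing in price_sheets.items():
    print(name, run_cost(input_tokens, output_tokens, pricing))
```

Keeping token counts rather than dollar amounts as the stored measurement means a comparison stays usable after a price change or a switch of provider.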

They also conducted a case study on NovelQA, a benchmark for question-answering tasks on very long texts. They found that benchmarks meant for model evaluation can be misleading when used for downstream evaluation. For example, the original NovelQA study makes retrieval-augmented generation (RAG) look much worse relative to long-context models than it is in a real-world scenario. Their findings show that RAG and long-context models were roughly equally accurate, while long-context models were 20 times more expensive.

Overfitting is a problem

In learning new tasks, machine learning (ML) models often find shortcuts that allow them to score well on benchmarks. One prominent type of shortcut is “overfitting,” where the model finds ways to cheat on the benchmark tests and provides results that do not translate to the real world. The researchers found that overfitting is a serious problem for agent benchmarks, as they tend to be small, typically consisting of only a few hundred samples. This issue is more severe than data contamination in training foundation models, as knowledge of test samples can be directly programmed into the agent.

To address this problem, the researchers suggest that benchmark developers should create and keep holdout test sets that are composed of examples that can’t be memorized during training and can only be solved through a proper understanding of the target task. In their analysis of 17 benchmarks, the researchers found that many lacked proper holdout datasets, allowing agents to take shortcuts, even unintentionally. 

“Surprisingly, we find that many agent benchmarks do not include held-out test sets,” the researchers write. “In addition to creating a test set, benchmark developers should consider keeping it secret to prevent LLM contamination or agent overfitting.”
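
The mechanics of setting aside a held-out set can be sketched with a deterministic, hash-based split. This covers only the bookkeeping: it assumes tasks have stable string IDs, and the `salt`, the 20% fraction, and the task names are hypothetical. It does not by itself make held-out tasks resistant to memorization, which depends on how the tasks are designed and whether they are kept secret.

```python
import hashlib


def is_held_out(task_id, holdout_fraction=0.2, salt="benchmark-v1"):
    """Assign a task to the held-out set based on a hash of its ID."""
    digest = hashlib.sha256(f"{salt}:{task_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_fraction


def split_tasks(task_ids, holdout_fraction=0.2):
    """Split task IDs into a public development set and a held-out test set."""
    dev, held_out = [], []
    for task_id in task_ids:
        target = held_out if is_held_out(task_id, holdout_fraction) else dev
        target.append(task_id)
    return dev, held_out


# Hypothetical benchmark of 300 tasks.
task_ids = [f"task-{i:03d}" for i in range(300)]
dev, held_out = split_tasks(task_ids)
print(len(dev), len(held_out))
```

Because assignment depends only on the task ID and the salt, the split is reproducible and a task never migrates between sets when the benchmark grows.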

They also note that different types of holdout samples are needed based on the desired level of generality of the task that the agent accomplishes.

“Benchmark developers must do their best to ensure that shortcuts are impossible,” the researchers write. “We view this as the responsibility of benchmark developers rather than agent developers, because designing benchmarks that don’t allow shortcuts is much easier than checking every single agent to see if it takes shortcuts.”

The researchers tested WebArena, a benchmark that evaluates the performance of AI agents in solving problems with different websites. They found several shortcuts in the training datasets that allowed the agents to overfit to tasks in ways that would easily break with minor changes in the real world. For example, the agent could make assumptions about the structure of web addresses without considering that it might change in the future or that it would not work on different websites.

These errors inflate accuracy estimates and lead to over-optimism about agent capabilities, the researchers warn.

With AI agents being a new field, the research and developer communities still have much to learn about how to test the limits of these new systems, which might soon become an important part of everyday applications.

“AI agent benchmarking is new and best practices haven’t yet been established, making it hard to distinguish genuine advances from hype,” the researchers write. “Our thesis is that agents are sufficiently different from models that benchmarking practices need to be rethought.”

Copyright for syndicated content belongs to the linked Source : VentureBeat – https://venturebeat.com/ai/ai-agent-benchmarks-are-misleading-study-warns/

Tags: Agent, benchmarks, technology