Earth-News

How Leaky Datasets Undermine AI Math Reasoning Claims

May 11, 2024
in Science

Back in 2019, a group of computer scientists performed a now-famous experiment with far-reaching consequences for artificial intelligence research. At the time, machine vision algorithms were becoming capable of recognizing a wide range of objects, with some recording spectacular results in the standard tests used to assess their abilities.

But there was a problem with the method behind all these tests. Almost all the algorithms were trained on a database of labelled images, known as ImageNet. The database contained millions of images which had been carefully described in human-written text to help the machines learn. This effort was crucial for the development of machine vision and ImageNet became a kind of industry standard.

Computer scientists used one subset of these labelled images to train algorithms to identify a strawberry, a table, a human face and so on. They then used a different subset of images to test the algorithms. Over time, computer scientists claimed that their algorithms were becoming increasingly good at recognizing objects in the real world.

Image Recognition

But privately, researchers began to wonder whether this was really true. Because the ImageNet database was becoming so famous, an alternative explanation was that its images, or ones very like them, were leaking into the real world. So AI systems trained on them were just recognizing images they had already seen.

At the time, there was no way to test this because there were no high-quality image databases that hadn’t already been used to train the algorithms.

All that changed when a team from the University of California, Berkeley, created a new dataset of carefully labelled images that they knew the algorithms could not have seen. They then asked the algorithms to identify the objects in the images and found they weren’t as good as everyone had claimed.

Their experiment became a famous example of the pitfalls of relying on single databases for testing machines. Without careful management of this database, AI systems can seem to be good at a task in general but are really only repeating what they have already learnt.

That brings us to the current generation of AI systems, which are good at solving certain types of mathematics problems written out in words. For example, “James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year?”
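The arithmetic behind this sample question can be written out as a short Python sketch. The constant and function names are illustrative, and the 52-week year is an assumption that the question leaves implicit.

```python
# Worked arithmetic for the sample word problem quoted above.
# Assumption: a year is counted as 52 weeks.
PAGES_PER_LETTER = 3
FRIENDS = 2
TIMES_PER_WEEK = 2
WEEKS_PER_YEAR = 52


def pages_per_year() -> int:
    """Pages written in a year: pages x friends x sessions x weeks."""
    pages_per_week = PAGES_PER_LETTER * FRIENDS * TIMES_PER_WEEK  # 3 * 2 * 2 = 12
    return pages_per_week * WEEKS_PER_YEAR  # 12 * 52 = 624


print(pages_per_year())  # 624
```

The point of such questions is that several small steps must be chained in the right order, which is why they are treated as a test of reasoning rather than recall.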

The fact that AI systems can answer questions like this suggests they are able to reason. In fact, there is a special database called GSM8K that computer scientists use to test AI systems’ reasoning ability. This question is taken from there.

GSM8K is a “dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers.” It consists of some 7500 questions for training an AI system and 1000 questions to test the system.

Over the years, AI systems have become steadily better at answering these questions. That has led to various claims that AI systems are becoming better at the kind of reasoning needed to solve these problems.

But there is another possibility: GSM8K has become so well known that its test questions have begun to leak into the wild. As a result, AI systems may come across them during their broader training. So rather than answering them by reasoning, they could just be repeating answers they saw during training.

“There is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability,” say Hugh Zhang and colleagues at Scale AI, a start-up based in San Francisco focused on cleaning data for use by AI systems.
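One simple way to picture this kind of contamination is to check how much of a benchmark question appears word for word in a training document. The sketch below is only an illustration of that idea, not the method used by the Scale AI team; the function names, the 8-word window and the sample documents are all assumptions.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(question: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in corpus_doc."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question is shorter than n words
    return len(question_grams & ngrams(corpus_doc, n)) / len(question_grams)


question = ("James writes a 3-page letter to 2 different friends twice a week. "
            "How many pages does he write a year?")
# Hypothetical scraped page that quotes the question verbatim.
leaked = "Homework help: " + question + " Answer below."
# Hypothetical unrelated page.
clean = "A recipe for strawberry jam that needs two cups of sugar and a lemon."

print(overlap_fraction(question, leaked))  # 1.0
print(overlap_fraction(question, clean))   # 0.0
```

Exact matching of this kind only catches verbatim copies. Paraphrased or lightly edited questions, the "closely resembling" data mentioned in the quote, would slip past it, which is part of why a fresh, unpublished test set is a more direct check.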

Following the lead of the Berkeley researchers, the Scale AI team decided to test this idea by developing their own mathematics test of 1250 questions. They call this GSM1k and have taken care to ensure that it closely resembles the GSM8K test but has never been published.

“We took extensive efforts to ensure that GSM1k had a similar distribution of difficulty to GSM8k to ensure an apples-to-apples comparison,” they say. “We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, answer magnitude, and more.”

They then tested a wide range of AI systems on the GSM1k problems to see how well they performed. And the results make for interesting reading.

It turns out that a large number of AI systems perform significantly worse on the new dataset than on the original. “When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 13 per cent,” say Zhang and co.

The team points to several model families that seem particularly vulnerable, such as Mistral, from the French company Mistral AI, and Microsoft’s smaller AI system, Phi.
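The comparison described here comes down to scoring the same model on two question sets and looking at the difference. The following is a minimal sketch under stated assumptions: exact-match scoring, hypothetical function names, and toy answers that are not results from the paper.

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Share of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_gap(public_acc: float, fresh_acc: float) -> float:
    """Percentage-point drop from the public benchmark to the fresh one."""
    return (public_acc - fresh_acc) * 100


# Hypothetical toy data, not results from the paper.
public_preds, public_gold = ["624", "10", "7", "42"], ["624", "10", "7", "42"]
fresh_preds, fresh_gold = ["15", "8", "30", "99"], ["15", "9", "30", "100"]

gap = accuracy_gap(accuracy(public_preds, public_gold),
                   accuracy(fresh_preds, fresh_gold))
print(f"{gap:.1f} percentage points")  # 50.0 percentage points
```

A gap near zero is what one would expect from a model that generalizes; a large positive gap is the signature of a model that has seen something like the public questions before.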

Reasoned Response

However, others show little or no drop in performance. These include ChatGPT, Claude and Gemini. Zhang and co say that these models might be better at mathematical reasoning or that their model builders are more careful about data contamination.

The team also asked these systems to generate questions from GSM8K. It turns out that their ability to do this is correlated with the gap between their scores on GSM8K and GSM1k. This suggests the models have partially memorized examples from GSM8K, say Zhang and co.
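A relationship like this can be quantified with a rank correlation across models. The sketch below uses Spearman's rank correlation as one common choice; the article does not say which statistic was used, and the per-model numbers are hypothetical placeholders rather than values from the paper.

```python
def ranks(values: list[float]) -> list[float]:
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model values, for illustration only.
# How readily each model reproduces public benchmark text (higher = more readily).
generation_score = [-2.1, -1.4, -0.9, -0.5]
# Percentage-point drop from the public benchmark to the fresh one.
score_gap = [0.5, 3.0, 2.0, 9.0]

print(round(spearman(generation_score, score_gap), 2))  # 0.8
```

A rank-based measure is a reasonable fit here because it only asks whether models that reproduce benchmark text more readily also tend to show larger gaps, without assuming the relationship is linear.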

It’s not all bad news, however. “Many models, even the most heavily overfit families, show strong signs of generalizable mathematical reasoning,” they conclude.

That’s interesting work that reveals the limitations of the benchmarking processes used to test the ability of AI systems. Even though these tests show that there has been significant progress in the reasoning ability of AI systems in recent years, caution is needed in interpreting progress.

The bigger question is how more advanced AI systems can be benchmarked accurately, particularly when the datasets are so difficult to curate and as the systems’ abilities become superhuman. It raises the very real possibility that, at some point in the future, we may be unable to measure the true capability of these machines.

Ref: A Careful Examination of Large Language Model Performance on Grade School Arithmetic: arxiv.org/abs/2405.00332

Copyright for syndicated content belongs to the linked Source : Discover Magazine – https://www.discovermagazine.com/technology/how-leaky-datasets-undermine-ai-math-reasoning-claims

© 2023 earth-news.info
