Earth-News

Your Personal Information Is Probably Being Used to Train Generative AI Models

October 19, 2023
in Science

Artists and writers are up in arms about generative artificial intelligence systems, and understandably so. These machine learning models are only capable of pumping out images and text because they’ve been trained on mountains of real people’s creative work, much of it copyrighted. Major AI developers including OpenAI, Meta and Stability AI now face multiple lawsuits over this practice. Such legal claims are supported by independent analyses; in August, for instance, the Atlantic reported finding that Meta trained its large language model (LLM) in part on a data set called Books3, which contained more than 170,000 pirated and copyrighted books.

And training data sets for these models include more than books. In the rush to build and train ever-larger AI models, developers have swept up much of the searchable Internet. This not only has the potential to violate copyrights but also threatens the privacy of the billions of people who share information online. It also means that supposedly neutral models could be trained on biased data. A lack of corporate transparency makes it difficult to figure out exactly where companies are getting their training data—but Scientific American spoke with some AI experts who have a general idea.

Where do AI training data come from?

To build large generative AI models, developers turn to the public-facing Internet. But “there’s no one place where you can go download the Internet,” says Emily M. Bender, a linguist who studies computational linguistics and language technology at the University of Washington. Instead developers amass their training sets through automated tools that catalog and extract data from the Internet. Web “crawlers” travel from link to link indexing the location of information in a database, while Web “scrapers” download and extract that same information.
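
The difference between crawling and scraping can be illustrated with a minimal sketch. It assumes a hypothetical three-page "web" held in memory (the PAGES dictionary, its paths and its contents are invented for illustration), so nothing is fetched over the network. Real crawlers and scrapers are far more elaborate, and this is not a description of any particular company's tools.

```python
from html.parser import HTMLParser

# Hypothetical three-page "web", kept in memory so the sketch runs offline.
PAGES = {
    "/home": '<a href="/blog">Blog</a> <a href="/about">About</a><p>Welcome.</p>',
    "/blog": '<a href="/home">Home</a><p>My first post.</p>',
    "/about": '<p>Contact me at the cafe.</p>',
}


class PageParser(HTMLParser):
    """Collects outgoing links and visible text from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every <a href="..."> tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep non-empty text fragments, including link labels.
        if data.strip():
            self.text.append(data.strip())


def crawl(start):
    """Crawler: follow links breadth-first and list the pages discovered."""
    seen, queue = [start], [start]
    while queue:
        url = queue.pop(0)
        parser = PageParser()
        parser.feed(PAGES.get(url, ""))
        for link in parser.links:
            if link not in seen:
                seen.append(link)
                queue.append(link)
    return seen


def scrape(url):
    """Scraper: take one page and extract its text content."""
    parser = PageParser()
    parser.feed(PAGES.get(url, ""))
    return " ".join(parser.text)


index = crawl("/home")                         # where the information lives
corpus = {url: scrape(url) for url in index}   # the information itself
```

Here the crawler only produces an index of locations, while the scraper turns each location into text that could be added to a training set.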

A very well-resourced company, such as Google’s owner, Alphabet, which already builds Web crawlers to power its search engine, can opt to employ its own tools for the task, says machine learning researcher Jesse Dodge of the nonprofit Allen Institute for AI. Other companies, however, turn to existing resources such as Common Crawl, which helped feed OpenAI’s GPT-3, or databases such as the Large-Scale Artificial Intelligence Open Network (LAION), which contains links to images and their accompanying captions. Neither Common Crawl nor LAION responded to requests for comment. Companies that want to use LAION as an AI resource (it was part of the training set for image generator Stable Diffusion, Dodge says) can follow these links but must download the content themselves.

Web crawlers and scrapers can easily access data from just about anywhere that’s not behind a login page. Social media profiles set to private aren’t included. But data that are viewable in a search engine or without logging into a site, such as a public LinkedIn profile, might still be vacuumed up, Dodge says. Then, he adds, “there’s the kinds of things that absolutely end up in these Web scrapes”—including blogs, personal webpages and company sites. This includes anything on popular photograph-sharing site Flickr, online marketplaces, voter registration databases, government webpages, Wikipedia, Reddit, research repositories, news outlets and academic institutions. Plus, there are pirated content compilations and Web archives, which often contain data that have since been removed from their original location on the Web. And scraped databases do not go away. “If there was text scraped from a public website in 2018, that’s forever going to be available, whether [the site or post has] been taken down or not,” Dodge notes.

Some data crawlers and scrapers are even able to get past paywalls (including Scientific American’s) by disguising themselves behind paid accounts, says Ben Zhao, a computer scientist at the University of Chicago. “You’d be surprised at how far these crawlers and model trainers are willing to go for more data,” Zhao says. Paywalled news sites were among the top data sources included in Google’s C4 database (used to train Google’s LLM T5 and Meta’s LLaMA), according to a joint analysis by the Washington Post and the Allen Institute.

Web scrapers can also hoover up surprising kinds of personal information of unclear origins. Zhao points to one particularly striking example where an artist discovered that a private diagnostic medical image of herself was included in the LAION database. Reporting from Ars Technica confirmed the artist’s account and that the same data set contained medical record photographs of thousands of other people as well. It’s impossible to know exactly how these images ended up being included in LAION, but Zhao points out that data get misplaced, privacy settings are often lax, and leaks and breaches are common. Information not intended for the public Internet ends up there all the time.

In addition to data from these Web scrapes, AI companies might purposefully incorporate other sources—including their own internal data—into their model training. OpenAI fine-tunes its models based on user interactions with its chatbots. Meta has said its latest AI was partially trained on public Facebook and Instagram posts. According to Elon Musk, the social media platform X (formerly known as Twitter) plans to do the same with its own users’ content. Amazon, too, says it will use voice data from customers’ Alexa conversations to train its new LLM.

But beyond these acknowledgements, companies have become increasingly cagey about revealing details on their data sets in recent months. Though Meta offered a general data breakdown in its technical paper on the first version of LLaMA, the release of LLaMA 2 a few months later included far less information. Google, too, didn’t specify its data sources for its recently released PaLM 2 AI model, beyond saying that much more data were used to train PaLM 2 than to train the original version of PaLM. OpenAI wrote that it would not disclose any details on its training data set or method for GPT-4, citing competition as a chief concern.

Why are dodgy training data a problem?

AI models can regurgitate the same material that was used to train them—including sensitive personal data and copyrighted work. Many widely used generative AI models have blocks meant to prevent them from sharing identifying information about individuals, but researchers have repeatedly demonstrated ways to get around these restrictions. For creative workers, even when AI outputs don’t exactly qualify as plagiarism, Zhao says they can eat into paid opportunities by, for example, aping a specific artist’s unique visual techniques. But without transparency about data sources, it’s difficult to blame such outputs on the AI’s training; after all, it could be coincidentally “hallucinating” the problematic material.

A lack of transparency about training data also raises serious issues related to data bias, says Meredith Broussard, a data journalist who researches artificial intelligence at New York University. “We all know there is wonderful stuff on the Internet, and there is extremely toxic material on the Internet,” she says. Data sets such as Common Crawl, for instance, include white supremacist websites and hate speech. Even less extreme sources of data contain content that promotes stereotypes. Plus, there’s a lot of pornography online. As a result, Broussard points out, AI image generators tend to produce sexualized images of women. “It’s bias in, bias out,” she says.

Bender echoes this concern and points out that the bias goes even deeper—down to who can post content to the Internet in the first place. “That is going to skew wealthy, skew Western, skew towards certain age groups, and so on,” she says. Online harassment compounds the problem by forcing marginalized groups out of some online spaces, Bender adds. This means data scraped from the Internet fail to represent the full diversity of the real world. It’s hard to understand the value and appropriate application of a technology so steeped in skewed information, Bender says, especially if companies aren’t forthright about potential sources of bias.

How can you protect your data from AI?

Unfortunately, there are currently very few options for meaningfully keeping data out of the maws of AI models. Zhao and his colleagues have developed a tool called Glaze, which can be used to make images effectively unreadable to AI models. But the researchers have only been able to test its efficacy with a subset of AI image generators, and its uses are limited. For one thing, it can only protect images that haven’t previously been posted online. Anything else may have already been vacuumed up into Web scrapes and training data sets. As for text, no such similar tool exists.

Website owners can insert digital flags telling Web crawlers and scrapers to not collect site data, Zhao says. It’s up to the scraper developer, however, to opt to abide by these notices.
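
One common form of such a flag is a robots.txt file. The sketch below assumes that mechanism, which the text above does not name, and uses Python's standard urllib.robotparser module to show how a well-behaved crawler might check the rules before fetching a page. The crawler names ExampleAIBot and SomeOtherBot and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI crawler entirely, and keep
# every other crawler out of a private directory.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent, url):
    """Return True if the parsed rules permit `agent` to fetch `url`."""
    return rules.can_fetch(agent, url)


ai_allowed = may_fetch("ExampleAIBot", "https://example.com/portfolio.html")
other_allowed = may_fetch("SomeOtherBot", "https://example.com/portfolio.html")
private_allowed = may_fetch("SomeOtherBot", "https://example.com/private/notes.html")
```

Nothing in this check is enforced by the website. The crawler has to run it and honor the answer, which is why the notices only work when scraper developers choose to abide by them.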

In California and a handful of other states, recently passed digital privacy laws give consumers the right to request that companies delete their data. In the European Union, too, people have the right to data deletion. So far, however, AI companies have pushed back on such requests by claiming the provenance of the data can’t be proven—or by ignoring the requests altogether—says Jennifer King, a privacy and data researcher at Stanford University.

Even if companies respect such requests and remove your information from a training set, there’s no clear strategy for getting an AI model to unlearn what it has previously absorbed, Zhao says. To truly pull all the copyrighted or potentially sensitive information out of these AI models, one would have to effectively retrain the AI from scratch, which can cost up to tens of millions of dollars, Dodge says.

Currently there are no significant AI policies or legal rulings that would require tech companies to take such actions—and that means they have no incentive to go back to the drawing board.

ABOUT THE AUTHOR(S)

Lauren Leffer is a tech reporting fellow at Scientific American. Previously, she has covered environmental issues, science and health. Follow her on Twitter @lauren_leffer

Copyright for syndicated content belongs to the linked source: Scientific American, https://www.scientificamerican.com/article/your-personal-information-is-probably-being-used-to-train-generative-ai-models/
