Monday, August 11, 2025
Earth-News

Your Personal Information Is Probably Being Used to Train Generative AI Models

October 19, 2023

Artists and writers are up in arms about generative artificial intelligence systems, and understandably so. These machine learning models are only capable of pumping out images and text because they’ve been trained on mountains of real people’s creative work, much of it copyrighted. Major AI developers including OpenAI, Meta and Stability AI now face multiple lawsuits over this practice. Such legal claims are supported by independent analyses; in August, for instance, the Atlantic reported finding that Meta trained its large language model (LLM) in part on a data set called Books3, which contained more than 170,000 pirated and copyrighted books.

And training data sets for these models include more than books. In the rush to build and train ever-larger AI models, developers have swept up much of the searchable Internet. This not only has the potential to violate copyrights but also threatens the privacy of the billions of people who share information online. It also means that supposedly neutral models could be trained on biased data. A lack of corporate transparency makes it difficult to figure out exactly where companies are getting their training data—but Scientific American spoke with some AI experts who have a general idea.

Where do AI training data come from?

To build large generative AI models, developers turn to the public-facing Internet. But “there’s no one place where you can go download the Internet,” says Emily M. Bender, a linguist who studies computational linguistics and language technology at the University of Washington. Instead developers amass their training sets through automated tools that catalog and extract data from the Internet. Web “crawlers” travel from link to link indexing the location of information in a database, while Web “scrapers” download and extract that same information.
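The division of labor between crawler and scraper can be illustrated with a minimal sketch. It is not how any particular company's tooling works: the three-page `FAKE_WEB` dictionary, the `PageParser` class and the `crawl_and_scrape` function are all hypothetical names invented for this example, and the "web" here is an in-memory dictionary, so nothing is fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. No network access is involved.
FAKE_WEB = {
    "https://example.org/": (
        '<a href="https://example.org/blog">Blog</a> '
        '<a href="https://example.org/about">About</a><p>Welcome.</p>'
    ),
    "https://example.org/blog": (
        '<p>A public blog post.</p><a href="https://example.org/">Home</a>'
    ),
    "https://example.org/about": "<p>About this site.</p>",
}


class PageParser(HTMLParser):
    """Collects outgoing links (the crawler's job) and visible text (the scraper's job)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every <a href="..."> so the crawl can follow it.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep non-empty runs of text found between tags.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def crawl_and_scrape(start_url, pages):
    """Breadth-first crawl from start_url; return {url: extracted text}."""
    seen = {start_url}
    queue = deque([start_url])
    scraped = {}
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # the link points outside the toy web
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        scraped[url] = " ".join(parser.text_parts)  # scraping: keep the content
        for link in parser.links:  # crawling: follow links not yet visited
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return scraped


corpus = crawl_and_scrape("https://example.org/", FAKE_WEB)
for url, text in corpus.items():
    print(url, "->", text)
```

Starting from a single address, the loop reaches every page that is linked from somewhere it has already been, which is why content that is merely reachable without a login tends to end up in such collections.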

A very well-resourced company, such as Google’s owner, Alphabet, which already builds Web crawlers to power its search engine, can opt to employ its own tools for the task, says machine learning researcher Jesse Dodge of the nonprofit Allen Institute for AI. Other companies, however, turn to existing resources such as Common Crawl, which helped feed OpenAI’s GPT-3, or databases such as the Large-Scale Artificial Intelligence Open Network (LAION), which contains links to images and their accompanying captions. Neither Common Crawl nor LAION responded to requests for comment. Companies that want to use LAION as an AI resource (it was part of the training set for image generator Stable Diffusion, Dodge says) can follow these links but must download the content themselves.

Web crawlers and scrapers can easily access data from just about anywhere that’s not behind a login page. Social media profiles set to private aren’t included. But data that are viewable in a search engine or without logging into a site, such as a public LinkedIn profile, might still be vacuumed up, Dodge says. Then, he adds, “there’s the kinds of things that absolutely end up in these Web scrapes”—including blogs, personal webpages and company sites. This includes anything on popular photograph-sharing site Flickr, online marketplaces, voter registration databases, government webpages, Wikipedia, Reddit, research repositories, news outlets and academic institutions. Plus, there are pirated content compilations and Web archives, which often contain data that have since been removed from their original location on the Web. And scraped databases do not go away. “If there was text scraped from a public website in 2018, that’s forever going to be available, whether [the site or post has] been taken down or not,” Dodge notes.

Some data crawlers and scrapers are even able to get past paywalls (including Scientific American’s) by disguising themselves behind paid accounts, says Ben Zhao, a computer scientist at the University of Chicago. “You’d be surprised at how far these crawlers and model trainers are willing to go for more data,” Zhao says. Paywalled news sites were among the top data sources included in Google’s C4 database (used to train Google’s LLM T5 and Meta’s LLaMA), according to a joint analysis by the Washington Post and the Allen Institute.

Web scrapers can also hoover up surprising kinds of personal information of unclear origins. Zhao points to one particularly striking example where an artist discovered that a private diagnostic medical image of herself was included in the LAION database. Reporting from Ars Technica confirmed the artist’s account and that the same data set contained medical record photographs of thousands of other people as well. It’s impossible to know exactly how these images ended up being included in LAION, but Zhao points out that data get misplaced, privacy settings are often lax, and leaks and breaches are common. Information not intended for the public Internet ends up there all the time.

In addition to data from these Web scrapes, AI companies might purposefully incorporate other sources—including their own internal data—into their model training. OpenAI fine-tunes its models based on user interactions with its chatbots. Meta has said its latest AI was partially trained on public Facebook and Instagram posts. According to Elon Musk, the social media platform X (formerly known as Twitter) plans to do the same with its own users’ content. Amazon, too, says it will use voice data from customers’ Alexa conversations to train its new LLM.

But beyond these acknowledgements, companies have become increasingly cagey about revealing details on their data sets in recent months. Though Meta offered a general data breakdown in its technical paper on the first version of LLaMA, the release of LLaMA 2 a few months later included far less information. Google, too, didn’t specify the data sources for its recently released PaLM 2 AI model, beyond saying that much more data were used to train PaLM 2 than to train the original version of PaLM. OpenAI wrote that it would not disclose any details on its training data set or method for GPT-4, citing competition as a chief concern.

Why are dodgy training data a problem?

AI models can regurgitate the same material that was used to train them—including sensitive personal data and copyrighted work. Many widely used generative AI models have blocks meant to prevent them from sharing identifying information about individuals, but researchers have repeatedly demonstrated ways to get around these restrictions. For creative workers, even when AI outputs don’t exactly qualify as plagiarism, Zhao says they can eat into paid opportunities by, for example, aping a specific artist’s unique visual techniques. But without transparency about data sources, it’s difficult to blame such outputs on the AI’s training; after all, it could be coincidentally “hallucinating” the problematic material.

A lack of transparency about training data also raises serious issues related to data bias, says Meredith Broussard, a data journalist who researches artificial intelligence at New York University. “We all know there is wonderful stuff on the Internet, and there is extremely toxic material on the Internet,” she says. Data sets such as Common Crawl, for instance, include white supremacist websites and hate speech. Even less extreme sources of data contain content that promotes stereotypes. Plus, there’s a lot of pornography online. As a result, Broussard points out, AI image generators tend to produce sexualized images of women. “It’s bias in, bias out,” she says.

Bender echoes this concern and points out that the bias goes even deeper—down to who can post content to the Internet in the first place. “That is going to skew wealthy, skew Western, skew towards certain age groups, and so on,” she says. Online harassment compounds the problem by forcing marginalized groups out of some online spaces, Bender adds. This means data scraped from the Internet fail to represent the full diversity of the real world. It’s hard to understand the value and appropriate application of a technology so steeped in skewed information, Bender says, especially if companies aren’t forthright about potential sources of bias.

How can you protect your data from AI?

Unfortunately, there are currently very few options for meaningfully keeping data out of the maws of AI models. Zhao and his colleagues have developed a tool called Glaze, which can be used to make images effectively unreadable to AI models. But the researchers have only been able to test its efficacy with a subset of AI image generators, and its uses are limited. For one thing, it can only protect images that haven’t previously been posted online. Anything else may have already been vacuumed up into Web scrapes and training data sets. As for text, no such similar tool exists.

Website owners can insert digital flags telling Web crawlers and scrapers to not collect site data, Zhao says. It’s up to the scraper developer, however, to opt to abide by these notices.
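The most common form such a flag takes is a robots.txt file at the root of a site. The sketch below, which is only an illustration, uses Python's standard `urllib.robotparser` module to show how a well-behaved crawler would read one. The rules in `ROBOTS_TXT`, the example.org addresses, the `SomeOtherBot` agent and the `may_fetch` helper are assumptions made for the example; `CCBot` is the user-agent name Common Crawl's crawler identifies itself with.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: it asks one named crawler to
# stay away entirely and asks every other crawler to skip /private/.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("CCBot", "https://example.org/blog"))
print(may_fetch("SomeOtherBot", "https://example.org/blog"))
print(may_fetch("SomeOtherBot", "https://example.org/private/notes"))
```

Nothing in this check is enforced by the website: the file only states a preference, and a scraper that never calls anything like `may_fetch` can download the pages regardless.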

In California and a handful of other states, recently passed digital privacy laws give consumers the right to request that companies delete their data. In the European Union, too, people have the right to data deletion. So far, however, AI companies have pushed back on such requests by claiming the provenance of the data can’t be proven—or by ignoring the requests altogether—says Jennifer King, a privacy and data researcher at Stanford University.

Even if companies respect such requests and remove your information from a training set, there’s no clear strategy for getting an AI model to unlearn what it has previously absorbed, Zhao says. To truly pull all the copyrighted or potentially sensitive information out of these AI models, one would have to effectively retrain the AI from scratch, which can cost up to tens of millions of dollars, Dodge says.

Currently there are no significant AI policies or legal rulings that would require tech companies to take such actions—and that means they have no incentive to go back to the drawing board.

ABOUT THE AUTHOR(S)

Lauren Leffer is a tech reporting fellow at Scientific American. Previously, she has covered environmental issues, science and health. Follow her on Twitter @lauren_leffer

Copyright for syndicated content belongs to the linked source: Scientific American – https://www.scientificamerican.com/article/your-personal-information-is-probably-being-used-to-train-generative-ai-models/


© 2023 earth-news.info
