* . *
  • About
  • Advertise
  • Privacy & Policy
  • Contact
Wednesday, December 10, 2025
Earth-News
  • Home
  • Business
  • Entertainment
    What Netflix’s Acquisition of Warner Bros. Means for the Movies – WKTV

    How Netflix’s Acquisition of Warner Bros. Is Set to Revolutionize the Future of Movies

    ‘An entertainment pavilion on bones’: new Russian museum opens in occupied Mariupol – The Art Newspaper

    ‘An entertainment pavilion on bones’: new Russian museum opens in occupied Mariupol – The Art Newspaper

    5th Miramar International Fashion Weekend brings runway shows, live entertainment to City Hall Plaza – WSVN

    5th Miramar International Fashion Weekend brings runway shows, live entertainment to City Hall Plaza – WSVN

    Country music icon updates fans after heart attack: ‘Got a lot of work I want to do’ – PennLive.com

    Country music icon updates fans after heart attack: ‘Got a lot of work I want to do’ – PennLive.com

    Ex-‘Grey’s Anatomy’ star opens up battle against incurable disease – PennLive.com

    Ex-‘Grey’s Anatomy’ star opens up battle against incurable disease – PennLive.com

    “This acquisition brings together two pioneering entertainment businesses, combining Netflix’s innovation, global reach and best-in-class streaming service with Warner Bros.’ century-long legacy of world-class storytelling.” – facebook.com

    Netflix and Warner Bros. Join Forces to Revolutionize Entertainment with Unmatched Innovation and Legendary Storytelling

  • General
  • Health
  • News

    Cracking the Code: Why China’s Economic Challenges Aren’t Shaking Markets, Unlike America’s” – Bloomberg

    Trump’s Narrow Window to Spread the Truth About Harris

    Trump’s Narrow Window to Spread the Truth About Harris

    Israel-Gaza war live updates: Hamas leader Ismail Haniyeh assassinated in Iran, group says

    Israel-Gaza war live updates: Hamas leader Ismail Haniyeh assassinated in Iran, group says

    PAP Boss to Niger Delta Youths, Stay Away from the Protest

    PAP Boss to Niger Delta Youths, Stay Away from the Protest

    Court Restricts Protests In Lagos To Freedom, Peace Park

    Court Restricts Protests In Lagos To Freedom, Peace Park

    Fans React to Jazz Jennings’ Inspiring Weight Loss Journey

    Fans React to Jazz Jennings’ Inspiring Weight Loss Journey

    Trending Tags

    • Trump Inauguration
    • United Stated
    • White House
    • Market Stories
    • Election Results
  • Science
  • Sports
  • Technology
    Geothermal Heat Exchange Technology Evaluated as a Potential Solution for Grid Support and Sustainable Cooling in Hawaii – SolarQuarter

    Exploring Geothermal Heat Exchange Technology as a Game-Changer for Grid Support and Sustainable Cooling in Hawaii

    Pompeii offers insights into ancient Roman building technology – MIT News

    Uncover the Hidden Secrets of Ancient Roman Building Technology Through Pompeii

    Orlando Airport Expands Use of Facial ID Technology – GovTech

    Orlando Airport Boosts Security with Cutting-Edge Facial Recognition Technology

    Nearly 50% crash in Kaynes Technology share price wipes out ₹5000 crore wealth of Mutual funds – livemint.com

    Nearly 50% crash in Kaynes Technology share price wipes out ₹5000 crore wealth of Mutual funds – livemint.com

    Oregon fisheries try old technology to boost salmon returns – Oregon Public Broadcasting – OPB

    Oregon Fisheries Turn to Time-Tested Techniques to Boost Salmon Returns

    An Intrinsic Calculation For Bytes Technology Group plc (LON:BYIT) Suggests It’s 27% Undervalued – Yahoo Finance

    Intrinsic Valuation Reveals Bytes Technology Group Is Undervalued by 27%

    Trending Tags

    • Nintendo Switch
    • CES 2017
    • Playstation 4 Pro
    • Mark Zuckerberg
No Result
View All Result
  • Home
  • Business
  • Entertainment
    What Netflix’s Acquisition of Warner Bros. Means for the Movies – WKTV

    How Netflix’s Acquisition of Warner Bros. Is Set to Revolutionize the Future of Movies

    ‘An entertainment pavilion on bones’: new Russian museum opens in occupied Mariupol – The Art Newspaper

    ‘An entertainment pavilion on bones’: new Russian museum opens in occupied Mariupol – The Art Newspaper

    5th Miramar International Fashion Weekend brings runway shows, live entertainment to City Hall Plaza – WSVN

    5th Miramar International Fashion Weekend brings runway shows, live entertainment to City Hall Plaza – WSVN

    Country music icon updates fans after heart attack: ‘Got a lot of work I want to do’ – PennLive.com

    Country music icon updates fans after heart attack: ‘Got a lot of work I want to do’ – PennLive.com

    Ex-‘Grey’s Anatomy’ star opens up battle against incurable disease – PennLive.com

    Ex-‘Grey’s Anatomy’ star opens up battle against incurable disease – PennLive.com

    “This acquisition brings together two pioneering entertainment businesses, combining Netflix’s innovation, global reach and best-in-class streaming service with Warner Bros.’ century-long legacy of world-class storytelling.” – facebook.com

    Netflix and Warner Bros. Join Forces to Revolutionize Entertainment with Unmatched Innovation and Legendary Storytelling

  • General
  • Health
  • News

    Cracking the Code: Why China’s Economic Challenges Aren’t Shaking Markets, Unlike America’s” – Bloomberg

    Trump’s Narrow Window to Spread the Truth About Harris

    Trump’s Narrow Window to Spread the Truth About Harris

    Israel-Gaza war live updates: Hamas leader Ismail Haniyeh assassinated in Iran, group says

    Israel-Gaza war live updates: Hamas leader Ismail Haniyeh assassinated in Iran, group says

    PAP Boss to Niger Delta Youths, Stay Away from the Protest

    PAP Boss to Niger Delta Youths, Stay Away from the Protest

    Court Restricts Protests In Lagos To Freedom, Peace Park

    Court Restricts Protests In Lagos To Freedom, Peace Park

    Fans React to Jazz Jennings’ Inspiring Weight Loss Journey

    Fans React to Jazz Jennings’ Inspiring Weight Loss Journey

    Trending Tags

    • Trump Inauguration
    • United Stated
    • White House
    • Market Stories
    • Election Results
  • Science
  • Sports
  • Technology
    Geothermal Heat Exchange Technology Evaluated as a Potential Solution for Grid Support and Sustainable Cooling in Hawaii – SolarQuarter

    Exploring Geothermal Heat Exchange Technology as a Game-Changer for Grid Support and Sustainable Cooling in Hawaii

    Pompeii offers insights into ancient Roman building technology – MIT News

    Uncover the Hidden Secrets of Ancient Roman Building Technology Through Pompeii

    Orlando Airport Expands Use of Facial ID Technology – GovTech

    Orlando Airport Boosts Security with Cutting-Edge Facial Recognition Technology

    Nearly 50% crash in Kaynes Technology share price wipes out ₹5000 crore wealth of Mutual funds – livemint.com

    Nearly 50% crash in Kaynes Technology share price wipes out ₹5000 crore wealth of Mutual funds – livemint.com

    Oregon fisheries try old technology to boost salmon returns – Oregon Public Broadcasting – OPB

    Oregon Fisheries Turn to Time-Tested Techniques to Boost Salmon Returns

    An Intrinsic Calculation For Bytes Technology Group plc (LON:BYIT) Suggests It’s 27% Undervalued – Yahoo Finance

    Intrinsic Valuation Reveals Bytes Technology Group Is Undervalued by 27%

    Trending Tags

    • Nintendo Switch
    • CES 2017
    • Playstation 4 Pro
    • Mark Zuckerberg
No Result
View All Result
Earth-News
No Result
View All Result
Home Science

Your Personal Information Is Probably Being Used to Train Generative AI Models

October 19, 2023
in Science
Your Personal Information Is Probably Being Used to Train Generative AI Models
Share on FacebookShare on Twitter

Artists and writers are up in arms about generative artificial intelligence systems—understandably so. These machine learning models are only capable of pumping out images and text because they’ve been trained on mountains of real people’s creative work, much of it copyrighted. Major AI developers including OpenAI, Meta and Stability AI now face multiple lawsuits on this. Such legal claims are supported by independent analyses; in August, for instance, the Atlantic reported finding that Meta trained its large language model (LLM) in part on a data set called Books3, which contained more than 170,000 pirated and copyrighted books.

And training data sets for these models include more than books. In the rush to build and train ever-larger AI models, developers have swept up much of the searchable Internet. This not only has the potential to violate copyrights but also threatens the privacy of the billions of people who share information online. It also means that supposedly neutral models could be trained on biased data. A lack of corporate transparency makes it difficult to figure out exactly where companies are getting their training data—but Scientific American spoke with some AI experts who have a general idea.

Where do AI training data come from?

To build large generative AI models, developers turn to the public-facing Internet. But “there’s no one place where you can go download the Internet,” says Emily M. Bender, a linguist who studies computational linguistics and language technology at the University of Washington. Instead developers amass their training sets through automated tools that catalog and extract data from the Internet. Web “crawlers” travel from link to link indexing the location of information in a database, while Web “scrapers” download and extract that same information.

A very well-resourced company, such as Google’s owner, Alphabet, which already builds Web crawlers to power its search engine, can opt to employ its own tools for the task, says machine learning researcher Jesse Dodge of the nonprofit Allen Institute for AI. Other companies, however, turn to existing resources such as Common Crawl, which helped feed OpenAI’s GPT-3, or databases such as the Large-Scale Artificial Intelligence Open Network (LAION), which contains links to images and their accompanying captions. Neither Common Crawl nor LAION responded to requests for comment. Companies that want to use LAION as an AI resource (it was part of the training set for image generator Stable Diffusion, Dodge says) can follow these links but must download the content themselves.

Web crawlers and scrapers can easily access data from just about anywhere that’s not behind a login page. Social media profiles set to private aren’t included. But data that are viewable in a search engine or without logging into a site, such as a public LinkedIn profile, might still be vacuumed up, Dodge says. Then, he adds, “there’s the kinds of things that absolutely end up in these Web scrapes”—including blogs, personal webpages and company sites. This includes anything on popular photograph-sharing site Flickr, online marketplaces, voter registration databases, government webpages, Wikipedia, Reddit, research repositories, news outlets and academic institutions. Plus, there are pirated content compilations and Web archives, which often contain data that have since been removed from their original location on the Web. And scraped databases do not go away. “If there was text scraped from a public website in 2018, that’s forever going to be available, whether [the site or post has] been taken down or not,” Dodge notes.

Some data crawlers and scrapers are even able to get past paywalls (including Scientific American’s) by disguising themselves behind paid accounts, says Ben Zhao, a computer scientist at the University of Chicago. “You’d be surprised at how far these crawlers and model trainers are willing to go for more data,” Zhao says. Paywalled news sites were among the top data sources included in Google’s C4 database (used to train Google’s LLM T5 and Meta’s LLaMA), according to a joint analysis by the Washington Post and the Allen Institute.

Web scrapers can also hoover up surprising kinds of personal information of unclear origins. Zhao points to one particularly striking example where an artist discovered that a private diagnostic medical image of herself was included in the LAION database. Reporting from Ars Technica confirmed the artist’s account and that the same data set contained medical record photographs of thousands of other people as well. It’s impossible to know exactly how these images ended up being included in LAION, but Zhao points out that data get misplaced, privacy settings are often lax, and leaks and breaches are common. Information not intended for the public Internet ends up there all the time.

In addition to data from these Web scrapes, AI companies might purposefully incorporate other sources—including their own internal data—into their model training. OpenAI fine-tunes its models based on user interactions with its chatbots. Meta has said its latest AI was partially trained on public Facebook and Instagram posts. According to Elon Musk, the social media platform X (formerly known as Twitter) plans to do the same with its own users’ content. Amazon, too, says it will use voice data from customers’ Alexa conversations to train its new LLM.

But beyond these acknowledgements, companies have become increasingly cagey about revealing details on their data sets in recent months. Though Meta offered a general data breakdown in its technical paper on the first version of LLaMA, the release of LLaMA 2 a few months later included far less information. Google, too, didn’t specify its data sources in its recently released PaLM2 AI model, beyond saying that much more data were used to train PaLM2 than to train the original version of PaLM. OpenAI wrote that it would not disclose any details on its training data set or method for GPT-4, citing competition as a chief concern.

Why are dodgy training data a problem?

AI models can regurgitate the same material that was used to train them—including sensitive personal data and copyrighted work. Many widely used generative AI models have blocks meant to prevent them from sharing identifying information about individuals, but researchers have repeatedly demonstrated ways to get around these restrictions. For creative workers, even when AI outputs don’t exactly qualify as plagiarism, Zhao says they can eat into paid opportunities by, for example, aping a specific artist’s unique visual techniques. But without transparency about data sources, it’s difficult to blame such outputs on the AI’s training; after all, it could be coincidentally “hallucinating” the problematic material.

A lack of transparency about training data also raises serious issues related to data bias, says Meredith Broussard, a data journalist who researches artificial intelligence at New York University. “We all know there is wonderful stuff on the Internet, and there is extremely toxic material on the Internet,” she says. Data sets such as Common Crawl, for instance, include white supremacist websites and hate speech. Even less extreme sources of data contain content that promotes stereotypes. Plus, there’s a lot of pornography online. As a result, Broussard points out, AI image generators tend to produce sexualized images of women. “It’s bias in, bias out,” she says.

Bender echoes this concern and points out that the bias goes even deeper—down to who can post content to the Internet in the first place. “That is going to skew wealthy, skew Western, skew towards certain age groups, and so on,” she says. Online harassment compounds the problem by forcing marginalized groups out of some online spaces, Bender adds. This means data scraped from the Internet fail to represent the full diversity of the real world. It’s hard to understand the value and appropriate application of a technology so steeped in skewed information, Bender says, especially if companies aren’t forthright about potential sources of bias.

How can you protect your data from AI?

Unfortunately, there are currently very few options for meaningfully keeping data out of the maws of AI models. Zhao and his colleagues have developed a tool called Glaze, which can be used to make images effectively unreadable to AI models. But the researchers have only been able to test its efficacy with a subset of AI image generators, and its uses are limited. For one thing, it can only protect images that haven’t previously been posted online. Anything else may have already been vacuumed up into Web scrapes and training data sets. As for text, no such similar tool exists.

Website owners can insert digital flags telling Web crawlers and scrapers to not collect site data, Zhao says. It’s up to the scraper developer, however, to opt to abide by these notices.

In California and a handful of other states, recently passed digital privacy laws give consumers the right to request that companies delete their data. In the European Union, too, people have the right to data deletion. So far, however, AI companies have pushed back on such requests by claiming the provenance of the data can’t be proven—or by ignoring the requests altogether—says Jennifer King, a privacy and data researcher at Stanford University.

Even if companies respect such requests and remove your information from a training set, there’s no clear strategy for getting an AI model to unlearn what it has previously absorbed, Zhao says. To truly pull all the copyrighted or potentially sensitive information out of these AI models, one would have to effectively retrain the AI from scratch, which can cost up to tens of millions of dollars, Dodge says.

Currently there are no significant AI policies or legal rulings that would require tech companies to take such actions—and that means they have no incentive to go back to the drawing board.

ABOUT THE AUTHOR(S)

Lauren Leffer is a tech reporting fellow at Scientific American. Previously, she has covered environmental issues, science and health. Follow her on Twitter @lauren_leffer

>>> Read full article>>>
Copyright for syndicated content belongs to the linked Source : Scientific American – https://www.scientificamerican.com/article/your-personal-information-is-probably-being-used-to-train-generative-ai-models/

Tags: Informationpersonalscience
Previous Post

UK’s global AI summit must provide solutions rather than suggestions

Next Post

This Public Health Measure Bridges the National Divide over Firearms–Just Don’t Call It Gun Control

II. Capitalism and Ecology: The Nature of the Contradiction – Monthly Review

The Clash Between Capitalism and Ecology: Unraveling the Core Contradiction Rewritten title: When Profit Meets Planet: Exploring the Deep Conflict Between Capitalism and Ecology

December 10, 2025
Thank goodness for a Free Press and Science – Marler Blog

Grateful for the Power of a Free Press and Science

December 10, 2025
Congressional Inquiry Into Science and Technology Agency Offices of Civil Rights – NASA Watch

Congress Launches Major Investigation into Civil Rights Practices at Science and Technology Agencies

December 10, 2025
Equity Lifestyle Properties, Inc. $ELS Position Boosted by First Trust Advisors LP – MarketBeat

First Trust Advisors LP Increases Stake in Equity Lifestyle Properties, Inc. $ELS

December 10, 2025
Geothermal Heat Exchange Technology Evaluated as a Potential Solution for Grid Support and Sustainable Cooling in Hawaii – SolarQuarter

Exploring Geothermal Heat Exchange Technology as a Game-Changer for Grid Support and Sustainable Cooling in Hawaii

December 10, 2025
Champions League live updates – Yahoo Sports

Champions League live updates – Yahoo Sports

December 10, 2025
Egypt protests planned pride celebrations for World Cup game vs. Iran – USA Today

Egypt Protests Planned Pride Celebrations Ahead of World Cup Clash with Iran

December 10, 2025
U.S. Economy Shows Mixed Signals Ahead of Likely Fed Cut – Russell Investments – Commentaries – Advisor Perspectives

U.S. Economy Shows Conflicting Signs as Fed Considers Possible Rate Cut

December 10, 2025
What Netflix’s Acquisition of Warner Bros. Means for the Movies – WKTV

How Netflix’s Acquisition of Warner Bros. Is Set to Revolutionize the Future of Movies

December 10, 2025
Negative health impacts caused by ‘forever chemicals’ linked to billions in economic losses – News-Medical

The Hidden Cost of ‘Forever Chemicals’: How Toxic Pollution is Draining Billions from Our Economy

December 10, 2025

Categories

Archives

December 2025
M T W T F S S
1234567
891011121314
15161718192021
22232425262728
293031  
« Nov    
Earth-News.info

The Earth News is an independent English-language daily published Website from all around the World News

Browse by Category

  • Business (20,132)
  • Ecology (962)
  • Economy (980)
  • Entertainment (21,856)
  • General (18,657)
  • Health (10,020)
  • Lifestyle (992)
  • News (22,149)
  • People (986)
  • Politics (993)
  • Science (16,195)
  • Sports (21,481)
  • Technology (15,962)
  • World (968)

Recent News

II. Capitalism and Ecology: The Nature of the Contradiction – Monthly Review

The Clash Between Capitalism and Ecology: Unraveling the Core Contradiction Rewritten title: When Profit Meets Planet: Exploring the Deep Conflict Between Capitalism and Ecology

December 10, 2025
Thank goodness for a Free Press and Science – Marler Blog

Grateful for the Power of a Free Press and Science

December 10, 2025
  • About
  • Advertise
  • Privacy & Policy
  • Contact

© 2023 earth-news.info

No Result
View All Result

© 2023 earth-news.info

No Result
View All Result

© 2023 earth-news.info

Go to mobile version