Response to “The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a ‘scientist’”

Correspondence

Published: 01 May 2024

Nature Biotechnology (2024)

Many of Jennifer Listgarten’s arguments are compelling: in particular, that the protein folding problem is an outlier relative to other grand challenges in science, both in how precisely the problem can be stated and performance measured and in the amount of available, high-quality data [1]. However, although existing biological databases tend to be small relative to the compendia used to train large language models, it seems plausible that one type of biological data, whole-genome sequencing, will soon be generated at massive scale, contrary to what was argued [1]. As genome sequencing costs fall and the potential for clinical use of genomic data rises, it will make economic sense to fully sequence everyone. Each 3-billion-base-pair individual genome can be represented as 30 million unique bases, so fully sequencing the US population of 300 million individuals yields a total of 9 × 10^15 bases, which is comparable in size to the 400-terabyte Common Crawl dataset used to train large language models. Using such data to train large-scale machine learning models will be challenging because of privacy considerations. Nonetheless, I see at least four paths by which such models could be built on massive genomic data.
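
As a rough check of the arithmetic above, the following Python sketch multiplies the two figures given in the text. The conversion from bases to bytes assumes a 2-bit-per-base encoding, which is an illustrative assumption only; the text does not specify an encoding, and a different encoding or compression scheme would give a different size.

```python
# Figures taken from the text.
UNIQUE_BASES_PER_GENOME = 30_000_000   # 30 million unique bases per individual
US_POPULATION = 300_000_000            # 300 million individuals

total_bases = UNIQUE_BASES_PER_GENOME * US_POPULATION
print(f"total bases: {total_bases:.1e}")

# ASSUMPTION (not from the text): each base (A, C, G or T) is packed into 2 bits.
BITS_PER_BASE = 2
total_bytes = total_bases * BITS_PER_BASE // 8
total_terabytes = total_bytes / 1e12
print(f"storage at {BITS_PER_BASE} bits per base: {total_terabytes:.0f} TB")
```

Under this assumed encoding the total comes to a few petabytes, which is within about an order of magnitude of the 400-terabyte figure quoted for Common Crawl.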

The first path involves federated data access. A federated approach uses software to enable multiple databases to function as one, facilitating interoperability while maintaining autonomy and decentralization [2]. Federation capabilities are supported by existing genomic biobanks, such as the UK Biobank, NIH All of Us and Finland’s FinnGen initiative [3], and are further facilitated by commercial entities such as lifebit.ai. In a federated approach, a deep learning model can be trained on data drawn from multiple biobanks while maintaining privacy guarantees.
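
To make the idea of federated training concrete, here is a minimal sketch of federated averaging, one common scheme for this setting; the text does not say which algorithm a biobank federation would use. Everything in it is hypothetical: the three "sites", their toy data, and the function names `local_update` and `federated_average`. A one-variable linear model stands in for a deep learning model. The sketch shows only that raw records stay at each site while model parameters are shared; it does not implement formal privacy guarantees such as secure aggregation or differential privacy.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs held privately at one site


def local_update(w: float, b: float, data: Dataset,
                 lr: float = 0.1, steps: int = 20) -> Tuple[float, float]:
    """Run a few gradient-descent steps on one site's own data (squared error)."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(sites: List[Dataset], rounds: int = 50) -> Tuple[float, float]:
    """Coordinate training: only parameters, never raw records, leave a site."""
    w, b = 0.0, 0.0
    total = sum(len(d) for d in sites)
    for _ in range(rounds):
        # Each site starts from the current global model and trains locally.
        updates = [(local_update(w, b, d), len(d)) for d in sites]
        # The coordinator averages the returned parameters, weighted by site size.
        w = sum(params[0] * n for params, n in updates) / total
        b = sum(params[1] * n for params, n in updates) / total
    return w, b


# Toy data following y = 2x + 1, split across three hypothetical biobanks.
site_a = [(0.0, 1.0), (1.0, 3.0)]
site_b = [(2.0, 5.0), (0.5, 2.0)]
site_c = [(1.5, 4.0), (-1.0, -1.0), (-0.5, 0.0)]

w, b = federated_average([site_a, site_b, site_c])
print(f"learned model: y = {w:.3f} x + {b:.3f}")
```

In a real deployment the local step would be a training pass of a much larger model, and the parameter exchange would typically be protected by additional mechanisms.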

References

1. Listgarten, J. Nat. Biotechnol. 42, 371–373 (2024).

2. Alvarellos, M. et al. Front. Genet. 13, 1045450 (2023).

3. Global Alliance for Genomics and Health. Science 352, 1278–1280 (2016).

4. Gubar, S. A cancer researcher takes cancer personally. The New York Times (15 February 2018).

5. Buolamwini, J. & Gebru, T. Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency 77–91 (PMLR, 2018).

Author information

Authors and Affiliations

Department of Genome Sciences, University of Washington, Seattle, WA, USA

William Stafford Noble

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA

William Stafford Noble

Corresponding author

Correspondence to William Stafford Noble.

Ethics declarations

Competing interests

The author declares no competing interests.

About this article

Cite this article

Noble, W.S. Response to “The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a ‘scientist’”. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02230-2

Published: 01 May 2024

DOI: https://doi.org/10.1038/s41587-024-02230-2