Correspondence
Published: 01 May 2024
Nature Biotechnology
(2024)
Many of Jennifer Listgarten’s arguments are compelling, in particular that the protein folding problem is an outlier relative to other grand challenges in science, both in how precisely the problem can be stated and performance measured and in the amount of available, high-quality data¹. However, although existing biological databases tend to be small relative to the compendia used to train large language models, it seems plausible that one type of biological data, whole-genome sequencing, will soon be generated at massive scale, contrary to what was argued¹. As genome sequencing costs fall and the potential for clinical use of genomic data grows, it will make economic sense to fully sequence everyone. Each 3-billion-base-pair individual genome can be represented as 30 million unique bases, so fully sequencing the US population of 300 million individuals yields a total of 9 × 10¹⁵ bases, which is comparable in size to the 400-terabyte Common Crawl dataset used to train large language models. Using such data to train large-scale machine learning models will be challenging because of privacy considerations. Nonetheless, I see at least four paths by which such models could be built on massive genomic data.
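The scale estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the genome, population and Common Crawl figures are the ones quoted in the text, whereas the 2-bits-per-base encoding and the function names are assumptions introduced here to convert a base count into bytes.

```python
# Figures quoted in the text.
GENOME_LENGTH_BP = 3_000_000_000       # bases in one individual genome
UNIQUE_BASES_PER_GENOME = 30_000_000   # unique bases per individual
US_POPULATION = 300_000_000            # individuals
COMMON_CRAWL_BYTES = 400 * 10**12      # 400 terabytes (decimal)

# Assumption (not from the text): each base (A, C, G or T) is packed into 2 bits.
BITS_PER_BASE = 2


def total_unique_bases(population: int, unique_per_genome: int) -> int:
    """Total unique bases across a fully sequenced population."""
    return population * unique_per_genome


def packed_size_bytes(n_bases: int, bits_per_base: int) -> int:
    """Storage needed for n_bases at a fixed bit width, rounded up to whole bytes."""
    return (n_bases * bits_per_base + 7) // 8


if __name__ == "__main__":
    bases = total_unique_bases(US_POPULATION, UNIQUE_BASES_PER_GENOME)
    size = packed_size_bytes(bases, BITS_PER_BASE)
    print(f"unique fraction of each genome: {UNIQUE_BASES_PER_GENOME / GENOME_LENGTH_BP:.2%}")
    print(f"total unique bases: {bases:.1e}")
    print(f"packed size: {size / 10**15:.2f} PB")
    print(f"ratio to Common Crawl: {size / COMMON_CRAWL_BYTES:.1f}x")
```

Under the assumed 2-bit encoding, 9 × 10¹⁵ bases occupy about 2.25 petabytes, which is within an order of magnitude of the 400-terabyte figure. A different encoding would shift the ratio accordingly.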
The first path involves federated data access. A federated approach uses software to enable multiple databases to function as one, facilitating interoperability while maintaining autonomy and decentralization². Federation capabilities are supported by existing genomic biobanks, such as the UK Biobank, NIH All of Us and Finland’s FinnGen initiative³, and are further facilitated by commercial entities such as lifebit.ai. In a federated approach, a deep learning model can be trained on data drawn from multiple biobanks while maintaining privacy guarantees.
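To make the idea of federated training concrete, here is a minimal sketch of one common scheme, federated averaging. The text does not name a specific algorithm, so this choice is an assumption. Each hypothetical site fits a toy linear model on its own records and shares only the model weights, which a coordinator combines in proportion to sample counts. The site data, the model and all function names are invented for illustration. A real deployment would use a deep network and would need further safeguards, not shown here, before any formal privacy guarantee could be claimed.

```python
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps on one site's private data.

    Toy model: y ~ w0 + w1 * x. Only the updated weights leave the site.
    """
    w0, w1 = weights
    n = len(data)
    for _ in range(epochs):
        g0 = sum((w0 + w1 * x - y) for x, y in data) / n
        g1 = sum((w0 + w1 * x - y) * x for x, y in data) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return (w0, w1)


def federated_average(site_weights, site_sizes):
    """Combine per-site weights, weighting each site by its sample count."""
    total = sum(site_sizes)
    return tuple(
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(2)
    )


def train_federated(sites, rounds=200):
    """Alternate local training and central averaging for a fixed number of rounds."""
    weights = (0.0, 0.0)
    sizes = [len(data) for data in sites]
    for _ in range(rounds):
        updates = [local_update(weights, data) for data in sites]
        weights = federated_average(updates, sizes)
    return weights


def make_site(n, seed):
    """Hypothetical site: n synthetic records drawn from y = 2 + 3x plus small noise."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        records.append((x, 2.0 + 3.0 * x + rng.gauss(0.0, 0.01)))
    return records


if __name__ == "__main__":
    # Three stand-in "biobanks" of different sizes; raw records are never pooled.
    sites = [make_site(50, 1), make_site(200, 2), make_site(120, 3)]
    print(train_federated(sites))
```

The coordinator in this sketch never sees an individual record, only weight tuples, which is the property that lets each database keep its autonomy while contributing to a shared model.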
References
1. Listgarten, J. Nat. Biotechnol. 42, 371–373 (2024).
2. Alvarellos, M. et al. Front. Genet. 13, 1045450 (2023).
3. Global Alliance for Genomics and Health. Science 352, 1278–1280 (2016).
4. Gubar, S. A cancer researcher takes cancer personally. The New York Times (15 February 2018).
5. Buolamwini, J. & Gebru, T. Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency 77–91 (PMLR, 2018).
Author information
Authors and Affiliations
Department of Genome Sciences, University of Washington, Seattle, WA, USA
William Stafford Noble
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
William Stafford Noble
Ethics declarations
Competing interests
The author declares no competing interests.
About this article
Cite this article
Noble, W.S. Response to “The perpetual motion machine of AI-generated data and the distraction of ChatGPT as a ‘scientist’”.
Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02230-2
DOI: https://doi.org/10.1038/s41587-024-02230-2