GenBank2PubMed: Bridging Viral Genomic Data and the Scientific Literature with AI-Assisted Curation

Kaiming Tao, Jinru Zhou, Yimam Getaneh, Robert W. Shafer

https://doi.org/10.21203/rs.3.rs-6710551/v1

Abstract

Background: GenBank entries of pathogenic viral sequences are typically annotated with host species and epidemiological metadata. However, linking these entries to their corresponding published studies remains labor-intensive.

Methods: We developed GenBank2PubMed, a computational pipeline that integrates GenBank sequence data with metadata from published studies. The pipeline aggregates GenBank entries into submission sets based on shared authorship, title similarity, submission dates, and the sequential nature of their accession numbers. Using automated methods, including GPT-4, we linked these submission sets to relevant publications – a challenging task given that many GenBank entries lack citation references. The result is a database in which viral sequences are annotated by host, country, and year of isolation. We also conducted a systematic review to assess how frequently published studies reporting sequences included GenBank submissions. We applied GenBank2PubMed to three high-mortality viruses with outbreak potential: Crimean-Congo Hemorrhagic Fever (CCHF) virus, Lassa virus, and Nipah virus.
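The grouping step described above can be illustrated with a minimal sketch. This is not the GenBankRefs implementation: the `Entry` fields, the function names, the thresholds (`title_cutoff`, `max_days`, `max_gap`), and the rule for combining the four signals are all assumptions chosen for illustration, and the accession numbers in the usage note are made up.

```python
import re
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Entry:
    """Hypothetical, simplified view of one GenBank record."""
    accession: str
    authors: frozenset
    title: str
    submitted: date


def split_accession(acc):
    """Split an accession such as 'XX000001' into ('XX', 1); None if it does not fit."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", acc)
    return (m.group(1), int(m.group(2))) if m else None


def same_set(a, b, title_cutoff=0.9, max_days=30, max_gap=50):
    """Assumed rule: shared author AND similar title AND (close dates OR nearby accessions)."""
    shared_author = bool(a.authors & b.authors)
    ratio = SequenceMatcher(None, a.title.lower(), b.title.lower()).ratio()
    close_dates = abs((a.submitted - b.submitted).days) <= max_days
    pa, pb = split_accession(a.accession), split_accession(b.accession)
    sequential = (
        pa is not None and pb is not None
        and pa[0] == pb[0] and abs(pa[1] - pb[1]) <= max_gap
    )
    return shared_author and ratio >= title_cutoff and (close_dates or sequential)


def group_entries(entries):
    """Merge entries into sets with union-find so that matches are transitive."""
    parent = list(range(len(entries)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            if same_set(entries[i], entries[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, entry in enumerate(entries):
        groups.setdefault(find(i), []).append(entry.accession)
    return sorted(sorted(g) for g in groups.values())
```

For example, two records with the placeholder accessions `XX000001` and `XX000002`, a shared author, the same title, and the same submission date would fall into one set, while an unrelated record would form its own set. The pairwise comparison is quadratic in the number of entries, which is acceptable for a few thousand records but would need blocking (for instance by author) at larger scale.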

Results: We identified 193 CCHF virus submission sets (4,754 entries), 78 Lassa virus sets (2,663 entries), and 34 Nipah virus sets (355 entries). Of these, 173 (CCHF), 64 (Lassa), and 31 (Nipah) were linked to published studies. Integration with publication data enriched the contextual and epidemiological metadata for each set. Additionally, our literature review found that 80.1% of CCHF, 86.6% of Lassa, and 87.5% of Nipah virus studies reporting sequences had corresponding GenBank submissions. GenBank submission sets and relational databases for each virus are available at https://hivdb.stanford.edu/genbank2pubmed/; the pipeline is available at https://github.com/hivdb/GenBankRefs. 
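The share of submission sets linked to a publication follows directly from the counts above. The short sketch below only restates that arithmetic; the dictionary layout and the function name are illustrative choices, not part of the pipeline.

```python
# Counts taken from the Results paragraph above.
RESULTS = {
    "CCHF": {"sets": 193, "entries": 4754, "linked": 173},
    "Lassa": {"sets": 78, "entries": 2663, "linked": 64},
    "Nipah": {"sets": 34, "entries": 355, "linked": 31},
}


def linked_percent(virus):
    """Percentage of a virus's submission sets that were linked to a published study."""
    row = RESULTS[virus]
    return round(100 * row["linked"] / row["sets"], 1)


total_sets = sum(row["sets"] for row in RESULTS.values())
total_linked = sum(row["linked"] for row in RESULTS.values())
overall_percent = round(100 * total_linked / total_sets, 1)
```

This gives 89.6% for CCHF, 82.1% for Lassa, and 91.2% for Nipah, or 268 of 305 sets (87.9%) overall.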

Conclusions: Creating submission sets facilitates the organization of GenBank data into browsable spreadsheets and queryable databases. GPT-4 contributed to linking GenBank entries with published studies and extracting metadata, although manual validation remained essential for accuracy. GenBank2PubMed represents a significant step toward integrating GenBank viral sequences with the scientific literature in which they are reported.
