Presentation Information
[1Yin-A-35] Evaluating LLM-based decision to resolve false "To be published" literature metadata in the Protein Data Bank
〇Koya Sakuma (1), Satomi Niwa (2) (1. Nagoya University, 2. The University of Osaka)
Keywords:
Protein Data Bank, Data–literature Linking, Structural Biology, Large Language Models (LLMs), Metadata Completion
We are conducting the PDB-Descriptome project, which aims to establish detailed pairings between protein three-dimensional structures deposited in the Protein Data Bank (PDB) and their corresponding descriptions in the scientific literature. However, a substantial number of PDB entries still lack updated literature link information, even though the structural biology papers that originally reported their three-dimensional structures (designated as Primary Citations in PDB nomenclature) have been published. Although each PDB entry is assigned a unique PDB ID, the mere mention of a PDB ID in a publication does not necessarily indicate a Primary Citation, because such mentions may serve other purposes, such as structural comparison. Consequently, pattern matching and similar approaches alone are unreliable for determining Primary Citation status. In the present study, we investigated whether large language models can determine Primary Citation status by analyzing the context surrounding exact matches of the relevant PDB ID, focusing on PDB entries with known Primary Citations for which full-text articles are available in PubMed Central. We conclude that LLM-based decision making can precisely identify Primary Citations among papers that mention the relevant PDB IDs.
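The context-extraction step described above might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `find_pdb_id_contexts` and `build_prompt`, the 200-character window, and the prompt wording are all hypothetical, and the call to an actual language model is deliberately left out.

```python
import re


def find_pdb_id_contexts(text, pdb_id, window=200):
    """Return the text surrounding each exact match of pdb_id in an article.

    Matching is case-insensitive. The window size is an assumed parameter.
    """
    # Lookarounds stop the 4-character ID from matching inside a longer
    # alphanumeric token (for example, an accession number).
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(pdb_id) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    contexts = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        contexts.append(text[start:end])
    return contexts


def build_prompt(pdb_id, contexts):
    """Assemble a hypothetical yes/no question for a language model."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        f"The following excerpts from one paper mention PDB ID {pdb_id.upper()}.\n"
        f"{numbered}\n"
        "Is this paper the Primary Citation, i.e. the paper that originally "
        "reported this structure? Answer yes or no."
    )
```

The resulting prompt would then be passed to whichever model is being evaluated. Because the entries studied have known Primary Citations, the model's answers could be scored against those known labels.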
