Presentation Information

[5L1-OS-30-04] Comprehensive Benchmark Evaluation of AI Agent Capabilities in Scientific Tasks: Cross-Benchmark Analysis and Visualization of the Current State of AI for Science

〇Takayuki Suzuki1, Ryota Yamada1 (1. Science Aid.inc.)

Keywords: AI for science, AI Scientist, Benchmark Evaluation

This study aims to systematically evaluate the capabilities of LLMs and AI agents on scientific tasks and provide objective evidence for assessing the reliability of LLMs and LLM-based systems.
As an initial step, we evaluated three LLMs and a custom AI agent (BixBench-Modified-Agent) using four publicly available scientific benchmarks: GPQA, FrontierScience, LAB-Bench, and BixBench. Results showed that Gemini-3-flash-preview achieved the highest performance across all benchmarks. On BixBench, an agent-based benchmark, accuracy reached only 46.8% even when the agent was powered by Gemini-3-flash-preview, indicating that complex life science data analysis remains challenging. Notable differences in response strategies were also observed: Gemini achieved both high Coverage (98.3%) and high Precision (68.9%), whereas the other models tended to decline to answer, and declining did not improve their Precision.
These findings help researchers determine which tasks can be delegated to LLMs and LLM-based systems. Future work will continue to expand the set of benchmarks, LLMs, and AI agents evaluated, toward a comprehensive visualization of the current state of AI for Science.
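
The abstract reports Coverage and Precision without defining them. The sketch below shows one common way such metrics are computed when a model is allowed to decline. It assumes Coverage is the share of questions the model attempts, Precision is the share of attempted questions answered correctly, and accuracy is the share of all questions answered correctly. These definitions, the function name `coverage_and_precision`, and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Optional, Sequence, Tuple


def coverage_and_precision(
    predictions: Sequence[Optional[str]],
    answers: Sequence[str],
) -> Tuple[float, float, float]:
    """Return (coverage, precision, accuracy) for one model on one benchmark.

    A prediction of None means the model declined to answer.
    Assumed definitions (not taken from the abstract):
      coverage  = answered / total
      precision = correct / answered
      accuracy  = correct / total
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    total = len(answers)
    if total == 0:
        raise ValueError("at least one question is required")

    # Keep only the questions the model actually attempted.
    answered = [(p, a) for p, a in zip(predictions, answers) if p is not None]
    correct = sum(1 for p, a in answered if p == a)

    coverage = len(answered) / total
    # If the model declined everything, precision is undefined; report 0.0.
    precision = correct / len(answered) if answered else 0.0
    accuracy = correct / total
    return coverage, precision, accuracy


# Toy data: four multiple-choice questions, one declined, one wrong.
toy_predictions = ["A", None, "C", "B"]
toy_answers = ["A", "B", "D", "B"]
coverage, precision, accuracy = coverage_and_precision(toy_predictions, toy_answers)
print(f"coverage={coverage:.3f} precision={precision:.3f} accuracy={accuracy:.3f}")
```

Under these assumed definitions, accuracy equals coverage times precision, so a model that declines often can only match a high-coverage model's accuracy if its precision on the attempted questions is correspondingly higher.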
