Presentation Information

[5O1-IS-1-06]Memory Efficient PagedAttention with Page Sharing

〇Yifeng Shen1, Hideyuki Kawashima1 (1. Keio University)
work-in-progress

Keywords:

PagedAttention, LLM, AI

As new LLM serving systems must batch-process an increasing number of requests, an efficient memory system is required. PagedAttention eliminates memory fragmentation by leveraging the paging technique commonly found in computer operating systems. It also offers functionality such as prefix sharing, which enables the reuse of prefixes shared across requests. However, as application-specific LLM usage grows, prompts and responses increasingly contain similar components that are not limited to prefixes. We therefore propose an extension to PagedAttention that shares memory pages whenever possible to maximize the effectiveness of memory usage. Our proposed algorithm achieves a significant reduction in total memory usage when many similar requests are processed, while adding only minimal overhead in other common use cases.
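The idea of sharing KV-cache pages beyond common prefixes can be illustrated with a small sketch. The following Python fragment is a hypothetical illustration only, not the authors' implementation: it assumes pages are deduplicated by hashing their token contents and shared via reference counting, so that any identical full page, wherever it occurs in a request, maps to the same physical page.

```python
# Hypothetical sketch of KV-cache page sharing via content hashing.
# All names (PagePool, allocate, PAGE_SIZE) are illustrative assumptions,
# not the API of PagedAttention or of the proposed system.

PAGE_SIZE = 4  # tokens per page (kept small for illustration)


class PagePool:
    """A pool of physical pages, deduplicated by page content."""

    def __init__(self):
        self.pages = {}     # page_id -> tuple of token ids stored in the page
        self.refcount = {}  # page_id -> number of sequences sharing the page
        self.index = {}     # content hash -> page_id, used for deduplication
        self.next_id = 0

    def get_page(self, tokens):
        """Return a page id for one full page of tokens, sharing if possible."""
        key = hash(tokens)
        pid = self.index.get(key)
        if pid is not None and self.pages[pid] == tokens:
            self.refcount[pid] += 1  # reuse the existing physical page
            return pid
        pid = self.next_id
        self.next_id += 1
        self.pages[pid] = tokens
        self.refcount[pid] = 1
        self.index[key] = pid
        return pid

    def release(self, pid):
        """Drop one reference; free the page when no sequence uses it."""
        self.refcount[pid] -= 1
        if self.refcount[pid] == 0:
            del self.refcount[pid]
            tokens = self.pages.pop(pid)
            self.index.pop(hash(tokens), None)


def allocate(pool, token_ids):
    """Map a sequence onto pages, deduplicating identical full pages."""
    return [pool.get_page(tuple(token_ids[i:i + PAGE_SIZE]))
            for i in range(0, len(token_ids), PAGE_SIZE)]


pool = PagePool()
seq_a = allocate(pool, [1, 2, 3, 4, 5, 6, 7, 8])
seq_b = allocate(pool, [9, 9, 9, 9, 5, 6, 7, 8])
# The second page of seq_b is identical to the second page of seq_a,
# so both sequences point at the same physical page even though it is
# not a shared prefix: seq_a[1] == seq_b[1], and only 3 pages exist.
```

In this sketch, sharing a non-prefix page requires only a hash lookup at allocation time, which is why the overhead in the non-sharing case stays small; an actual system would additionally need copy-on-write when a shared page is modified.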