Presentation Information

[5O1-IS-1-06]Memory Efficient PagedAttention with Page Sharing

〇Yifeng Shen1, Hideyuki Kawashima1 (1. Keio University)
work-in-progress

Keywords:

PagedAttention, LLM, AI

As new LLM serving systems must batch-process an increasing number of requests, an efficient memory system is required. PagedAttention eliminates memory fragmentation by leveraging the paging technique commonly found in computer operating systems. It also offers functionality such as prefix sharing, which enables the reuse of prefixes shared across requests. However, as application-specific LLM usage grows, prompts and responses increasingly contain similar components that are not limited to prefixes. We therefore propose an extension to PagedAttention that shares memory pages whenever possible to maximize the effectiveness of memory usage. Our proposed algorithm achieves a significant reduction in total memory usage when many similar requests are processed, while adding only minimal overhead in other common use cases.
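The idea of sharing KV-cache pages beyond common prefixes can be illustrated with a small sketch. The following Python fragment is a hypothetical illustration only, not the authors' implementation: it assumes pages are deduplicated by hashing their token contents and shared via reference counting, so that any identical full page, wherever it occurs in a request, maps to the same physical page.

```python
# Hypothetical sketch of KV-cache page sharing via content hashing.
# All names (PagePool, allocate, PAGE_SIZE) are illustrative assumptions,
# not the API of PagedAttention or of the proposed system.

PAGE_SIZE = 4  # tokens per page (kept small for illustration)


class PagePool:
    """A pool of physical pages, deduplicated by page content."""

    def __init__(self):
        self.pages = {}     # page_id -> tuple of token ids stored in the page
        self.refcount = {}  # page_id -> number of sequences sharing the page
        self.index = {}     # content hash -> page_id, used for deduplication
        self.next_id = 0

    def get_page(self, tokens):
        """Return a page id for one full page of tokens, sharing if possible."""
        key = hash(tokens)
        pid = self.index.get(key)
        if pid is not None and self.pages[pid] == tokens:
            self.refcount[pid] += 1  # reuse the existing physical page
            return pid
        pid = self.next_id
        self.next_id += 1
        self.pages[pid] = tokens
        self.refcount[pid] = 1
        self.index[key] = pid
        return pid

    def release(self, pid):
        """Drop one reference; free the page when no sequence uses it."""
        self.refcount[pid] -= 1
        if self.refcount[pid] == 0:
            del self.refcount[pid]
            tokens = self.pages.pop(pid)
            self.index.pop(hash(tokens), None)


def allocate(pool, token_ids):
    """Map a sequence onto pages, deduplicating identical full pages."""
    return [pool.get_page(tuple(token_ids[i:i + PAGE_SIZE]))
            for i in range(0, len(token_ids), PAGE_SIZE)]


pool = PagePool()
seq_a = allocate(pool, [1, 2, 3, 4, 5, 6, 7, 8])
seq_b = allocate(pool, [9, 9, 9, 9, 5, 6, 7, 8])
# The second page of seq_b is identical to the second page of seq_a,
# so both sequences point at the same physical page even though it is
# not a shared prefix: seq_a[1] == seq_b[1], and only 3 pages exist.
```

In this sketch, sharing a non-prefix page requires only a hash lookup at allocation time, which is why the overhead in the non-sharing case stays small; an actual system would additionally need copy-on-write when a shared page is modified.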