EPub Bayreuth

Benchmarking KV-Cache Optimizations across Task Quality and System Performance for Long-Context Serving [Experiment, Analysis & Benchmark]

DOI for citing the version on EPub Bayreuth: https://doi.org/10.15495/EPub_UBT_00009365
URN for citing the version on EPub Bayreuth: urn:nbn:de:bvb:703-epub-9365-2

Title information

Agrawal, Nikita ; Mayer, Ruben:
Benchmarking KV-Cache Optimizations across Task Quality and System Performance for Long-Context Serving [Experiment, Analysis & Benchmark].
Bayreuth, 2026. - 13 pp.

Full text

Format:	PDF
Name:	main.pdf
Version:	Preprint
Available under the license	Creative Commons BY-NC-ND 4.0: Attribution, NonCommercial, NoDerivatives

Download (1MB)

Abstract

Large language model serving is increasingly limited by KV-cache growth under long-context workloads, yet existing KV-cache compression techniques are difficult to compare because they were evaluated on different models, tasks, budgets, and serving stacks. This paper presents a workload-aware benchmark of representative KV-cache optimization mechanisms spanning quantization, pruning, and merging, including KIVI, TurboQuant, SnapKV, and CaM, evaluated on LongBench-style multi-document QA, single-document QA, few-shot learning, and summarization workloads using Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3. The benchmark measures task quality, mean output throughput, mean time-to-first-token, and realized compression ratio across context-length buckets. The results show that the compression ratio alone is a poor predictor of end-to-end performance. KIVI4 provides the most stable quality across models, SnapKV delivers the strongest long-context throughput, and CaM yields large gains on selected QA workloads but exhibits substantial workload sensitivity in both quality and realized compression ratio. These findings motivate workload-aware selection of KV-cache mechanisms rather than one-size-fits-all compression and provide deployment guidance for long-context serving systems.

Further information

Publication type:	Preprint, Postprint
Additional information (publicly visible):	Submitted to: Proceedings of the VLDB Endowment ISSN 2150-8097
Keywords:	KV-caching; Memory-bound computing; LLMs; Long-context inference
DDC subject areas:	000 Computer science, information science, general works > 004 Computer science
Institutions of the University:	Faculties > Faculty of Mathematics, Physics and Computer Science > Institute of Computer Science > Chair of Applied Computer Science X > Chair of Applied Computer Science X - Univ.-Prof. Dr. Ruben Mayer
Language:	English
Title originated at UBT:	Yes
URN:	urn:nbn:de:bvb:703-epub-9365-2
Deposited on:	09 Jun 2026 07:18
Last modified:	09 Jun 2026 07:18
URI:	https://epub.uni-bayreuth.de/id/eprint/9365
