LLMZip: Lossless Text Compression using Large Language Models

arXiv: 2306.04050 · MaRDI QID: Q6439454

Author name not available

Publication date: 6 June 2023

Abstract: We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. This estimate is significantly smaller than currently available estimates in [cover1978convergent] and [lutati2023focus]. A natural byproduct is an algorithm for lossless compression of English text which combines the prediction from the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
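The scheme described in the abstract rests on one quantity: the model's probability of each actual next token given a window of past tokens. Summing -log2 of these probabilities and dividing by the number of characters gives the per-character entropy estimate, and feeding the same probabilities to an entropy coder gives the compressor. The following is a minimal illustrative sketch, not the authors' implementation (see the companion repository for that); it uses a generic Hugging Face causal LM, and "gpt2" is only a placeholder for the LLaMA-7B model the paper actually uses.

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "gpt2"  # placeholder; the paper uses LLaMA-7B
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    model.eval()

    def bits_per_character(text: str, window: int = 511) -> float:
        # Upper-bound entropy estimate in bits per character:
        # sum of -log2 p(token_i | up to `window` preceding tokens),
        # divided by the number of characters in the text.
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        total_bits = 0.0
        with torch.no_grad():
            for i in range(1, len(ids)):
                context = ids[max(0, i - window):i].unsqueeze(0)
                logits = model(context).logits[0, -1]           # next-token logits
                log_p = torch.log_softmax(logits, dim=-1)[ids[i]].item()
                total_bits -= log_p / math.log(2.0)             # nats -> bits
        return total_bits / len(text)

Pairing the same per-token probabilities with a lossless entropy coder (for example an arithmetic coder) yields a compressor whose output length approaches this bound, which is the combination of LLM prediction and lossless coding that the abstract describes.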

Has companion code repository: https://github.com/vcskaushik/LLMzip