Pareto-optimal data compression for binary classification tasks

From MaRDI portal
Publication:6324109

arXiv1908.08961MaRDI QIDQ6324109

Author name not available (Why is that?)

Publication date: 23 August 2019

Abstract: The goal of lossy data compression is to reduce the storage cost of a data set X while retaining as much information as possible about something (Y) that you care about. For example, what aspects of an image X contain the most information about whether it depicts a cat? Mathematically, this corresponds to finding a mapping XoZequivf(X) that maximizes the mutual information I(Z,Y) while the entropy H(Z) is kept below some fixed threshold. We present a method for mapping out the Pareto frontier for classification tasks, reflecting the tradeoff between retained entropy and class information. We first show how a random variable X (an image, say) drawn from a class Yin1,...,n can be distilled into a vector W=f(X)inmathbbRn1 losslessly, so that I(W,Y)=I(X,Y); for example, for a binary classification task of cats and dogs, each image X is mapped into a single real number W retaining all information that helps distinguish cats from dogs. For the n=2 case of binary classification, we then show how W can be further compressed into a discrete variable by binning W into bins, in such a way that varying the parameter sweeps out the full Pareto frontier, solving a generalization of the Discrete Information Bottleneck (DIB) problem. We argue that the most interesting points on this frontier are "corners" maximizing I(Z,Y) for a fixed number of bins m=2,3... which can be conveniently be found without multiobjective optimization. We apply this method to the CIFAR-10, MNIST and Fashion-MNIST datasets, illustrating how it can be interpreted as an information-theoretically optimal image clustering algorithm.




Has companion code repository: https://github.com/tailintalent/distillation








This page was built for publication: Pareto-optimal data compression for binary classification tasks

Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q6324109)