Consistent Subset Sampling
From MaRDI portal
Publication:3188904
DOI10.1007/978-3-319-08404-6_26zbMATH Open1417.68142arXiv1404.4693OpenAlexW2131999824MaRDI QIDQ3188904
Rasmus Pagh, Konstantin Kutzkov
Publication date: 2 September 2014
Published in: Algorithm Theory – SWAT 2014 (Search for Journal in Brave)
Abstract: Consistent sampling is a technique for specifying, in small space, a subset of a potentially large universe such that the elements in satisfy a suitably chosen sampling condition. Given a subset it should be possible to quickly compute , i.e., the elements in satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size- subsets occurring in some set in a collection of sets of bounded size , where is a small integer. This can be done by applying standard consistent sampling to the -subsets of each set, but that approach requires time . Using a carefully designed hash function, for a given sampling probability , we show how to improve the time complexity to in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is . We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent -itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.
Full work available at URL: https://arxiv.org/abs/1404.4693
Theory of data (68P99) Probability in computer science (algorithm analysis, random structures, phase transitions, etc.) (68Q87)
Related Items (2)
This page was built for publication: Consistent Subset Sampling
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q3188904)