Consistent Subset Sampling

DOI: 10.1007/978-3-319-08404-6_26; zbMATH Open: 1417.68142; arXiv: 1404.4693; OpenAlex: W2131999824; MaRDI QID: Q3188904

Publication date: 2 September 2014

Published in: Algorithm Theory – SWAT 2014

Abstract: Consistent sampling is a technique for specifying, in small space, a subset $S$ of a potentially large universe $U$ such that the elements in $S$ satisfy a suitably chosen sampling condition. Given a subset $\mathcal{I} \subseteq U$ it should be possible to quickly compute $\mathcal{I} \cap S$, i.e., the elements in $\mathcal{I}$ satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-$k$ subsets occurring in some set in a collection of sets of bounded size $b$, where $k$ is a small integer. This can be done by applying standard consistent sampling to the $k$-subsets of each set, but that approach requires time $\Theta(b^{k})$. Using a carefully designed hash function, for a given sampling probability $p \in (0, 1]$, we show how to improve the time complexity to $\Theta(b^{\lceil k/2 \rceil} \log\log b + p b^{k})$ in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is $\Theta(b^{\lceil k/4 \rceil})$. We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent $k$-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.
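For orientation, the baseline mentioned in the abstract (standard consistent sampling applied to every $k$-subset of a set, at cost $\Theta(b^{k})$ per set) might be sketched as below. This is an illustrative sketch only, not the paper's improved hash-based method. The names `unit_hash`, `consistent_sample` and `sample_k_subsets`, the use of SHA-256 as the hash, and the assumption that items are sortable with a stable `repr` are all choices made for this example.

```python
import hashlib
from itertools import combinations


def unit_hash(item, seed=0):
    """Map an item to a deterministic pseudo-random value in [0, 1).

    Assumption: repr(item) is stable, so equal items always hash alike.
    """
    data = f"{seed}:{item!r}".encode("utf-8")
    digest = hashlib.sha256(data).digest()
    # Keep 53 bits so the quotient is an exact float strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def consistent_sample(items, p, seed=0):
    """Return the items whose hash value falls below p.

    The decision depends only on the item itself, so for any I within U
    the result on I equals the result on U intersected with I.
    """
    return {x for x in items if unit_hash(x, seed) < p}


def sample_k_subsets(transaction, k, p, seed=0):
    """Baseline: enumerate all k-subsets of one set and sample each.

    Runs in time proportional to the number of k-subsets, i.e.
    Theta(b^k) for a set of size b and constant k.
    """
    canon = sorted(transaction)  # canonical order, so each subset has one tuple form
    return {c for c in combinations(canon, k) if unit_hash(c, seed) < p}
```

Because the sampling decision is a function of the subset alone, a $k$-subset that occurs in several sets is either sampled in all of them or in none, which is the consistency property the abstract relies on.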

Full work available at URL: https://arxiv.org/abs/1404.4693

Mathematics Subject Classification ID

Theory of data (68P99); Probability in computer science (algorithm analysis, random structures, phase transitions, etc.) (68Q87)