Deprecated: $wgMWOAuthSharedUserIDs=false is deprecated, set $wgMWOAuthSharedUserIDs=true, $wgMWOAuthSharedUserSource='local' instead [Called from MediaWiki\HookContainer\HookContainer::run in /var/www/html/w/includes/HookContainer/HookContainer.php at line 135] in /var/www/html/w/includes/Debug/MWDebug.php on line 372
Consistent Subset Sampling - MaRDI portal

Consistent Subset Sampling

From MaRDI portal
Publication:3188904

DOI10.1007/978-3-319-08404-6_26zbMATH Open1417.68142arXiv1404.4693OpenAlexW2131999824MaRDI QIDQ3188904

Rasmus Pagh, Konstantin Kutzkov

Publication date: 2 September 2014

Published in: Algorithm Theory – SWAT 2014 (Search for Journal in Brave)

Abstract: Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset mathcalIsubseteqU it should be possible to quickly compute mathcalIcapS, i.e., the elements in mathcalI satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-k subsets occurring in some set in a collection of sets of bounded size b, where k is a small integer. This can be done by applying standard consistent sampling to the k-subsets of each set, but that approach requires time Theta(bk). Using a carefully designed hash function, for a given sampling probability pin(0,1], we show how to improve the time complexity to Theta(blceilk/2ceilloglogb+pbk) in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is Theta(blceilk/4ceil). We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent k-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.


Full work available at URL: https://arxiv.org/abs/1404.4693






Related Items (2)






This page was built for publication: Consistent Subset Sampling

Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q3188904)