SWOOP: top-k similarity joins over set streams

VLDB J. 2025;34(1):13. doi: 10.1007/s00778-024-00880-x. Epub 2024 Dec 23.

Abstract

We provide efficient support for applications that aim to continuously find pairs of similar sets in rapid streams, such as Twitter streams that emit tweets as sets of words. Using a sliding window model, the top-k result changes as new sets enter the window or existing ones leave the window. Specifically, when a set arrives, it may form a new top-k result pair with any set already in the window. When a set leaves the window, all its pairings in the top-k result must be replaced with other pairs. It is therefore not sufficient to maintain the k most similar pairs since less similar pairs may become top-k pairs later. We propose SWOOP, a highly scalable stream join algorithm. Novel indexing techniques and sophisticated filters efficiently prune obsolete pairs as new sets enter the window. SWOOP incrementally maintains a provably minimal stock of similar pairs to update the top-k result at any time. Empirical studies confirm that SWOOP is able to support stream rates that are orders of magnitude faster than the rates supported by existing approaches.

Keywords: Set similarity; Streaming data; Top-k similarity join.