Channel: Hacker News

Show HN: PDD – Probabilistic De-Duplication of Streams with Bloom Filters


An implementation of "Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams" as described by Suman K. Bera, Sourav Dutta, Ankur Narang, and Souvik Bhattacherjee.

This library aims to be a production-oriented tool for probabilistically de-duplicating unbounded data streams in real-time streaming systems (e.g. Storm, Spark, Flink, and Samza) while staying within a fixed memory bound.

Accordingly, this library implements three novel Bloom Filter algorithms from the paper cited above. All three are shown there to converge towards stability faster than Stable Bloom Filters and to improve the false-negative rate (FNR) by a factor of 2 to 300 in comparison.

Downloads

Maven

<dependency>
  <groupId>com.github.jparkie</groupId>
  <artifactId>pdd</artifactId>
  <version>0.1.0</version>
</dependency>

Gradle

compile 'com.github.jparkie:pdd:0.1.0'

Usage

This library provides three implementations of a ProbabilisticDeDuplicator:

  1. Biased Sampling based Bloom Filter (BSBF).

  2. Biased Sampling based Bloom Filter with Single Deletion (BSBFSD).

  3. Randomized Load Balanced Biased Sampling based Bloom Filter (RLBSBF).
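To give a feel for what these names mean, here is a rough sketch of the BSBF idea based on one reading of the cited paper, not on PDD's source. It assumes the memory is split into one sub-filter per hash function, that an element counts as a duplicate only when every probed bit is already set, and that recording a distinct element clears one randomly chosen bit in each sub-filter before setting the probed bit. The class name BsbfSketch, the seed parameter, and the simple double hashing are all hypothetical stand-ins. The other two variants change how the deletion step chooses bits to clear.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Random;

// Illustrative sketch only: NOT the PDD implementation.
final class BsbfSketch {
    private final int numHashFunctions; // k: one sub-filter per hash function
    private final int bitsPerFilter;    // s: number of bits in each sub-filter
    private final BitSet[] filters;
    private final Random random;

    BsbfSketch(int totalBits, int numHashFunctions, long seed) {
        if (numHashFunctions <= 0 || totalBits < numHashFunctions) {
            throw new IllegalArgumentException("need at least one bit per hash function");
        }
        this.numHashFunctions = numHashFunctions;
        this.bitsPerFilter = totalBits / numHashFunctions;
        this.filters = new BitSet[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            filters[i] = new BitSet(bitsPerFilter);
        }
        this.random = new Random(seed);
    }

    // Read-only check: distinct when at least one probed bit is unset.
    boolean peekDistinct(byte[] element) {
        int[] positions = positions(element);
        for (int i = 0; i < numHashFunctions; i++) {
            if (!filters[i].get(positions[i])) {
                return true;
            }
        }
        return false;
    }

    // Same check, but a distinct element is also recorded.
    boolean classifyDistinct(byte[] element) {
        if (!peekDistinct(element)) {
            return false; // every probed bit is set: report a duplicate
        }
        int[] positions = positions(element);
        for (int i = 0; i < numHashFunctions; i++) {
            // Assumed deletion step: clear one random bit to make room,
            // then set the bit this element hashes to.
            filters[i].clear(random.nextInt(bitsPerFilter));
            filters[i].set(positions[i]);
        }
        return true;
    }

    // Stand-in hashing (hypothetical): two cheap hashes combined by double hashing.
    private int[] positions(byte[] element) {
        int h1 = Arrays.hashCode(element);
        int h2 = 0x811C9DC5; // FNV-1a offset basis
        for (byte b : element) {
            h2 ^= (b & 0xFF);
            h2 *= 0x01000193; // FNV-1a prime
        }
        int[] out = new int[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            out[i] = Math.floorMod(h1 + i * h2, bitsPerFilter);
        }
        return out;
    }
}
```

Because old bits are cleared at random as new elements arrive, the structure never fills up, which is what keeps memory fixed on an unbounded stream. The price is that a previously seen element can later be reported as distinct (a false negative).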

Basic

final long NUM_BITS = 8 * 8L * 1024L * 1024L;
ProbabilisticDeDuplicator deDuplicator = null;
// Creates an RLBSBFDeDuplicator with 8MB of RAM and a false-positive probability of 3%.
deDuplicator = RLBSBFDeDuplicator.create(NUM_BITS, 0.03D);
// Creates an RLBSBFDeDuplicator with 8MB of RAM and 5 hash functions.
deDuplicator = new RLBSBFDeDuplicator(NUM_BITS, 5);
// The number of bits that the ProbabilisticDeDuplicator should use.
// Output: 67108864
System.out.println(deDuplicator.numBits());
// The number of hash functions that the ProbabilisticDeDuplicator should use.
// Output: 5
System.out.println(deDuplicator.numHashFunctions());
// Probabilistically classifies whether a given element is a distinct or a duplicate element.
// This operation does record the result into its history.
// Output: true
System.out.println(deDuplicator.classifyDistinct("Hello".getBytes()));
// Output: false
System.out.println(deDuplicator.classifyDistinct("Hello".getBytes()));
// Probabilistically peeks whether a given element is a distinct or a duplicate element.
// This operation does not record the result into its history.
// Output: true
System.out.println(deDuplicator.peekDistinct("World".getBytes()));
// Output: true
System.out.println(deDuplicator.peekDistinct("World".getBytes()));
// Reset the history of the ProbabilisticDeDuplicator.
deDuplicator.reset();

Binary Serialization

PDD provides serializers for each ProbabilisticDeDuplicator implementation to write to and to read from a versioned binary format.

final RLBSBFDeDuplicatorSerializer serializer = new RLBSBFDeDuplicatorSerializer();
final RLBSBFDeDuplicator deDuplicator = new RLBSBFDeDuplicator(64L, 1);
final Random random = new Random();
final byte[] element = new byte[128];
random.nextBytes(element);
assertTrue(deDuplicator.classifyDistinct(element));
final ByteArrayOutputStream out = new ByteArrayOutputStream();
serializer.writeTo(deDuplicator, out);
out.close();
final ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
final RLBSBFDeDuplicator serialized = serializer.readFrom(in);
in.close();
assertEquals(deDuplicator, serialized);
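The README does not document the byte layout, so the following is only a sketch of what a versioned binary format can look like in general. The layout (one version byte, then the bit count, the hash-function count, and the raw words) and the names FilterState and VersionedStateCodec are assumptions made for illustration and are not taken from PDD.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical plain-data view of a filter's state.
final class FilterState {
    final long numBits;
    final int numHashFunctions;
    final long[] words;

    FilterState(long numBits, int numHashFunctions, long[] words) {
        this.numBits = numBits;
        this.numHashFunctions = numHashFunctions;
        this.words = words.clone();
    }
}

// Hypothetical codec: the leading version byte lets readers reject unknown layouts.
final class VersionedStateCodec {
    static final byte FORMAT_VERSION = 1;

    static byte[] write(FilterState state) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(FORMAT_VERSION);
            out.writeLong(state.numBits);
            out.writeInt(state.numHashFunctions);
            out.writeInt(state.words.length);
            for (long word : state.words) {
                out.writeLong(word);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static FilterState read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte version = in.readByte();
            if (version != FORMAT_VERSION) {
                throw new IllegalStateException("unsupported format version: " + version);
            }
            long numBits = in.readLong();
            int numHashFunctions = in.readInt();
            long[] words = new long[in.readInt()];
            for (int i = 0; i < words.length; i++) {
                words[i] = in.readLong();
            }
            return new FilterState(numBits, numHashFunctions, words);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Putting the version first means a later release could change the remaining fields and still recognise, or cleanly refuse, data written by an older one.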

Java Serialization

PDD overrides the default object serialization for each ProbabilisticDeDuplicator implementation.

final RLBSBFDeDuplicator deDuplicator = new RLBSBFDeDuplicator(64L, 1);
final Random random = new Random();
final byte[] element = new byte[128];
random.nextBytes(element);
assertTrue(deDuplicator.classifyDistinct(element));
final ByteArrayOutputStream out = new ByteArrayOutputStream();
final ObjectOutputStream oos = new ObjectOutputStream(out);
oos.writeObject(deDuplicator);
oos.close();
out.close();
final ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
final ObjectInputStream ois = new ObjectInputStream(in);
final RLBSBFDeDuplicator serialized = (RLBSBFDeDuplicator) ois.readObject();
ois.close();
in.close();
assertEquals(deDuplicator, serialized);
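For readers unfamiliar with how a class overrides default object serialization, the standard Java mechanism is a pair of private writeObject and readObject methods. The sketch below shows that mechanism on a hypothetical CompactFilterState class. It does not reproduce PDD's classes, and the fields it writes are assumptions chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.BitSet;

// Hypothetical class: fields are transient, so only the custom hooks write them.
final class CompactFilterState implements Serializable {
    private static final long serialVersionUID = 1L;

    private transient int numHashFunctions;
    private transient BitSet bits;

    CompactFilterState(int numHashFunctions, BitSet bits) {
        this.numHashFunctions = numHashFunctions;
        this.bits = (BitSet) bits.clone();
    }

    // Called by ObjectOutputStream instead of the default field-by-field write.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(numHashFunctions);
        long[] words = bits.toLongArray();
        out.writeInt(words.length);
        for (long word : words) {
            out.writeLong(word);
        }
    }

    // Called by ObjectInputStream; must read fields in the order they were written.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        numHashFunctions = in.readInt();
        long[] words = new long[in.readInt()];
        for (int i = 0; i < words.length; i++) {
            words[i] = in.readLong();
        }
        bits = BitSet.valueOf(words);
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof CompactFilterState)) {
            return false;
        }
        CompactFilterState that = (CompactFilterState) other;
        return numHashFunctions == that.numHashFunctions && bits.equals(that.bits);
    }

    @Override
    public int hashCode() {
        return 31 * numHashFunctions + bits.hashCode();
    }

    // Serializes and deserializes in memory; checked exceptions are wrapped.
    static CompactFilterState roundTrip(CompactFilterState state) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(state);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CompactFilterState) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Writing the state by hand keeps the serialized form independent of the in-memory field layout, which is one common reason to override the default.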

Build

$ git clone https://github.com/jparkie/PDD.git
$ cd PDD/
$ ./gradlew build

References

Bera, S.K., Dutta, S., Narang, A., Bhattacherjee, S.: Advanced Bloom filter based algorithms for efficient approximate data de-duplication in streams (2012)

