Implementation of Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams as described by Suman K. Bera, Sourav Dutta, Ankur Narang, and Souvik Bhattacherjee.
This library aims to be a production-oriented tool for probabilistically de-duplicating unbounded data streams in real-time streaming systems (e.g. Storm, Spark, Flink, and Samza) within a fixed memory bound.
Accordingly, it implements the three novel Bloom Filter algorithms from that paper, all of which are shown there to converge to stability faster and to improve false-negative rates (FNR) by 2 to 300 times compared with Stable Bloom Filters.
Downloads
Maven
<dependency>
  <groupId>com.github.jparkie</groupId>
  <artifactId>pdd</artifactId>
  <version>0.1.0</version>
</dependency>
Gradle
compile 'com.github.jparkie:pdd:0.1.0'
Usage
This library provides three implementations of a ProbabilisticDeDuplicator:
Biased Sampling based Bloom Filter (BSBF).
Biased Sampling based Bloom Filter with Single Deletion (BSBFSD).
Randomized Load Balanced Biased Sampling based Bloom Filter (RLBSBF).
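As a rough intuition for how this family of filters keeps memory fixed, the sketch below splits a bit array into sub-filters, reports a duplicate when every probed bit is already set, and otherwise clears one randomly chosen bit per sub-filter before recording the new element. It is not PDD's implementation: the class name ToyBiasedSamplingFilter, the hash function, and the exact deletion policy are assumptions made for illustration, and the paper should be consulted for the precise rules of each variant.

```java
import java.util.BitSet;
import java.util.Random;

/** Toy illustration only; names and policies here are hypothetical, not PDD's. */
final class ToyBiasedSamplingFilter {
    private final int numFilters;    // one hash function per sub-filter
    private final int bitsPerFilter; // total bits divided evenly among sub-filters
    private final BitSet bits;
    private final Random random;

    ToyBiasedSamplingFilter(int totalBits, int numFilters, long seed) {
        this.numFilters = numFilters;
        this.bitsPerFilter = totalBits / numFilters;
        this.bits = new BitSet(this.bitsPerFilter * numFilters);
        this.random = new Random(seed);
    }

    /** Returns true if the element is reported distinct, and records it. */
    boolean classifyDistinct(byte[] element) {
        int[] positions = new int[numFilters];
        boolean allSet = true;
        for (int i = 0; i < numFilters; i++) {
            positions[i] = i * bitsPerFilter + index(element, i);
            allSet &= bits.get(positions[i]);
        }
        if (allSet) {
            return false; // every probed bit is set: report a duplicate
        }
        // Assumed deletion policy: clear one random bit in each sub-filter
        // so that old history fades and the load stays bounded.
        for (int i = 0; i < numFilters; i++) {
            bits.clear(i * bitsPerFilter + random.nextInt(bitsPerFilter));
        }
        // Record the new element after the deletions.
        for (int position : positions) {
            bits.set(position);
        }
        return true;
    }

    // Hypothetical seeded hash; a real implementation would use a stronger one.
    private int index(byte[] element, int seed) {
        int h = 0x9E3779B9 * (seed + 1);
        for (byte b : element) {
            h = 31 * h + b;
        }
        return Math.floorMod(h, bitsPerFilter);
    }
}
```

Because the hashed bits are set after the random deletions, classifying the same element twice in a row yields distinct and then duplicate. Older elements can later be forgotten, which is the source of false negatives in this style of filter.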
Basic
final long NUM_BITS = 8 * 8L * 1024L * 1024L;

ProbabilisticDeDuplicator deDuplicator = null;

// Creates a RLBSBFDeDuplicator with 8MB of RAM and false-positive probability at 3%.
deDuplicator = RLBSBFDeDuplicator.create(NUM_BITS, 0.03D);

// Creates a RLBSBFDeDuplicator with 8MB of RAM and 5 hashing functions.
deDuplicator = new RLBSBFDeDuplicator(NUM_BITS, 5);

// The number of bits that the ProbabilisticDeDuplicator should use.
// Output: 67108864
System.out.println(deDuplicator.numBits());

// The number of hash functions that the ProbabilisticDeDuplicator should use.
// Output: 5
System.out.println(deDuplicator.numHashFunctions());

// Probabilistically classifies whether a given element is a distinct or a duplicate element.
// This operation does record the result into its history.
// Output: true
System.out.println(deDuplicator.classifyDistinct("Hello".getBytes()));
// Output: false
System.out.println(deDuplicator.classifyDistinct("Hello".getBytes()));

// Probabilistically peeks whether a given element is a distinct or a duplicate element.
// This operation does not record the result into its history.
// Output: true
System.out.println(deDuplicator.peekDistinct("World".getBytes()));
// Output: true
System.out.println(deDuplicator.peekDistinct("World".getBytes()));

// Reset the history of the ProbabilisticDeDuplicator.
deDuplicator.reset();
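The NUM_BITS constant above is a memory budget expressed in bits: 8 bits per byte times 8 × 1024 × 1024 bytes. A small helper such as the following may make that conversion explicit when sizing a filter; MemoryBudget and bitsForMebibytes are hypothetical names and not part of PDD.

```java
/** Hypothetical helper, not part of PDD: converts a memory budget to a bit count. */
final class MemoryBudget {
    private static final long BITS_PER_MEBIBYTE = 8L * 1024L * 1024L;

    /** Returns the number of bits in the given number of mebibytes, failing on overflow. */
    static long bitsForMebibytes(long mebibytes) {
        return Math.multiplyExact(mebibytes, BITS_PER_MEBIBYTE);
    }
}
```

With this helper, MemoryBudget.bitsForMebibytes(8) gives 67108864, the same value as NUM_BITS.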
Binary Serialization
PDD provides serializers for each ProbabilisticDeDuplicator implementation to write to and to read from a versioned binary format.
final RLBSBFDeDuplicatorSerializer serializer = new RLBSBFDeDuplicatorSerializer();
final RLBSBFDeDuplicator deDuplicator = new RLBSBFDeDuplicator(64L, 1);

// Record one random element so the serialized state is non-empty.
final Random random = new Random();
final byte[] element = new byte[128];
random.nextBytes(element);
assertTrue(deDuplicator.classifyDistinct(element));

// Write the de-duplicator to the binary format.
final ByteArrayOutputStream out = new ByteArrayOutputStream();
serializer.writeTo(deDuplicator, out);
out.close();

// Read it back and check that the state round-trips.
final ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
final RLBSBFDeDuplicator serialized = serializer.readFrom(in);
in.close();

assertEquals(deDuplicator, serialized);
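The actual byte layout used by PDD's serializers is not described here. Purely to illustrate what a versioned binary format generally involves, the sketch below writes a version number ahead of the payload and rejects unknown versions on read. The class VersionedFormatSketch, its field order, and the choice of a long[] payload are all assumptions, not PDD's format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical layout for illustration: [int version][int wordCount][long words...]. */
final class VersionedFormatSketch {
    static final int FORMAT_VERSION = 1;

    /** Encodes the words behind a version header. */
    static byte[] toBytes(long[] words) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeInt(FORMAT_VERSION);
            data.writeInt(words.length);
            for (long word : words) {
                data.writeLong(word);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decodes the words, refusing input written with a different version. */
    static long[] fromBytes(byte[] encoded) {
        try (DataInputStream data = new DataInputStream(new ByteArrayInputStream(encoded))) {
            int version = data.readInt();
            if (version != FORMAT_VERSION) {
                throw new IllegalStateException("Unsupported format version: " + version);
            }
            long[] words = new long[data.readInt()];
            for (int i = 0; i < words.length; i++) {
                words[i] = data.readLong();
            }
            return words;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Leading with a version field lets a reader detect data written by an incompatible release before interpreting the rest of the stream.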
Java Serialization
PDD overrides the default object serialization for each ProbabilisticDeDuplicator implementation.
final RLBSBFDeDuplicator deDuplicator = new RLBSBFDeDuplicator(64L, 1);

// Record one random element so the serialized state is non-empty.
final Random random = new Random();
final byte[] element = new byte[128];
random.nextBytes(element);
assertTrue(deDuplicator.classifyDistinct(element));

// Serialize with standard Java object serialization.
final ByteArrayOutputStream out = new ByteArrayOutputStream();
final ObjectOutputStream oos = new ObjectOutputStream(out);
oos.writeObject(deDuplicator);
oos.close();
out.close();

// Deserialize and check that the state round-trips.
final ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
final ObjectInputStream ois = new ObjectInputStream(in);
final RLBSBFDeDuplicator serialized = (RLBSBFDeDuplicator) ois.readObject();
ois.close();
in.close();

assertEquals(deDuplicator, serialized);
Build
$ git clone https://github.com/jparkie/PDD.git
$ cd PDD/
$ ./gradlew build
References
Bera, S.K., Dutta, S., Narang, A., Bhattacherjee, S.: Advanced Bloom filter based algorithms for efficient approximate data de-duplication in streams (2012)