Min-hash

min-hash is a signature of all keys in a SSTable and is used to optimize the compaction. E.g., find out a pair of SSTable with most common keys.

Jaccard similarity

Let U be a set and A and B be subsets of U, then the Jaccard index is defined to be the ratio of the number of elements of their intersection and the number of elements of their union:

Let h be a hash function that maps the members of U to distinct integers, e.g., let h(x) = SHA1(x).

For any set S, define H_min(S) to be the minimal integer(hash value) of members in S:

H_min(S) = min({h(x) | x ∈ S})

Now, applying H_min to both A and B, and assuming no hash collisions, the probability that H_min(A) == H_min(B) is true is equal to the similarity J(A,B):

Pr[ H_min(A) = H_min(B) ] = J(A,B)

Estimate the number of common keys

Given two SSTable, the number of common keys can be calculated with:

Estimate the Probability

The probability can be estimiated with y/k, where k is the number of different hash functions and y is the number of hash functions hᵢ for which hᵢ(A) == hᵢ(B)

Generate signature for a SSTable


#![allow(unused)]
fn main() {
fn gen_sstable_signature(sstable: &SSTable, number_of_hashes: usize) -> Vec<u64>{

    // Result signature
    let signature = Vec::with_capacity(number_of_hashes);

    // Distribute hash values to `k` buckets to simulate `k` hash functions.
    let buckets = Vec::with_capacity(number_of_hashes);

    for k in sstable.keys() {
        let h = SHA1(k) as u64;
        buckets[h%number_of_hashes].push(h);
    }

    for i in 0..number_of_hashes {
        signature[i] = min(buckets[i]);
    }

    signature
}
}

Calculate similarity of two SSTable


#![allow(unused)]

fn main() {
fn calc_sig(a: &SSTable, b:&SSTable) -> f64 {
    let eq = 0.0;
    for i in 0..number_of_hashes:
        if a.signature[i] == a.signature[i] {
            eq += 1
        }
    eq / number_of_hashes

}
}

Simulation

We provides a python script min-hash.py to estimate the accuracy of this algo.

Configvalue
Number of hash functions(buckets)128
Hash valueu64
Space costsizeof(u64) * 128 = 1KB

Actual vs Estimated:

|A∪B||A||B|Actual (A∩B)/(A∪B)%Estimated%error%
100036084020.00%21.88%1.87%
100052088040.00%38.28%-1.72%
100068092060.00%60.94%0.94%
100083995980.16%78.91%-1.25%
100010001000100.00%100.00%0.00%
100003600840020.00%15.62%-4.38%
100005200880040.00%35.16%-4.84%
100006800920060.00%60.94%0.94%
100008399959980.02%85.16%5.14%
100001000010000100.00%100.00%0.00%
100000360008400020.00%21.88%1.87%
100000520008800040.00%47.66%7.66%
100000680009200060.00%62.50%2.50%
100000839999599980.00%80.47%0.47%
100000100000100000100.00%100.00%0.00%
100000036000084000020.00%19.53%-0.47%
100000052000088000040.00%40.62%0.62%
100000068000092000060.00%58.59%-1.41%
100000083999995999980.00%75.78%-4.22%
100000010000001000000100.00%100.00%0.00%

Optimize compaction with min-hash

With min-hash we can find the best choice to compact.

For every SSTable, calculate the Jaccard index of it and the overlapping SSTable at the lower level.

Push down the one with the max Jaccard index.

  • The space cost is negligible. Only 1KB more space for every SSTable.

  • The time complexity is O(n), where n is the number of SSTable in the system. Because for every SSTable, there are about only k SSTable that have overlapping key range with it.where k is the level fanout.