MemTable

MemTable in LSM is the in-memory representation of key-value pairs.

It can be considered a BTreeMap.

There is only one active MemTable for every openkv instance. The MemTable is shared by all sub-LSMs.

Flush

Flush MemTable to SSTable.

If the MemTable becomes too large, flush one continuous portion (a span) of it into a sub-LSM.

The MemTable always flushes the biggest span, i.e., the one with the most keys, unless there is pressure to reclaim WAL space.
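The flush policy above can be sketched as follows. `Span` and its fields are hypothetical stand-ins for a contiguous key range in the MemTable; the doc does not specify the real types.

```rust
// Sketch of the flush policy; `Span` is an illustrative stand-in for a
// contiguous key range in the MemTable.
struct Span {
    key_count: usize,
    wal_bytes: usize, // WAL space pinned by this span's entries
}

/// Pick the span to flush: under WAL pressure, the one pinning the most
/// WAL space; otherwise the one with the most keys.
fn pick_flush_span(spans: &[Span], wal_pressure: bool) -> Option<usize> {
    if wal_pressure {
        (0..spans.len()).max_by_key(|&i| spans[i].wal_bytes)
    } else {
        (0..spans.len()).max_by_key(|&i| spans[i].key_count)
    }
}

fn main() {
    let spans = vec![
        Span { key_count: 10, wal_bytes: 900 },
        Span { key_count: 50, wal_bytes: 100 },
    ];
    // Normally: flush the span with the most keys.
    assert_eq!(pick_flush_span(&spans, false), Some(1));
    // Under WAL pressure: flush the span pinning the most WAL space.
    assert_eq!(pick_flush_span(&spans, true), Some(0));
}
```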

SSTable

An SSTable (sorted string table) is the same concept as in LevelDB or RocksDB.

The size of an SSTable can only be 4MB (for a tombstone-only SSTable) or 16MB (for a regular SSTable).

The data layout inside an SSTable:

Header | Metas | Index | Records | Checksum

The Header is fixed-size and contains the SSTable format version, etc.

Metas

Metas stores the configs of this SSTable, such as key types and value types. E.g., a fixed-size key or value may use a type-specific storage format.

A fixed-size key does not need to store a key-suffix in the Records segment. TODO: prove it.

Metas also stores the cardinality signature of all keys to optimize SSTable compaction.

Index

The Index is a trie built from all keys stored in this SSTable, with single-branch tails removed. A leaf node in the trie stores the offset of the corresponding record in Records.

It contains enough info to locate a present key in the Records segment, but there can be false positives for keys not in Records.

For example, given 3 key-values:

fantastice: 1
foo: 5
food: 3

The Index would be like this:

f ─┬─ a → offset of fantastice
   └─ o ── o ─┬─ (end of key) → offset of foo
              └─ d            → offset of food

Caching

The Index is preferred to reside in memory, while the Records segment is left on disk.

Sparse Index

Not all keys in the Index have a corresponding offset stored. Since IO is the dominant cost, reading several bytes costs almost the same as reading several kilobytes. Thus the Index only needs to store one offset for several adjacent keys.

We just need to limit the distance between two adjacent offsets in the Index to be less than an expected value, e.g., 16KB.
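Offset placement under that limit can be sketched like this (a minimal sketch; `sparse_offsets` and its inputs are illustrative, not the real format):

```rust
// Sketch of sparse-offset placement: store an offset only when more than
// `max_gap` bytes of records have accumulated since the previously
// stored offset, so adjacent Index offsets are at most `max_gap` apart.
fn sparse_offsets(record_sizes: &[usize], max_gap: usize) -> Vec<usize> {
    let mut offsets = vec![0]; // the first record is always indexed
    let mut offset = 0;
    let mut since_last = 0;
    for &size in record_sizes {
        if since_last + size > max_gap {
            // Start a new indexed group at the current record.
            offsets.push(offset);
            since_last = 0;
        }
        offset += size;
        since_last += size;
    }
    offsets
}

fn main() {
    // Four 4-byte records with an 8-byte gap limit: offsets at 0 and 8.
    assert_eq!(sparse_offsets(&[4, 4, 4, 4], 8), vec![0, 8]);
}
```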

Records

The Records segment stores the key-suffixes that are not included in the Index, and the values.

Records is indexed by the offsets stored in the Index.

  • key-suffix may be empty.
  • seq is a monotonic sequence number.
  • value is var-len serialized bytes.
key-suffix | key-suffix | ... | seq | seq | ... | value | value | ...

Checksum

SHA256 of all preceding bytes in the SSTable.

Positive and Negative SSTable

One flush builds 2 SSTables:

  • One of them contains all inserted or updated records: the Positive SSTable.
  • The other one contains only deleted records, i.e., tombstones: the Negative SSTable.

Both have the same level and the same key range. In other words, we store updated records and deleted records separately.

Records part in a P-SSTable:              Records part in an N-SSTable:
key-suffix | key-suffix | ...             key-suffix | key-suffix | ...
seq        | seq        | ...             seq        | seq        | ...
value      | value      | ...

By separating P/N records, it is possible to compact an N-SSTable down to reclaim space quickly.

E.g., a Negative SSTable N1 can be pushed down and merged with N2 and N3, without touching P1, because P1 is transparent to N1: N1 ∩ P1 = ∅

level i:    N1  P1
level i-1:  N2  P2  N3  P3

Reading a record

  • Search the key in the Index. If it matches a path in the trie, scan Records from the offset up to the next offset to find the record.

  • If the key-suffix also matches the search key, return the value.
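The verification step can be sketched as below. `Record` and `find_in_range` are illustrative names; the Index lookup itself is omitted and assumed to have narrowed the search to one offset range.

```rust
// Sketch of the read path's last step: the Index narrows the search to
// one range of Records; comparing stored key-suffixes against the
// searched one filters out the trie's false positives.
struct Record<'a> {
    key_suffix: &'a str,
    value: u64,
}

/// Scan the records between two adjacent Index offsets and return the
/// value only if a stored key-suffix really matches the searched one.
fn find_in_range(records: &[Record<'_>], want_suffix: &str) -> Option<u64> {
    records
        .iter()
        .find(|r| r.key_suffix == want_suffix)
        .map(|r| r.value)
}

fn main() {
    let records = [
        Record { key_suffix: "o", value: 5 },  // "foo"
        Record { key_suffix: "od", value: 3 }, // "food"
    ];
    assert_eq!(find_in_range(&records, "od"), Some(3));
    // A key that reached this range via a trie false positive: no match.
    assert_eq!(find_in_range(&records, "ol"), None);
}
```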

Performance considerations

  • A trie naturally compresses prefixes, and an SSTable only stores a small range of keys, thus the space cost is low enough.

    For a static trie, a succinct data structure can be adopted to reduce its size.

    Our budget is several bytes per key, so that keeping the Index in memory is possible and unnecessary IO can be reduced.

  • A trie is not cache-friendly, thus a single trie should not be very large. The expected time cost of a query in the Index is less than 200 ns.

Virtual SSTable

A VSSTable (virtual SSTable) is the unit for managing SSTables, because an SSTable may be referenced more than once, e.g., when splitting a span.

before split:  V1→T1  V2→T2         V3→T3  V4→T4  V5→T5
after split:   V1→T1  V2→T2  V2'→T2  V3→T3  V4→T4  V5→T5

An SSTable is removed when there is no VSSTable referencing it.

A VSSTable has a key range that is a subset of the key range of the SSTable it references.

struct VSSTable {
    // Key range covered by this virtual table; a subset of the
    // referenced SSTable's key range.
    key_start: String,
    key_end: String,
    // The physical SSTable this virtual table references.
    sstable: SSTableId,
}
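The removal rule can be sketched with a reference count per physical SSTable. `SSTableRegistry` and its methods are hypothetical names, not part of openkv:

```rust
use std::collections::HashMap;

type SSTableId = u64;

// Sketch of SSTable lifetime management: track how many VSSTables
// reference each physical SSTable and remove the SSTable when the
// count drops to zero.
#[derive(Default)]
struct SSTableRegistry {
    refcounts: HashMap<SSTableId, usize>,
}

impl SSTableRegistry {
    fn add_vsstable(&mut self, id: SSTableId) {
        *self.refcounts.entry(id).or_insert(0) += 1;
    }

    /// Drop one reference; returns true if the SSTable can now be removed.
    fn drop_vsstable(&mut self, id: SSTableId) -> bool {
        match self.refcounts.get_mut(&id) {
            Some(n) if *n > 1 => {
                *n -= 1;
                false
            }
            Some(_) => {
                self.refcounts.remove(&id);
                true
            }
            None => false,
        }
    }
}

fn main() {
    let mut reg = SSTableRegistry::default();
    // Splitting a span leaves two VSSTables referencing the same SSTable.
    reg.add_vsstable(2);
    reg.add_vsstable(2);
    assert!(!reg.drop_vsstable(2)); // still referenced once
    assert!(reg.drop_vsstable(2)); // last reference gone: remove it
}
```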

Separate value

Flush to SSTable with separated values:     After k3 is removed at level-0 by compaction:

k1 → bar    (small value stored inline)     k1 → bar
k2 → vid2                                   k2 → vid2
k3 → vid3                                   (k3 removed; a Tombstone is left for vid3)

Separated values:                           Separated values:
vid2 → v2                                   vid2 → v2
vid3 → v3                                   vid3 → Tombstone
...                                         ...
vidi → vi                                   vidi → vi
vidj → vj                                   vidj → vj
...                                         ...

Sharding

One of the performance issues of an LSM tree is write/read amplification. As a db grows, the number of levels (l) in an LSM increases, in logarithmic order.

A record is rewritten (by compaction) l times before it reaches the bottom level. Assuming the fanout of every level is n, every record amplifies write IO by O(l · n).

By splitting the LSM into several smaller ones, l becomes smaller and the write/read amplification is reduced.
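A back-of-the-envelope sketch of the effect, with illustrative sizes (measured in records; the specific numbers are assumptions, not openkv's configuration):

```rust
// Level count of a leveled LSM where each level is `fanout` times larger
// than the one above it: l = ceil(log_fanout(total / top_level)).
fn levels(total: f64, top_level: f64, fanout: f64) -> f64 {
    ((total / top_level).log2() / fanout.log2()).ceil()
}

fn main() {
    // One big LSM holding 1024 units: 5 levels at fanout 4.
    assert_eq!(levels(1024.0, 1.0, 4.0), 5.0);
    // Split into 16 sub-LSMs of 64 units each: only 3 levels each,
    // so each record is rewritten fewer times on its way down.
    assert_eq!(levels(64.0, 1.0, 4.0), 3.0);
}
```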

Thus openkv organizes its data in a way that resembles a two-level BTree:

  • The BTree root node is an array of all sub-LSMs, sorted in key order.
  • Each of the leaf nodes is a small LSM.
root:     (-oo, b]      [b, e)        [e, +oo)
sub-LSM:  L2 L1 L0      L2 L1 L0      L2 L1 L0

Sub-LSM

Sub-LSM is a small LSM tree, with a limited number of levels.

Split and Merge

  • A sub-LSM will be split if its number of levels exceeds a threshold (e.g., 3).

  • Two adjacent sub-LSMs are merged into one if both of them shrink below 1/3 of the threshold.

Compaction

Push an SSTable down and merge it with all overlapping SSTables at the next lower level.

When merging, keep only the record with the greatest seq for each key.

When reaching the bottom level, i.e., level-0, a tombstone record can be removed for good.
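The merge rule can be sketched as follows. `Rec` is a simplified stand-in for a record, with a tombstone modeled as `None`:

```rust
use std::collections::BTreeMap;

// Sketch of the compaction merge: for duplicate keys only the record
// with the greatest seq survives, and at the bottom level (level-0)
// surviving tombstones are dropped for good.
#[derive(Clone, Debug, PartialEq)]
struct Rec {
    key: String,
    seq: u64,
    value: Option<String>, // None = tombstone
}

fn merge(inputs: Vec<Rec>, bottom_level: bool) -> Vec<Rec> {
    let mut newest: BTreeMap<String, Rec> = BTreeMap::new();
    for r in inputs {
        let keep = match newest.get(&r.key) {
            Some(cur) => r.seq > cur.seq,
            None => true,
        };
        if keep {
            newest.insert(r.key.clone(), r);
        }
    }
    newest
        .into_values()
        .filter(|r| !(bottom_level && r.value.is_none()))
        .collect()
}

fn main() {
    let rec = |k: &str, seq, v: Option<&str>| Rec {
        key: k.into(),
        seq,
        value: v.map(|s| s.into()),
    };
    let merged = merge(
        vec![rec("a", 7, None), rec("a", 3, Some("old")), rec("b", 1, Some("x"))],
        true, // merging into level-0
    );
    // "a" keeps only seq=7, a tombstone, which is dropped at level-0.
    assert_eq!(merged, vec![rec("b", 1, Some("x"))]);
}
```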

Min-hash

Min-hash is a signature of all keys in an SSTable and is used to optimize compaction, e.g., to find the pair of SSTables with the most common keys.

Jaccard similarity

Let U be a set and A and B be subsets of U, then the Jaccard index is defined to be the ratio of the number of elements of their intersection to the number of elements of their union:

J(A,B) = |A ∩ B| / |A ∪ B|

Let h be a hash function that maps the members of U to distinct integers, e.g., let h(x) = SHA1(x).

For any set S, define H_min(S) to be the minimal integer(hash value) of members in S:

H_min(S) = min({h(x) | x ∈ S})

Now, applying H_min to both A and B, and assuming no hash collisions, the probability that H_min(A) == H_min(B) is true is equal to the similarity J(A,B):

Pr[ H_min(A) = H_min(B) ] = J(A,B)

Estimate the number of common keys

Given two SSTables A and B, the number of common keys can be estimated from the Jaccard index. Since |A ∪ B| = |A| + |B| − |A ∩ B|:

|A ∩ B| = J(A,B) / (1 + J(A,B)) × (|A| + |B|)

Estimate the Probability

The probability can be estimated as y/k, where k is the number of different hash functions and y is the number of hash functions hᵢ for which H_min (under hᵢ) of A equals that of B.

Generate signature for a SSTable


#![allow(unused)]
fn main() {
fn gen_sstable_signature(sstable: &SSTable, number_of_hashes: usize) -> Vec<u64> {

    // Distribute hash values into `number_of_hashes` buckets to simulate
    // `number_of_hashes` independent hash functions.
    let mut buckets: Vec<Vec<u64>> = vec![Vec::new(); number_of_hashes];

    for k in sstable.keys() {
        // The first 8 bytes of a cryptographic hash, e.g., SHA1(k).
        let h: u64 = sha1_u64(k);
        buckets[(h as usize) % number_of_hashes].push(h);
    }

    // The signature is the minimal hash value in every bucket;
    // `u64::MAX` marks an empty bucket.
    let mut signature = Vec::with_capacity(number_of_hashes);
    for bucket in &buckets {
        signature.push(bucket.iter().copied().min().unwrap_or(u64::MAX));
    }

    signature
}
}

Calculate similarity of two SSTable


#![allow(unused)]

fn main() {
fn calc_similarity(a: &SSTable, b: &SSTable) -> f64 {
    let number_of_hashes = a.signature.len();
    let mut eq = 0.0;
    for i in 0..number_of_hashes {
        if a.signature[i] == b.signature[i] {
            eq += 1.0;
        }
    }
    eq / number_of_hashes as f64
}
}
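Combining the similarity estimate with the common-key formula gives a runnable end-to-end sketch. `estimate_common_keys` is an illustrative helper operating on plain signature slices, not part of openkv:

```rust
// Sketch: estimate the number of common keys of two SSTables from their
// min-hash signatures. J is estimated as y/k, then
// |A ∩ B| = J/(1+J) * (|A| + |B|), since |A ∪ B| = |A| + |B| - |A ∩ B|.
fn estimate_common_keys(sig_a: &[u64], sig_b: &[u64], len_a: usize, len_b: usize) -> f64 {
    let k = sig_a.len() as f64;
    // y: number of simulated hash functions on which the minima agree.
    let y = sig_a.iter().zip(sig_b.iter()).filter(|(x, y)| x == y).count() as f64;
    let j = y / k; // estimated Jaccard index
    j / (1.0 + j) * (len_a + len_b) as f64
}

fn main() {
    // Identical signatures: J = 1, so the estimate is (|A| + |B|) / 2.
    assert_eq!(estimate_common_keys(&[1, 2, 3, 4], &[1, 2, 3, 4], 100, 100), 100.0);
    // Fully different signatures: J = 0, no common keys.
    assert_eq!(estimate_common_keys(&[1, 2], &[3, 4], 100, 100), 0.0);
}
```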

Simulation

We provide a python script min-hash.py to estimate the accuracy of this algorithm.

Config                               Value
Number of hash functions (buckets)   128
Hash value                           u64
Space cost                           sizeof(u64) * 128 = 1KB

Actual vs Estimated:

|A∪B|     |A|       |B|       Actual (A∩B)/(A∪B)%   Estimated%   Error%
1000      360       840       20.00%                21.88%        1.87%
1000      520       880       40.00%                38.28%       -1.72%
1000      680       920       60.00%                60.94%        0.94%
1000      839       959       80.16%                78.91%       -1.25%
1000      1000      1000      100.00%               100.00%       0.00%
10000     3600      8400      20.00%                15.62%       -4.38%
10000     5200      8800      40.00%                35.16%       -4.84%
10000     6800      9200      60.00%                60.94%        0.94%
10000     8399      9599      80.02%                85.16%        5.14%
10000     10000     10000     100.00%               100.00%       0.00%
100000    36000     84000     20.00%                21.88%        1.87%
100000    52000     88000     40.00%                47.66%        7.66%
100000    68000     92000     60.00%                62.50%        2.50%
100000    83999     95999     80.00%                80.47%        0.47%
100000    100000    100000    100.00%               100.00%       0.00%
1000000   360000    840000    20.00%                19.53%       -0.47%
1000000   520000    880000    40.00%                40.62%        0.62%
1000000   680000    920000    60.00%                58.59%       -1.41%
1000000   839999    959999    80.00%                75.78%       -4.22%
1000000   1000000   1000000   100.00%               100.00%       0.00%

Optimize compaction with min-hash

With min-hash we can find the best compaction candidate.

For every SSTable, calculate the Jaccard index of it and each overlapping SSTable at the next lower level.

Push down the one with the max Jaccard index.

  • The space cost is negligible. Only 1KB more space for every SSTable.

  • The time complexity is O(n), where n is the number of SSTables in the system, because every SSTable has only about k SSTables with an overlapping key range, where k is the level fanout.

DataFile

Multiple SSTables are stored in one large file, the DataFile.

A DataFile only allocates or reclaims space in a fixed size: its allocation unit. There is one DataFile per allocation unit size.

Assuming the allocation units in openkv are 4MB and 16MB, there are two DataFiles: df-4mb and df-16mb.

Each DataFile has a corresponding bitmap tracking the allocated units.

The bitmap is part of the Manifest.

bitmap-4mb:    1  0  1  ...  0
DataFile-4mb:  SST  _  SST  ...

bitmap-16mb:   0  1  1  1  1  0  0  ...

Commit a SSTable

  • Find the first 0 in the df-bitmap. The bitmap has to be indexed to speed up searching for the first 0.

  • Flush and fsync SSTable data to DataFile.

  • Flush and fsync bitmap in manifest.
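The first step above, finding the first free unit, can be sketched over a word-packed bitmap (a minimal sketch without the index structure the doc mentions; `find_first_zero` is an illustrative name):

```rust
// Sketch of the allocation lookup: scan bitmap words and use
// trailing_ones() to locate the first 0 bit, i.e., the first free unit.
fn find_first_zero(bitmap: &[u64]) -> Option<usize> {
    for (i, &word) in bitmap.iter().enumerate() {
        if word != u64::MAX {
            // trailing_ones() is the index of the lowest 0 bit in `word`.
            return Some(i * 64 + word.trailing_ones() as usize);
        }
    }
    None // every unit in this DataFile is allocated
}

fn main() {
    // First word fully allocated; second word has units 0..=3 allocated.
    assert_eq!(find_first_zero(&[u64::MAX, 0b1111]), Some(68));
    assert_eq!(find_first_zero(&[u64::MAX]), None);
}
```

A production version would keep a hierarchical summary over the words, as the doc suggests, so the scan does not have to touch every word.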

Manifest

The Manifest in openkv is itself a tiny db: a WAL of operations applied to the system, plus a snapshot.

Snapshot:

version: String

data_file_bitmaps: {
    "4mb":  Bitmap,
    "16mb": Bitmap,
}

spans: [
    {
        start: String,
        end: String,
        vsstables: {
            0: [VSST1, VSST2, ...],
            1: [VSSTi, VSSTj, ...],
        }
    },
    ...
]

separated_values: [
    {
        start_value_id: String,
        end_value_id: String,
        vsstables: {
            0: [VSST1, VSST2, ...],
            1: [VSSTi, VSSTj, ...],
        }
    },
    ...
]