MemTable
MemTable in LSM is the in-memory representation of key-value pairs.
It can be considered a `BTreeMap`.
There is only one active MemTable for every openkv instance; it is shared by all sub-LSMs.
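As an illustration, the MemTable could be sketched as an ordered map (the struct and method names below are assumptions, not actual openkv types):

```rust
use std::collections::BTreeMap;

/// A minimal sketch of the active MemTable: an ordered map from key to
/// (sequence number, optional value); `None` marks a tombstone.
struct MemTable {
    entries: BTreeMap<Vec<u8>, (u64, Option<Vec<u8>>)>,
    size_in_bytes: usize,
}

impl MemTable {
    fn put(&mut self, key: Vec<u8>, seq: u64, value: Vec<u8>) {
        self.size_in_bytes += key.len() + value.len();
        self.entries.insert(key, (seq, Some(value)));
    }

    fn delete(&mut self, key: Vec<u8>, seq: u64) {
        self.size_in_bytes += key.len();
        self.entries.insert(key, (seq, None)); // tombstone
    }
}
```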
Flush
Flush MemTable to SSTable.
If the MemTable becomes too large, flush one contiguous portion (a span) of it into a sub-LSM.
The MemTable always flushes the biggest span, i.e., the one with the most keys, unless there is pressure to reclaim WAL space.
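A sketch of this flush policy (the `Span` fields here are hypothetical):

```rust
/// Hypothetical span descriptor: a contiguous key range of the MemTable.
struct Span {
    key_count: usize,
    /// Offset of the oldest WAL entry this span still pins.
    oldest_wal_offset: u64,
}

/// Pick the span to flush: the one with the most keys, unless WAL space
/// must be reclaimed, in which case pick the span pinning the oldest WAL.
fn pick_span_to_flush(spans: &[Span], wal_pressure: bool) -> Option<&Span> {
    if wal_pressure {
        spans.iter().min_by_key(|s| s.oldest_wal_offset)
    } else {
        spans.iter().max_by_key(|s| s.key_count)
    }
}
```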
SSTable
An SSTable (sorted-string table) is the same concept as in LevelDB or RocksDB.
The size of an SSTable can only be 4MB (for a tombstone-only SSTable) or 16MB (for a regular SSTable).
The data layout inside an SSTable:
Header
`Header` is fixed-size and contains the SSTable format version, etc.
Metas
`Metas` stores the configs of this SSTable, such as key types, value types, etc.
E.g., a fixed-size key or value may use a type-specific storage format: a fixed-size key does not need to store a key-suffix in the `Records` segment.
TODO: prove it.
`Metas` also stores the cardinality signature (min-hash) of all keys to optimize SSTable compaction.
Index
`Index` is a trie built from all keys stored in this SSTable, with single-branch tails removed.
A leaf node in the trie stores the offset of the corresponding record in `Records`.
It contains enough info to locate a present key in the `Records` segment, but there can be false positives for keys not in `Records`.
For example, given 3 key-values:

```
fantastice: 1
foo: 5
food: 3
```

The Index would be like this:
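A sketch of the resulting trie, with single-branch tails cut and leaves holding record offsets:

```
f
├── a          -> offset of record "fantastice" (tail "ntastice" cut)
└── oo
    ├── (end)  -> offset of record "foo"
    └── d      -> offset of record "food"
```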
Caching
The `Index` is preferred to reside in memory, while `Records` is left on disk.
Sparse Index
Not all keys in the `Index` have a corresponding offset stored for them.
Since the optimization is IO-oriented, reading several bytes costs almost the same as reading several kilobytes.
Thus the `Index` only needs to store one offset for several adjacent keys, and we just need to limit the distance between two adjacent offsets in the `Index` to be less than an expected value, e.g., 16 KB.
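A minimal sketch of building such a sparse index, assuming `records` is the sorted list of (key, offset) pairs and using the 16 KB gap limit from above:

```rust
/// Keep one offset per window of adjacent keys, so that two adjacent
/// stored offsets are less than MAX_GAP bytes apart. Illustrative only.
const MAX_GAP: u64 = 16 * 1024;

fn build_sparse_index(records: &[(Vec<u8>, u64)]) -> Vec<(Vec<u8>, u64)> {
    let mut index = Vec::new();
    let mut last_offset: Option<u64> = None;
    for (key, offset) in records {
        match last_offset {
            // Still within MAX_GAP of the last stored offset: skip this key.
            Some(last) if *offset - last < MAX_GAP => {}
            _ => {
                index.push((key.clone(), *offset));
                last_offset = Some(*offset);
            }
        }
    }
    index
}
```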
Records
The `Records` segment stores the key suffixes that are not included in the `Index`, and the values.
`Records` is indexed by the offsets stored in the `Index`.

- `key-suffix` may be empty.
- `seq` is a monotonic sequence number.
- `value` is var-len serialized `bytes`.
Checksum
SHA256 of all preceding bytes in the SSTable.
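Putting the segments together, the layout could be sketched like this (the field names and types are illustrative assumptions, not the actual openkv on-disk format):

```rust
/// Illustrative sketch of the SSTable segments described above.
struct Header {
    format_version: u32, // fixed-size header fields
}

struct Metas {
    // key/value type configs, plus the min-hash signature of all keys
    signature: Vec<u64>,
}

struct Record {
    key_suffix: Vec<u8>, // may be empty
    seq: u64,            // monotonic sequence number
    value: Vec<u8>,      // var-len serialized bytes
}

struct SSTableLayout {
    header: Header,
    metas: Metas,
    index: Vec<u8>,       // serialized trie; leaves hold offsets into records
    records: Vec<Record>,
    checksum: [u8; 32],   // SHA256 of all preceding bytes
}
```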
Positive and Negative SSTable
One flush builds 2 SSTables:
- One contains all inserted or updated records: the Positive SSTable.
- The other contains only deleted records, i.e., tombstones: the Negative SSTable.

Both have the same level and the same key range. In other words, updated records and deleted records are stored separately.
By separating P/N records, it is possible to compact an N-SSTable down to reclaim space quickly.
E.g., a Negative SSTable `N1` can be pushed down and merged with `N2` and `N3` without touching `P1`, because `P1` is transparent to `N1`: `N1 ∩ P1 = ø`.
Reading a record
- Search the key in the `Index`. If it matches a path in the trie, scan `Records` from the offset up to the next offset to find the record (see the sketch below).
- If the key-suffix also matches the search key, return the value.
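A sketch of this read path (the lookup/scan helpers are hypothetical stand-ins for `Index` and `Records` access):

```rust
/// Sketch: records are (key_suffix, value) pairs; `trie_lookup` returns
/// (offset, next_offset, matched_len) for a key whose path exists in the trie.
fn get(
    trie_lookup: impl Fn(&[u8]) -> Option<(usize, usize, usize)>,
    scan_records: impl Fn(usize, usize) -> Vec<(Vec<u8>, Vec<u8>)>,
    key: &[u8],
) -> Option<Vec<u8>> {
    // The trie may report a false positive: it only proves the path exists.
    let (offset, next_offset, matched_len) = trie_lookup(key)?;
    for (key_suffix, value) in scan_records(offset, next_offset) {
        // Verify against the part of the key not consumed by the trie path.
        if key_suffix.as_slice() == &key[matched_len..] {
            return Some(value);
        }
    }
    None // the key is not in this SSTable
}
```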
Performance considerations
- A trie naturally compresses prefixes, and an SSTable only stores a small range of keys, thus the space cost is low enough. For a static trie, a succinct data structure can be adopted to reduce its size. Our budget for a key is about several bytes, so keeping the `Index` in memory is possible and unnecessary IO can be reduced.
- A trie is not cache friendly, thus a single trie should not be very large. The expected time cost of a query in the `Index` is less than 200 ns.
Virtual SSTable
`VSSTable` (virtual SSTable) is a unit to manage SSTables. It is needed because an SSTable may be referenced more than once, e.g., when splitting a span.
An SSTable is removed when there is no VSSTable referencing it.
A VSSTable has a key range that is a subset of the key range of the SSTable it references.
```rust
struct VSSTable {
    key_start: String,
    key_end: String,
    sstable: SSTableId,
}
```
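For example, splitting a span in two could produce two VSSTables that reference the same physical SSTable (a self-contained sketch; `SSTableId` as a plain copyable id is an assumption):

```rust
#[derive(Clone, Copy)]
struct SSTableId(u64); // hypothetical id type

struct VSSTable {
    key_start: String,
    key_end: String,
    sstable: SSTableId,
}

fn main() {
    // Splitting the span ["a", "z") at "m": both halves reference the same
    // SSTable; the file is reclaimed only after both VSSTables are dropped.
    let id = SSTableId(42);
    let left = VSSTable { key_start: "a".into(), key_end: "m".into(), sstable: id };
    let right = VSSTable { key_start: "m".into(), key_end: "z".into(), sstable: id };
    let _ = (left, right);
}
```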
Separate value
Sharding
One of the performance issues of an LSM tree is write/read amplification.
When a db becomes bigger, the number of levels (`l`) in an LSM increases in logarithmic order.
A record is rewritten (by compaction) `l` times before it reaches the bottom level.
Assuming the fanout of every level is `n`, every record amplifies write IO by about `l * n` times: each of the `l` push-downs merges the record with roughly `n` overlapping SSTables at the next level.
By splitting the LSM into several smaller ones, `l` becomes smaller and the write/read amplification is reduced.
Thus openkv organizes its data in a way that resembles a two-level BTree (sketched below):
- The btree root node is an array of all sub-LSMs, sorted in key order.
- Each leaf node is a small LSM.
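A sketch of this two-level organization (`SubLsm` and its fields are assumptions):

```rust
/// Illustrative sketch: the "root node" is a sorted array of sub-LSMs.
struct SubLsm {
    key_start: Vec<u8>, // inclusive lower bound of this sub-LSM's key range
    // levels, sstables, ...
}

struct OpenKv {
    /// Sorted by `key_start`; a key is routed by binary search.
    subs: Vec<SubLsm>,
}

impl OpenKv {
    fn route(&self, key: &[u8]) -> Option<&SubLsm> {
        // Count the sub-LSMs whose range starts at or before `key`;
        // the last of those owns the key.
        let i = self.subs.partition_point(|s| s.key_start.as_slice() <= key);
        i.checked_sub(1).map(|i| &self.subs[i])
    }
}
```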
Sub-LSM
Sub-LSM is a small LSM tree with a limited number of levels.
Split and Merge
- A sub-LSM is split if its number of levels exceeds a threshold (e.g., 3).
- Two adjacent sub-LSMs are merged into one if both of them become lower than 1/3 of the threshold.
Compaction
Push an SSTable down and merge it with all overlapping SSTables at the next lower level.
Keep only the record with the greatest `seq` when merging.
When reaching the bottom level, i.e., level-0, a tombstone record can be removed for good.
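A sketch of this merge rule, using `Option` values with `None` as tombstone (real compaction would stream sorted inputs; this collects them for brevity):

```rust
/// Keep only the record with the greatest `seq` per key; drop tombstones
/// (`None` values) when merging into the bottom level. Illustrative only.
fn merge(
    mut records: Vec<(Vec<u8>, u64, Option<Vec<u8>>)>, // (key, seq, value)
    bottom_level: bool,
) -> Vec<(Vec<u8>, u64, Option<Vec<u8>>)> {
    // Sort by key, then by descending seq, so the newest record comes first.
    records.sort_by(|a, b| a.0.cmp(&b.0).then(b.1.cmp(&a.1)));

    let mut out = Vec::new();
    let mut last_key: Option<Vec<u8>> = None;
    for (key, seq, value) in records {
        if last_key.as_deref() == Some(key.as_slice()) {
            continue; // superseded by a record with a greater seq
        }
        last_key = Some(key.clone());
        if bottom_level && value.is_none() {
            continue; // tombstone removed for good at the bottom level
        }
        out.push((key, seq, value));
    }
    out
}
```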
Min-hash
min-hash is a signature of all keys in an SSTable, used to optimize compaction, e.g., to find the pair of SSTables with the most common keys.
Jaccard similarity
Let `U` be a set and `A` and `B` be subsets of `U`; then the Jaccard index is defined to be the ratio of the number of elements of their intersection to the number of elements of their union:

J(A,B) = |A ∩ B| / |A ∪ B|
Let `h` be a hash function that maps the members of `U` to distinct integers, e.g., let `h(x) = SHA1(x)`.
For any set `S`, define `H_min(S)` to be the minimal integer (hash value) of the members of `S`:

H_min(S) = min({h(x) | x ∈ S})
Now, applying `H_min` to both `A` and `B`, and assuming no hash collisions, the probability that `H_min(A) == H_min(B)` holds is exactly the similarity `J(A,B)`:

Pr[ H_min(A) = H_min(B) ] = J(A,B)
Estimate the number of common keys
Given two SSTables `A` and `B`, first estimate `J(A,B)` from their signatures (see below). Since J(A,B) = |A ∩ B| / (|A| + |B| - |A ∩ B|), solving for the intersection gives the number of common keys:

|A ∩ B| = J(A,B) * (|A| + |B|) / (1 + J(A,B))
Estimate the Probability
The probability can be estimated with `y/k`, where `k` is the number of different hash functions and `y` is the number of hash functions `hᵢ` for which `H_min(A) == H_min(B)` holds under `hᵢ`.
Generate the signature for an SSTable
```rust
/// Build the min-hash signature of an SSTable.
/// `sha1_u64(key)` stands for the first 8 bytes of SHA1(key) as a u64.
fn gen_sstable_signature(sstable: &SSTable, number_of_hashes: usize) -> Vec<u64> {
    // Distribute hash values to `k` buckets to simulate `k` hash functions.
    let mut buckets: Vec<Vec<u64>> = vec![Vec::new(); number_of_hashes];
    for k in sstable.keys() {
        let h = sha1_u64(k);
        buckets[(h as usize) % number_of_hashes].push(h);
    }

    // The resulting signature is the minimal hash value in every bucket.
    buckets
        .iter()
        .map(|b| b.iter().copied().min().unwrap_or(u64::MAX)) // MAX: empty bucket
        .collect()
}
```
Calculate the similarity of two SSTables
```rust
/// Estimate the Jaccard similarity of two SSTables from their signatures.
fn calc_similarity(a: &SSTable, b: &SSTable, number_of_hashes: usize) -> f64 {
    let mut eq = 0usize;
    for i in 0..number_of_hashes {
        // Compare the i-th simulated hash function of both signatures.
        if a.signature[i] == b.signature[i] {
            eq += 1;
        }
    }
    eq as f64 / number_of_hashes as f64
}
```
Simulation
We provide a Python script, min-hash.py, to estimate the accuracy of this algorithm.
Config | Value |
---|---|
Number of hash functions (buckets) | 128 |
Hash value | u64 |
Space cost | sizeof(u64) * 128 = 1KB |
Actual vs Estimated:
|A∪B| | |A| | |B| | Actual (A∩B)/(A∪B)% | Estimated% | error% |
---|---|---|---|---|---|
1000 | 360 | 840 | 20.00% | 21.88% | 1.87% |
1000 | 520 | 880 | 40.00% | 38.28% | -1.72% |
1000 | 680 | 920 | 60.00% | 60.94% | 0.94% |
1000 | 839 | 959 | 80.16% | 78.91% | -1.25% |
1000 | 1000 | 1000 | 100.00% | 100.00% | 0.00% |
10000 | 3600 | 8400 | 20.00% | 15.62% | -4.38% |
10000 | 5200 | 8800 | 40.00% | 35.16% | -4.84% |
10000 | 6800 | 9200 | 60.00% | 60.94% | 0.94% |
10000 | 8399 | 9599 | 80.02% | 85.16% | 5.14% |
10000 | 10000 | 10000 | 100.00% | 100.00% | 0.00% |
100000 | 36000 | 84000 | 20.00% | 21.88% | 1.87% |
100000 | 52000 | 88000 | 40.00% | 47.66% | 7.66% |
100000 | 68000 | 92000 | 60.00% | 62.50% | 2.50% |
100000 | 83999 | 95999 | 80.00% | 80.47% | 0.47% |
100000 | 100000 | 100000 | 100.00% | 100.00% | 0.00% |
1000000 | 360000 | 840000 | 20.00% | 19.53% | -0.47% |
1000000 | 520000 | 880000 | 40.00% | 40.62% | 0.62% |
1000000 | 680000 | 920000 | 60.00% | 58.59% | -1.41% |
1000000 | 839999 | 959999 | 80.00% | 75.78% | -4.22% |
1000000 | 1000000 | 1000000 | 100.00% | 100.00% | 0.00% |
Optimize compaction with min-hash
With min-hash we can find the best candidate to compact: for every SSTable, calculate the Jaccard index between it and the overlapping SSTables at the next lower level, then push down the one with the max Jaccard index.
- The space cost is negligible: only 1KB more space for every SSTable.
- The time complexity is `O(n)`, where `n` is the number of SSTables in the system, because for every SSTable there are only about `k` SSTables with an overlapping key range, where `k` is the level fanout.
DataFile
Multiple SSTables are stored in one large file, the DataFile.
A DataFile only allocates or reclaims space in units of a fixed size, its allocation unit. There is one DataFile for each allocation-unit size.
Assuming the allocation units in openkv are 4MB and 16MB, there are two DataFiles: `df-4mb` and `df-16mb`.
Each DataFile has a corresponding bitmap tracking allocated units.
The bitmap is part of the manifest.
Commit an SSTable
- Find the first `0` in the df-bitmap (see the sketch below). The bitmap has to be indexed to speed up searching for the first `0`.
- Flush and fsync the SSTable data to the DataFile.
- Flush and fsync the bitmap in the manifest.
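A sketch of these steps, using a plain `Vec<bool>` for the bitmap and a `std::fs::File` for the DataFile (error handling and the manifest fsync are simplified):

```rust
/// Sketch of committing an SSTable into a DataFile.
fn commit_sstable(
    bitmap: &mut Vec<bool>,        // df-bitmap: one bit per allocation unit
    unit_size: u64,                // e.g. 4MB or 16MB
    data_file: &mut std::fs::File,
    sstable_bytes: &[u8],
) -> std::io::Result<()> {
    use std::io::{Seek, SeekFrom, Write};

    // 1. Find the first `0` in the bitmap (an index would speed this up).
    let unit = bitmap.iter().position(|b| !*b).expect("data file full");

    // 2. Flush and fsync the SSTable data to the DataFile.
    data_file.seek(SeekFrom::Start(unit as u64 * unit_size))?;
    data_file.write_all(sstable_bytes)?;
    data_file.sync_data()?;

    // 3. Mark the unit allocated; flushing and fsyncing the bitmap as
    //    part of the manifest is elided here.
    bitmap[unit] = true;
    Ok(())
}
```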
Manifest
The manifest in openkv is itself a tiny db: a WAL of operations applied to the system, plus a snapshot.
Snapshot:

```
version: String
data_file_bitmaps: {
    "4mb": Bitmap,
    "16mb": Bitmap,
}
spans: [
    {
        start: String,
        end: String,
        vsstables: {
            0: [VSST1, VSST2, ...],
            1: [VSSTi, VSSTj, ...],
        }
    },
    ...
]
separated_values: [
    {
        start_value_id: String,
        end_value_id: String,
        vsstables: {
            0: [VSST1, VSST2, ...],
            1: [VSSTi, VSSTj, ...],
        }
    },
    ...
]
```