ADR-00001: Use FastCDC, a well-documented content-defined chunking strategy¶

Field	Value
Status	accepted
Date	2023-12-23
Deciders	Mattias Jansson, Manuel Lang, Joshua Cohen
Consulted	Yuriy O'Donnell, Stefan Boberg

Context and Problem Statement¶

Lore currently has an inefficient chunking strategy which utilizes xxHash, stepping one byte at a time, hashing 64 bytes and inserting a chunk border if the hash matches a certain pattern using modulo operator. We want to use a well-defined and tested content-defined chunking strategy which can ideally be adopted in the greater engine ecosystem.

Decision Drivers¶

Performance
Deduplication ratio
Ease of adoption in greater ecosystem

Considered Options¶

Current algorithm (xxHash with modulo)
FastCDC

Decision Outcome¶

FastCDC, due to increased performance while maintaining deduplication ratio in test data sets.

Consequences¶

Good, because performance of chunking is increased

Confirmation¶

A test was setup where the UE5 main repo was committed to Lore and statistics on total execution time, chunk size distribution and deduplication ratio was output. Results as follows.

FastCDC¶

Command executed in 326s

Chunk distribution, interesting ranges (around expected chunk size and at chunk size threshold)
Bucket [39424-39680): 39486 (2.29%)
Bucket [65280-65536): 2910 (0.17%)

Deduplication ratios
Deduplicated chunks 573100/2809256: 20.40%
Deduplicated bytes 12267781682/238314614264: 5.15%

xxHash with modulo¶

Command executed in 372s

Chunk distribution, interesting ranges (around expected chunk size and at chunk size threshold)
Bucket [39424-39680): 51624 (3.08%)
Bucket [65280-65536): 2148 (0.13%)

Deduplication ratios
Deduplicated chunks 518552/2719374: 19.07%
Deduplicated bytes 12298549446/237621986814: 5.18%

Pros and Cons of the Options¶

FastCDC¶

FastCDC paper (USENIX ATC '16)

Good, because the hash is a rolling hash with shift, add and array lookup (fp << 1) + G[b]
Good, because reasonably easy to configure to use case with bitmasks for hash pattern matching
Good, because it's a well-documented algorithm that can be widely adopted
Bad, because algorithm depends on inherent magic numbers specific to each implementation

Current algorithm¶

xxHash of 64 bytes, step one byte at a time, modulo hash with magic number to find pattern

Good, because it can be tweaked to the exact use case
Bad, because it's slow to calculate hash and step one byte at a time
Bad, because it's Lore-specific and hard for the greater ecosystem to adopt