Skip to content

ADR-00001: Use FastCDC, a well-documented content-defined chunking strategy

Field Value
Status accepted
Date 2023-12-23
Deciders Mattias Jansson, Manuel Lang, Joshua Cohen
Consulted Yuriy O'Donnell, Stefan Boberg

Context and Problem Statement

Lore currently has an inefficient chunking strategy which utilizes xxHash, stepping one byte at a time, hashing 64 bytes and inserting a chunk border if the hash matches a certain pattern using modulo operator. We want to use a well-defined and tested content-defined chunking strategy which can ideally be adopted in the greater engine ecosystem.

Decision Drivers

  • Performance
  • Deduplication ratio
  • Ease of adoption in greater ecosystem

Considered Options

  • Current algorithm (xxHash with modulo)
  • FastCDC

Decision Outcome

FastCDC, due to increased performance while maintaining deduplication ratio in test data sets.

Consequences

  • Good, because performance of chunking is increased

Confirmation

A test was setup where the UE5 main repo was committed to Lore and statistics on total execution time, chunk size distribution and deduplication ratio was output. Results as follows.

FastCDC

Command executed in 326s

Chunk distribution, interesting ranges (around expected chunk size and at chunk size threshold)
Bucket [39424-39680): 39486 (2.29%)
Bucket [65280-65536): 2910 (0.17%)

Deduplication ratios
Deduplicated chunks 573100/2809256: 20.40%
Deduplicated bytes 12267781682/238314614264: 5.15%

xxHash with modulo

Command executed in 372s

Chunk distribution, interesting ranges (around expected chunk size and at chunk size threshold)
Bucket [39424-39680): 51624 (3.08%)
Bucket [65280-65536): 2148 (0.13%)

Deduplication ratios
Deduplicated chunks 518552/2719374: 19.07%
Deduplicated bytes 12298549446/237621986814: 5.18%

Pros and Cons of the Options

FastCDC

FastCDC paper (USENIX ATC '16)

  • Good, because the hash is a rolling hash with shift, add and array lookup (fp << 1) + G[b]
  • Good, because reasonably easy to configure to use case with bitmasks for hash pattern matching
  • Good, because it's a well-documented algorithm that can be widely adopted
  • Bad, because algorithm depends on inherent magic numbers specific to each implementation

Current algorithm

xxHash of 64 bytes, step one byte at a time, modulo hash with magic number to find pattern

  • Good, because it can be tweaked to the exact use case
  • Bad, because it's slow to calculate hash and step one byte at a time
  • Bad, because it's Lore-specific and hard for the greater ecosystem to adopt