Backing storage options to consider

  • Plain old filesystem (probably won’t scale to the caching granularity we might want to aspire to)

  • RocksDB / LevelDB

  • sled

    • Author recommends comparing against PingCAP’s TitanDB for the large blob workload:

      one DB that might be interesting to benchmark for your workload against sled is PingCAP’s titandb, which separates keys from values, and I think they might move values even less over time, so that could be one of the better options

      they use the WiscKey DB architecture of keeping keys in an LSM but values out of it so they can avoid moving values as often. this is the same approach used in the Badger DB for the Go ecosystem that is nice in some situations

    • TiKV for distributed stores would be interesting, though probably excessive (there’s not as much chance of conflict in a largely content‐addressed database)

  • LMDB

  • SQLite
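The key/value separation mentioned for TitanDB and Badger can be pictured with a toy model. This is only a sketch under loud assumptions: `SeparatedStore` and its methods are hypothetical names, an in-memory `BTreeMap` stands in for the LSM tree, and a `Vec<u8>` stands in for the value log on disk. None of it reflects the real API of any of the databases listed.

```rust
use std::collections::BTreeMap;

/// Toy store that keeps large values out of the ordered index.
/// ASSUMPTION: the index stands in for an LSM tree and the log for a value file.
struct SeparatedStore {
    index: BTreeMap<Vec<u8>, (usize, usize)>, // key -> (offset, length) in `log`
    log: Vec<u8>,                             // append-only value log
}

impl SeparatedStore {
    fn new() -> Self {
        SeparatedStore { index: BTreeMap::new(), log: Vec::new() }
    }

    /// Append the value to the log and record only its location in the index.
    fn put(&mut self, key: &[u8], value: &[u8]) {
        let offset = self.log.len();
        self.log.extend_from_slice(value);
        self.index.insert(key.to_vec(), (offset, value.len()));
    }

    /// Look up the location in the index, then slice the value out of the log.
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        let &(offset, len) = self.index.get(key)?;
        Some(&self.log[offset..offset + len])
    }
}

fn main() {
    let mut store = SeparatedStore::new();
    store.put(b"blob-a", &[0xAAu8; 1024]);
    store.put(b"blob-b", &[0xBBu8; 2048]);
    // Overwriting appends a new value; the old bytes are never moved, and
    // reclaiming them would need a garbage collection pass (not modelled).
    store.put(b"blob-a", &[0xCCu8; 16]);
    assert_eq!(store.get(b"blob-a"), Some(&[0xCCu8; 16][..]));
    println!("value log holds {} bytes", store.log.len());
}
```

Rearranging the index only ever moves small (offset, length) pairs, which is why this layout is attractive for a large blob workload.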

Investigate which of these options would need an accompanying large blob storage system (certainly SQLite) and then don’t bother using those options because that would be way too much work. The sled author’s remarks on the topic:

the general reason why most DBs will suggest punting to a FS for that kind of thing is because over time, databases will tend to rearrange items internally in order to perform defragmentation etc… sled will actually just spill over to using a file itself, but as I’m writing this it will only split leaf nodes when they hit 17 children, so if you have 17 1gb values all clustered on one leaf, sled will need to read that whole 17gb file in before serving your data

having said that, it isn’t that much work to change how sled splits leaf nodes to increase this granularity, so each 1gb value would get its own file

DBs usually assume they can copy values over and over pretty cheaply, and for most DB’s this is indeed sort of a nightmare workload

but right now, sled only splits leafs when they exceed the 16 child limit. in general I want to make this more size based, but as of right now it’s up to 16 values per node, and this stuff happens per-node rather than per-key

anyway, the bottom line is, measure it 🙂

as the creator of the thing I’ll always be aware of ways of making it better, but maybe for you it’s already good enough

Reference format

0x00/uu:…

A universally unique random identifier.

0x01/b3:…

A BLAKE3 hash.

Extensible! Let’s hope we never need more than 255 hash functions.

The hash/UUID payload is 31 bytes so that, together with the one‐byte tag, the reference as a whole is 256 bits.
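A minimal sketch of that layout, one tag byte followed by 31 payload bytes. The names `Kind`, `Reference`, and `from_digest` are hypothetical, and shortening a 32-byte digest by dropping its last byte is an assumption; the notes don't say how the payload is derived. The textual `b3:…` / `uu:…` rendering is left out because its encoding isn't specified here.

```rust
/// Tag bytes as listed above: 0x00 for random identifiers, 0x01 for BLAKE3.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Unique = 0x00,
    Blake3 = 0x01,
}

/// A 256-bit reference: one tag byte followed by a 31-byte payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Reference {
    kind: Kind,
    payload: [u8; 31],
}

impl Reference {
    /// ASSUMPTION: a 32-byte digest is shortened by dropping its last byte.
    fn from_digest(kind: Kind, digest: &[u8; 32]) -> Reference {
        let mut payload = [0u8; 31];
        payload.copy_from_slice(&digest[..31]);
        Reference { kind, payload }
    }

    /// Serialize as tag byte + payload, 32 bytes in total.
    fn to_bytes(&self) -> [u8; 32] {
        let mut out = [0u8; 32];
        out[0] = self.kind as u8;
        out[1..].copy_from_slice(&self.payload);
        out
    }

    /// Parse 32 bytes back into a reference; unknown tags are rejected.
    fn from_bytes(bytes: &[u8; 32]) -> Option<Reference> {
        let kind = match bytes[0] {
            0x00 => Kind::Unique,
            0x01 => Kind::Blake3,
            _ => return None, // tag not assigned yet
        };
        let mut payload = [0u8; 31];
        payload.copy_from_slice(&bytes[1..]);
        Some(Reference { kind, payload })
    }
}

fn main() {
    let digest = [0x42u8; 32]; // stand-in for a real BLAKE3 digest
    let r = Reference::from_digest(Kind::Blake3, &digest);
    let bytes = r.to_bytes();
    assert_eq!(bytes.len() * 8, 256);
    assert_eq!(bytes[0], 0x01);
    assert_eq!(Reference::from_bytes(&bytes), Some(r));
}
```

Rejecting unassigned tags in `from_bytes` keeps the format extensible: a new hash function only needs a new `Kind` variant and match arm.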

TODO: Come up with a name for uu:… identifiers that doesn’t clash with standard 128‐bit UUIDs.

# Hash a constant
$ mew ref 1234
b3:…

# Import and hash a file
$ mew ref -f ./file
b3:…

# Hash a file without importing into the forest
$ mew ref -n -f ./file
b3:…

# Download and import a file
$ mew ref -f https://example.com/example-1.0.tar.gz
b3:…

# Generate a unique random identifier
$ mew ref -u
uu:…

Investigate network protocol/RPC options

gRPC and Cap’n Proto are the main contenders here. Maybe figure out if gRPC could be used with Cap’n Proto payloads, or hand‐roll something based on Cap’n Proto + QUIC or HTTP/3.

Sketch out the mewl language

Somewhere between a purely functional shell and Dhall; implementation tightly integrated with the build store interface.

Ensuring adherence to object‐capability principles is Very Important™.

Implementation concerns

Grammar

It’d be nice to specify the language grammar as a hybrid parser‐pretty printer. I’ve always wanted to do that.

I’d like to think about how to reconcile a Roslyn‐style 1:1 concrete‐abstract syntax mapping, which preserves comments and whitespace, with Wadler‐style pretty-printing. The latter feels much more structured to me than twiddling whitespace fields according to a large library of rules.

Typed tree structure sketch

TreeSpec : Type
Dir : Map PathComponent TreeSpec → TreeSpec
Blob : TreeSpec
Executable : TreeSpec

Tree : TreeSpec → Type
get : (p : PathComponent) → ∀ts ⇒ Tree (Dir ts) → (e : p ∈ ts) ⇒ Tree (value e)
get? : PathComponent → ∀ts ⇒ Tree (Dir ts) → Option (∃t · Tree t)
execute : Tree Executable → …
execute? : ∀t ⇒ Tree t → Option …

— guaranteed to contain bin/hello, which you can execute (including
— from build rules), and share/doc/hello/README, a plain blob, but can
— also contain arbitrary other trees in addition
hello-tree : Tree (Dir {
  "bin" = Dir {"hello" = Executable},
  "share" = Dir {"doc" = Dir {"hello" = Dir {"README" = Blob}}},
})

— hello-tree ▻ get "bin" : Tree (Dir {"hello" = Executable})
— execute (hello-tree ▻ get "bin" ▻ get "hello") : …

Could probably use row types for this:

— essentially mapping from identifiers to the specified type
Row : Type → Type

⦃
  bin : Dir ⦃hello : Executable⦄,
  share : Dir ⦃doc : Dir ⦃hello : Dir ⦃README : Blob⦄⦄⦄,
⦄ : Row TreeSpec

Then we can get nicer bin : … syntax by punning on the fact that rows would also be used to specify record types.
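For comparison, here is a rough runtime approximation of the dynamic half of the sketch (`get?` and `execute?`) in Rust. It only models the spec side: the statically checked `get` and `execute` need the dependent or row types discussed above, which Rust can't express directly. The `dir` and `hello_tree` helpers and the `is_executable` method are hypothetical names introduced for illustration.

```rust
use std::collections::BTreeMap;

/// Mirrors the `TreeSpec` sketch: a directory of named sub-specs, a blob,
/// or an executable.
#[derive(Debug, Clone, PartialEq)]
enum TreeSpec {
    Dir(BTreeMap<String, TreeSpec>),
    Blob,
    Executable,
}

impl TreeSpec {
    /// Runtime analogue of `get?`: succeeds only if `self` is a directory
    /// whose spec guarantees `component`.
    fn get(&self, component: &str) -> Option<&TreeSpec> {
        match self {
            TreeSpec::Dir(entries) => entries.get(component),
            _ => None,
        }
    }

    /// Runtime analogue of the check behind `execute?`.
    fn is_executable(&self) -> bool {
        matches!(self, TreeSpec::Executable)
    }
}

/// Hypothetical helper for building directory specs tersely.
fn dir(entries: Vec<(&str, TreeSpec)>) -> TreeSpec {
    TreeSpec::Dir(entries.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

/// The spec of `hello-tree` from the sketch above.
fn hello_tree() -> TreeSpec {
    dir(vec![
        ("bin", dir(vec![("hello", TreeSpec::Executable)])),
        ("share", dir(vec![(
            "doc",
            dir(vec![("hello", dir(vec![("README", TreeSpec::Blob)]))]),
        )])),
    ])
}

fn main() {
    let tree = hello_tree();
    let hello = tree.get("bin").and_then(|bin| bin.get("hello"));
    assert_eq!(hello.map(TreeSpec::is_executable), Some(true));
    // Paths the spec does not guarantee yield None instead of a type error.
    assert_eq!(tree.get("lib"), None);
}
```

The difference from the typed version is exactly the `Option`: the dependent `get` rules out the `None` case at compile time, while this version can only report it at runtime.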

Investigate Guix properly

and raid it for ideas.

Incremental computation for configuration management

Adapton seems like it should have insights that are applicable to modelling Ansible/Terraform‐style reconciliation of configuration with state in a pure system.

Toolchains and bootstrapping

Shiz’s LLVM + clang + LLD + elftoolchain + compiler-rt + libc++ Linux toolchain work is probably worth referencing.

Meta

Set up CI

Investigate Azure Pipelines and GitHub Actions.

Set up bors

This will probably be really annoying in the early stages of hacking, depending on the latency.

It would be good to integrate the bors setup with git-test to test all the commits of a pull request rather than just the HEAD.

Set up and require commit signing

As with bors, this will probably be annoying in the early stages of hacking, though tapping a YubiKey a few times when pushing to the public repository isn’t too bad.

Move to self‐hosted infrastructure

GitHub supports ICE, and it would be nice to have the root of trust for binary builds under our direct control.

This would require manually administering build and VCS machines, prevent the use of the existing bors implementation, and substantially increase the barrier to contribution, so it should be done carefully. Ideally, people would still be able to contribute via GitHub issues and pull requests, with those automatically mirrored to the self‐hosted infrastructure.

Prohibit force pushes

Let’s not get ahead of ourselves here.