Rustlog : Worklog 2015-03-06

I have recently pushed three pull requests into Rust: #22839, #22971 and #23060. Each has its own significance, but today I'll discuss about last two PRs primarily because they strictly remove lots of redundant data from the Rust distribution.

Anatomy of Rust Distribution

Let's look at 2015-03-03 nightly, which is the last nightly not affected by two PRs. It is 144.8 MB after gzipped (Note: this is 144.8 times 10⁶ bytes, to be exact), and 634.5 MB before compression. More precisely:

rustc directory contains a main Rust distribution, and is 296.3 MB and 101.8 MB before and after compression.
- Non-library files have negligible size (less than a megabyte).
- We ship at most three copies of files per crate: Shared libraries from snapshot compiler, shared libraries for the compiled Rust executable, and optionally static counterparts of them.
  
  We can omit static libraries from the distribution. This completely removes the ability to statically link, but this would remove 48.1 MB and 11.3 MB of binary before and after compression.
- As previously mentioned, metadata takes much space. Some of them are compressed with gzip and actually does not compress well in the tarball. Some of them, prominently in static libraries, are not compressed so that they can be conveniently mmapped.
  
  Static libraries have 93.4 MB of uncompressed metadata. Shared libraries have 32.7 MB of compressed metadata. Assuming that they have same compression ratio as the overall directory (and that already compressed metadata do not compress at all), they will take about 65 MB in the gzipped tarball.
rust-docs directory contains generated HTML documentation (which includes compiler crates, as in make compiler-docs) and is 313.3 MB and 40.0 MB before and after compression.
- There are particularly large crates: rustc, rustc_trans, std, rustc_typeck and rustc_lint crates take 218.6 MB and 30.6 MB before and after compression.
- If we actually try to remove compiler crates, that will remove at least 204.6 MB and 29.4 MB before and compression. Why "at least"? That will also remove files from other directories, but that's cumbersome to quantify.
- The HTML documentation also essentially reproduces a public portion of each crate's source code, which takes 23.8 MB and 5.4 MB before and after compression.
cargo directory contains (of course) Cargo, and is 8.9 MB and 3.4 MB before and after compression.

Ways to Enlightenment (or sorta)

There are several ways to shrink the tarball.

First, we observe that the docs are technically same contents as the shipped libraries. That would mean, if possible, generating docs from the compiled library on the fly will completely remove docs from the distribution! #19606 was my initial attempt to do that, but it had several obstacles I wasn't able to tackle on time.

There are some alternative solutions with docs:

Move everything to JavaScript with an optimized format. The "HTML" documentation will actually be a huge web application that renders the page.
As a variation of 1, make it a Rust web server. If we don't care about bundling Hyper with Rust, why not.
Remove compiler docs and keep others as is. This might be a good trade-off as many end users won't bother looking at them.
Remove source codes from the docs. (#23601) We already have a separate tarball for source codes, so let users download them if they really want. Of course, this alone won't make much effect.

Second, we can shrink the metadata. The metadata was based on the EBML but we had very, very different use cases compared to Matroska:

We don't have lots of tags. EBML tags are quite general (and tuned) because each tag essentially is a unique identifier and there can be lots of them in one file. Rust metadata is different: we have about one hundred or so specialized tags and some dozens for auto-serialized AST.
We don't have a proper schema. As mentioned above, auto-serialized AST is almost incomprehensible without the exact knowledge of data structure. Still, somehow, we can navigate through that. There are primarily two classes of serialization formats, schematic (like protobuf) and schemaless (like JSON), and our use case is neither of them.
We don't perform relaxation. That is, we don't encode the length optimally, as we generally don't know the length of the following data in advance. This is also a big limitation of EBML: it does not have an indefinitely-sized container. Moreover, we occasionally refer to other portion of encoded metadata by a simple byte offset. This makes the proper relaxation quite hard.
EBML itself does not define an encoding for primitive values (not uncommon with schematic serialization formats), and our ad-hoc encoding for them is very inefficient. For example, we have used 4 bytes for values that occur some hundred thousand times and is less than 100.

The main direction on metadata would be either ditching EBML or incrementally improving that. This was a subject of 2.5 year old bug, and I always had an eye on that.

Third, we can have only one copy of metadata per crate. This calls for two main prerequisites: rustc should be stable enough that we can combine rustc libraries and user code libraries, and rustc should have a knowledge of external metadata, which would look similar to Windows PDB file. That'd be a huge undertaking though.

Lastly, we can simply switch to something better than gzip. Gzip is very old algorithm: it relies on two-score-old modelling algorithm—LZ77—and suboptimal coding algorithm—skewed Huffman tree. Its use of LZ77 is also suboptimal, as the matching window is limited in 64 KB and anything beyond that won't be deduplicated. The existing suggestion was to use xz for tarballs (#21724) and Snappy or LZ4 for metadata (#6902). They should use different algorithms as metadata should decompress quickly.

Achievements

So what have I done? Warning: Bragging follows.

The first PR, #22971, changes the metadata encoding to greatly reduce the inherent overhead of EBML. (In fact, it is now completely different from EBML!) I kept somewhat debatable nature of navigable serialization format, which needs schema for complete decoding but is enough self-structured that can be inspected without much effort. @eddyb told me that he really wants to get rid of that nature, and I guess the future PRs would address that.

The second PR, #23060, is very cost-effective one. We all know that compiler crates are large, but it becomes suspicious when docs for librustc are four times larger than those for libstd while libstd actually has more code than librustc (!). This ultimately traced to the quadratic growth of sidebars: When the module contains N items, there would be N sidebars with N items each. librustc had a large LLVM binding module, which caused a huge bloat. The solution was to move them into a shared JavaScript file per module.

Was that effective? I think so. In fact, if my measurement is correct, the updated tarball should be 35 MB smaller than the original, and the uncompressed size should be halved. Note that at the time of writing, the first PR has been already deployed and that resulted in universal 10 MB decrease in size. I too was surprised at the numbers, as my initial goal was just to reduce some 30% of metadata. Actually I was able to reduce 30% of entire tarball. Great.

There are still many possible improvements on the distribution size. I welcome any suggestion, concrete proposal or implementation; I hope this post to motivate anyone interested in this task.