Worklog 2015-03-06

Rustlog

I have recently pushed three pull requests into Rust: #22839, #22971 and #23060. Each has its own significance, but today I'll mainly discuss the last two, because they strictly remove lots of redundant data from the Rust distribution.

Anatomy of Rust Distribution

Let's look at the 2015-03-03 nightly, which is the last nightly not affected by the two PRs. It is 144.8 MB after gzip compression (Note: this is 144.8 times 10^6 bytes, to be exact), and 634.5 MB before compression. More precisely:

Ways to Enlightenment (or sorta)

There are several ways to shrink the tarball.

First, we observe that the docs have technically the same content as the shipped libraries. That means that, if possible, generating docs from the compiled libraries on the fly would completely remove docs from the distribution! #19606 was my initial attempt to do that, but it had several obstacles I wasn't able to tackle in time.

There are some alternative solutions with docs:

  1. Move everything to JavaScript with an optimized format. The "HTML" documentation will actually be a huge web application that renders the page.

  2. As a variation of 1, make it a Rust web server. If we don't care about bundling Hyper with Rust, why not?

  3. Remove compiler docs and keep others as is. This might be a good trade-off as many end users won't bother looking at them.

  4. Remove the source code from the docs. (#23601) We already have a separate tarball for the source code, so let users download it if they really want it. Of course, this alone won't have much effect.

Second, we can shrink the metadata. The metadata was based on EBML, but we had very, very different use cases compared to Matroska:

The main direction for metadata would be either ditching EBML or incrementally improving it. This was the subject of a 2.5-year-old bug, and I always had an eye on it.

Third, we can have only one copy of metadata per crate. This has two main prerequisites: rustc should be stable enough that we can combine rustc libraries and user code libraries, and rustc should know about external metadata, which would look similar to a Windows PDB file. That'd be a huge undertaking though.

Lastly, we can simply switch to something better than gzip. Gzip is a very old format: it relies on a roughly two-score-old modelling algorithm (LZ77) and a suboptimal coding algorithm (a skewed Huffman tree). Its use of LZ77 is also suboptimal, as the matching window is limited to 32 KB and anything beyond that won't be deduplicated. The existing suggestion was to use xz for tarballs (#21724) and Snappy or LZ4 for metadata (#6902). They should use different algorithms, as metadata should decompress quickly.
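To make the window limitation concrete, here is a minimal sketch in Python (not something used by the Rust build itself). It compresses the same incompressible block twice in a row, once with zlib's DEFLATE (the algorithm behind gzip) and once with xz via the standard lzma module. The block size of 256 KiB and the function name are my own choices for illustration; the only point is that the block is far larger than DEFLATE's window but well within xz's dictionary.

```python
import lzma
import random
import zlib


def duplicated_block_sizes(block_size=256 * 1024, seed=0):
    """Compress `block + block` with DEFLATE and xz and report the sizes.

    The block is pseudo-random, so it cannot be compressed on its own;
    any saving has to come from noticing that the second copy repeats
    the first one. The block size is an illustrative assumption.
    """
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    data = block + block
    return {
        "raw": len(data),
        # DEFLATE cannot look back far enough to find the first copy.
        "deflate": len(zlib.compress(data, 9)),
        # xz at preset 6 uses a dictionary of several megabytes.
        "xz": len(lzma.compress(data, preset=6)),
    }


sizes = duplicated_block_sizes()
for name, size in sizes.items():
    print(f"{name:8s}{size:>10d} bytes")
```

With these inputs the DEFLATE output should stay close to the raw size, while the xz output should be close to the size of a single block, since the second copy is encoded as back-references into the first.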

Achievements

So what have I done? Warning: Bragging follows.

The first PR, #22971, changes the metadata encoding to greatly reduce the inherent overhead of EBML. (In fact, it is now completely different from EBML!) I kept the somewhat debatable nature of a navigable serialization format, which needs a schema for complete decoding but is self-structured enough that it can be inspected without much effort. @eddyb told me that he really wants to get rid of that nature, and I guess future PRs will address that.

The second PR, #23060, is a very cost-effective one. We all know that compiler crates are large, but it becomes suspicious when the docs for librustc are four times larger than those for libstd while libstd actually has more code than librustc (!). This was ultimately traced to the quadratic growth of sidebars: when a module contains N items, there are N sidebars with N items each. librustc had a large LLVM binding module, which caused huge bloat. The solution was to move them into a shared JavaScript file per module.
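The growth rates can be illustrated with a back-of-the-envelope sketch. It counts sidebar entries rather than bytes, the function names and module sizes are hypothetical, and it assumes each page needs just one reference to the shared file; it is not a measurement of the actual rustdoc output.

```python
def inline_sidebar_entries(n_items):
    """Every one of the N item pages embeds a sidebar listing all N items."""
    return n_items * n_items


def shared_sidebar_entries(n_items):
    """One shared file lists the N items once; each of the N pages
    carries a single reference to that file (an assumption)."""
    return n_items + n_items


# Hypothetical module sizes, just to show how the two schemes scale.
for n in (10, 100, 1000):
    print(n, inline_sidebar_entries(n), shared_sidebar_entries(n))
```

For a module with 1000 items, the inline scheme writes a million sidebar entries while the shared scheme writes about two thousand, which is why a single large module can dominate the size of the docs.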

Was that effective? I think so. In fact, if my measurements are correct, the updated tarball should be 35 MB smaller than the original, and the uncompressed size should be halved. Note that at the time of writing, the first PR has already been deployed, and that resulted in a universal 10 MB decrease in size. I too was surprised at the numbers, as my initial goal was just to reduce the metadata by some 30%. Actually I was able to reduce the entire tarball by 30%. Great.

There are still many possible improvements to the distribution size. I welcome any suggestions, concrete proposals or implementations; I hope this post motivates anyone interested in this task.