Rustlog : Worklog 2014-12-06

I was continously working on #15309. Basically, Rustdoc has a link to the source code and some items have gotten their links incorrect. This is proved quite hard to solve, as I've outlined the cause, and I'm still figuring out how to solve that.

Wonderful world of metadata

It is a relatively hidden piece of the fact in Rust, but Rust compiler utilizes a metadata to track almost every information across multiple crates.

In fact, it is not hard to see the metadata's very existence. The metadata occupies a custom section in the excutable files or a self-explanatory file name rust.metadata.bin in rlibs. Any standard utilities on those files can be used to see them:

$ readelf /usr/local/rust/lib/libstd-4e7c5e5c.so -S | grep rustc -A1
  [25] .note.rustc       NOTE             00000000002ebaf0  000ebaf0
       00000000001da82a  0000000000000000  WA       0     0     16

Note that the metadata can be very huge. In this case, the section size is 0x1da82a bytes, i.e. about 1.9 megs. How much is that? It's over 60% of the entire executable if you ask that.

$ ls -al /usr/local/rust/lib/libstd-4e7c5e5c.so
-rw-r--r-- 1 arachneng 3197472 Dec  5 22:00 /usr/local/rust/lib/libstd-4e7c5e5c.so

Typical C/C++ compilers work on the simplitistic assumption, namely, they can mangle names to avoid any problem. So when you have a function named f with no argument and no return type, its name is (say) mangled into _Z1fv in some compilers and ?f@@YAXXZ in other compilers, so that they cannot be linked against each other. Some compilers (notably G++) explicitly require that there is some dummy symbol like __gxx_personality_v0 and otherwise the link fails.

This name mangling is actually a good practice. Different compilers commonly have different ABIs (Application Binary Interfaces), so mixing different ABIs in the same executable should be avoided. Many ABIs actually seem to be compatible but they often differ in details.

The problem is that, the name mangling gives too little information. If you have a linker error, well, you have a variety of options to try. You may have tried to link from G++ to VC++. You may have your function name incorrect. You may have... an extra namespace. (Namespaces surely affect the name mangling.) You may have some other type incorrect, on which your function depends. Perhaps, you shouldn't link to that function at all, since it's a template.

The metadata is a good complement to the name mangling. It has every type information to the public items, and (in the case of Rust) every trait implementation available. If the item has to be inlined (or generic), the contents of the item is also available to the metadata so that the compiler can inline its definition out of the original crate.

And at the expense of this improvement, the metadata structure is damn complex and the client code can go wrong.

Identifying Definitions

It seems that the metadata is nothing to do with the documentation. Unfortunately, #15309 is an example that the metadata is indeed important: it is hard to identify definitions across multiple crates.

Consider one example. The trait UnicodeChar is actually available in three places:

unicode::u_char::UnicodeChar (private)
unicode::char::UnicodeChar (reexported, public)
std::char::UnicodeChar (reexported again, public)

Since Rustdoc strips the private items, the primary documentation is available as unicode::char::UnicodeChar. Now, how does Rustdoc know of this reexport when generating the docs for std?

Incredibly the answer is no―Rustdoc doesn't know about the reexports! What it actually does is to simply link to the original definition and hope it correctly redirects to the inlined reexports. Sounds risky, huh? But this is what Rustdoc actually handles reexports. When Rustdoc has enough information to generate redirect pages(*), it will do; Otherwise it will use those redirect pages.

(*) I should mention that the metadata obviously contains the reexports. Still, Rustdoc only sees the original items precisely because the metadata decoder doesn't directly show the links from the reexport to the item. And the original items contain, alas, a path to the original items and not reexports.

Links to the source code pose another problem. The compiled crate itself doesn't have the source code, quite reasonably. In this matter, Rustdoc again relies on the prior incarnation of itself to generate suitable redirects. Now say that we have implemented #12932 and we have multiple source links in the single documentation page. How would you do identify the correct link to redirect?

The original code was using a unique identifier to each definition, generated shortly before the analysis phase ("phase 3"). This is called a NodeId (or its cross-crate version, DefId) and Rustdoc-generated pages will check if the gotosrc parameter contains a matching NodeId to redirect. This should have been correct unless the following were correct:

NodeIds change when you change the source code.

Well, this is primarily because NodeIds are used to identify a part of abstract syntax tree (AST). This is also why one cannot (normally) link to a recompiled crate (yet); the stable ABI would be awesome but it's a huge undertaking.

It seemed that #15309 only appears in the alloc crate. Ultimately this was because, rustc sees --cfg jemalloc but not rustdoc doesn't. This small option mostly affects the alloc crate, causing DefIds change.

Personally I learned lots about the metadata over the course of debugging, but only after I gave up and tried to make a custom metadata decoder in Python. Then I realized that the metadata in question had a skewed DefId and instantly looked at the Makefile. Dang. I really hope to see a proper analysis tool (#6912) for the metadata.

Possible Solutions

It's indeed hard to identify definitions across multiple crates! We don't have a definite solution, and can only outline general directions:

Rustdoc should not identify definitions via a DefId.
Rustdoc should be able to generate documentations from the compiled crate. (This is what I'm currently working on.)
In general, the metadata should have a good notion to uniquely identify the definition independently of the source code change.

Phew! This post went too far than I first imagined. Hope you enjoyed this.