Worklog 2015-02-19

Rustlog

Long time no see. My daily job and other various distractions prevented me from doing a continuous work like blogging. Well, I finally finished Chrono 0.2 up so I have something to write down now.

In the previous post about Chrono, I discussed about a new time zone and offset handling of Chrono 0.2. This alone is a big change, but another big change in Chrono 0.2 is a new formatting and parsing API. Chrono 0.1 already had a rudimentary formatting API via format method, but it didn't have a parsing API which was another pain point besides from local date/time handling.

Chrono 0.2 has three different pieces of new APIs redesigned for formatting and parsing:

Altogether they form an advanced formatting facility in Chrono 0.2. I'll try to briefly discuss their designs and justifications.

Formatting Items

In Chrono, a formatting item is a unit of formatting or parsing. For example a strftime-like format string %Y-%m-%d has five different formatting items: %Y, -, %m, - and %d. Chrono decouples a formatting syntax from the actual meaning of formatting items, so they have the following (somewhat verbose) internal representations:

[Item::Numeric(Numeric::Year, Pad::Zero),
 Item::Literal("-"),
 Item::Numeric(Numeric::Month, Pad::Zero),
 Item::Literal("-"),
 Item::Numeric(Numeric::Day, Pad::Zero)]

This decoupling allows Chrono to support multiple formatting syntax, such as YYYY-MM-DD or Go-like 2006-01-02 instead. Also, Chrono can have "hidden" formatting items that can be used for internal purposes. RFC 2822 and 3339 support is implemented in this way.

The formatting item is a good abstraction, but every abstraction comes with a complexity. In the case of Chrono the complexity arises from the desire to avoid allocation. The number of formatting items is proportional to the length of format string in the worst case, so we cannot blindly collect items into a collection. Instead, Chrono returns an Iterator of formatting items and directly consumes that iterator for printing the date and time. Therefore the following identity holds:

assert_eq!(StrftimeItems::new("%Y-%m-%d").collect::<Vec<_>>(),
           [Item::Numeric(Numeric::Year, Pad::Zero),
            Item::Literal("-"),
            Item::Numeric(Numeric::Month, Pad::Zero),
            Item::Literal("-"),
            Item::Numeric(Numeric::Day, Pad::Zero)]);

I believe this redesign has a maximal flexibility while retaining an ability to avoid std at all.

Formatting with Items

Formatting with items is done with a new format_with_items method. The original format method is now a thin wrapper over that.

This part of Chrono remains relatively unchanged, but there are some notable small changes which deserve the explanation:

Parsing with Items

Okay, this is a fun part. Basically Chrono 0.2 has the following parsing algorithm:

  1. A fixed string parses as is: It has to appear in the input string as is.
  2. A sequence of one or more whitespace consumes zero or more whitespace.
  3. Most numeric items (Numeric::*) have a predefined parsing width, the maximal number of digits that can be consumed. They consume zero or more whitespace followed by one or more but limited number of digits. The exception is made to Numeric::Year and Numeric::IsoYear, which may accept an arbitrary number of digits when preceded by a sign.
  4. Fixed items (Fixed::*) have their own parsing logics, but normally does not consume preceding whitespace. They also normally ignore cases.
  5. At the end of the formatting items, the whole input string should have been consumed.

This is modelled after strptime's parsing algorithm, which seems to handle lots of corner cases. This allows a lax input like 2014-2-6 for a format string %Y-%m-%d. The fixed parsing width allows a format string like %Y%m%d which would accept strings like 20140206.

Parsing is daunting work, implemented with two rather big modules (chrono::format::{parse,scan}), but it's actually an easy part. Parsing only yields different date/time parts, which has to be merged into actual values via the resolution algorithm.

As an easy example, consider RFC 2822. RFC 2822 date and time format has a day of week part, which should be consistent to other date parts when specified. But strptime-based parse would happily accept inconsistent input:

>>> import time
>>> time.strptime('Wed, 31 Dec 2014 04:26:40 +0000',
                  '%a, %d %b %Y %H:%M:%S +0000')
time.struct_time(tm_year=2014, tm_mon=12, tm_mday=31,
                 tm_hour=4, tm_min=26, tm_sec=40,
                 tm_wday=2, tm_yday=365, tm_isdst=-1)
>>> time.strptime('Thu, 31 Dec 2014 04:26:40 +0000',
                  '%a, %d %b %Y %H:%M:%S +0000')
time.struct_time(tm_year=2014, tm_mon=12, tm_mday=31,
                 tm_hour=4, tm_min=26, tm_sec=40,
                 tm_wday=3, tm_yday=365, tm_isdst=-1)

Resolving date/time parts is littered with lots of corner cases, and that's why common date/time parsers do not correctly implement it; as far as I know, glibc, Python and JodaTime completely ignores the resolution. Therefore I'm glad to announce that Chrono 0.2 has a complete and known-to-be-correct resolution algorithm.

Chrono has a dedicated date/time part storage called Parsed. The resolution algorithm is hard to describe as is, but the relevant source code is relatively well-commented and worth reading if you are interested in the algorithm.

Chrono 0.3?

Maybe this might be an early assumption, but Chrono 0.2 is intended to be a stable base of the future API evolution. The very main design decision was already in 0.1, but 0.2 completes and fixes the biggest problem with 0.1.

There are some issues I've thought about 0.3. Some of them will be definitely implemented in 0.3 (e.g. tzfile support), some others are somewhat illusive (e.g. additional format syntax). But these issues do not replace the users' feedback. Personally I'd like to thank /u/savage884, who brought the problem of local date handling in the /r/rust post. This post crucially helped me finishing the work and releasing 0.2. I wish others give a feedback for Chrono like that, so Chrono doesn't remain an library "considered annoying but designed so", which situation I really don't like.