Owen Gage: writing

A look at serde-json

2022-07-22

My fastnbt crate, which deserializes NBT from Minecraft: Java Edition, has some issues. The main ones to me are:

  1. Does not work with the Read trait.
  2. Fails to deserialize tuples and arrays.
  3. The NBT array types do not work with untagged enums.

Writing this deserializer was my first real project in Rust, and it contains a fair few 'nooby' mistakes. I want to take some time to study the serde_json crate so that I can rewrite my crate to a better standard. Let's go!

Basic elements

There are a bunch of things to look at before we get to the meat of the deserializer. In particular, the machinery around the Read trait, &str and slices that makes zero-copy deserialization possible without much duplication.

from_*

There exist from_str, from_slice and from_reader, all of which basically construct a Deserializer like so:

// from_str
Deserializer::new(read::StrRead::new(s))

These are the expected 'entry points' that a user calls to run the deserializer.

Read machinery

Serde-json adopts a pattern to abstract away the details of zero-copy deserialization and deserializing from a Read implementation. It does this through its own Read trait:

pub trait Read<'de>: private::Sealed { /* ... */ }

Read has the Sealed trait as a super-trait. Sealed's only purpose is to be private, preventing crates outside of this one from creating implementations of Read.
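The pattern can be sketched in isolation (the module layout and types here are illustrative, not serde-json's exact code):

```rust
mod read {
    // `Sealed` lives in a private module, so code outside this crate
    // cannot name it, and therefore cannot implement it.
    mod private {
        pub trait Sealed {}
    }

    // `Read` lists `Sealed` as a super-trait: implementing `Read`
    // requires implementing `Sealed`, which only this crate can do.
    pub trait Read: private::Sealed {
        fn peek(&self) -> Option<u8>;
    }

    pub struct SliceRead<'a> {
        pub slice: &'a [u8],
    }

    impl<'a> private::Sealed for SliceRead<'a> {}

    impl<'a> Read for SliceRead<'a> {
        fn peek(&self) -> Option<u8> {
            self.slice.first().copied()
        }
    }
}
```

From outside the crate, SliceRead is usable through Read, but no new Read implementations can be written, since Sealed can't be named.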

The functions in the trait use Position a lot, which is just a line and column, presumably mostly for error reporting.

It also has a Reference type that looks like this:

pub enum Reference<'b, 'c, T>
where
    T: ?Sized + 'static,
{
    Borrowed(&'b T),
    Copied(&'c T),
}

This allows two different lifetimes for a reference. In practice the 'borrowed' lifetime 'b is the lifetime of the input ('de), and the 'copied' lifetime 'c is that of the scratch space. The input itself would be something like a &str, a &[u8], or something implementing the standard Read.

Looking at one of these methods specifically:

fn parse_str<'s>(&'s mut self, scratch: &'s mut Vec<u8>) -> 
    Result<Reference<'de, 's, str>>;

This shows that the copied lifetime is shared with self and scratch. The scratch space can be used to store the actual string if it has to be copied. This is the primary mechanism that allows both zero-copy deserialization and deserializing from std Read objects: a zero-copy implementation can return a Reference::Borrowed with the lifetime of the input.
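A trimmed-down sketch of how the two lifetimes interact (the Reference enum mirrors serde-json's; the parse function is invented for illustration and only handles a single escape):

```rust
pub enum Reference<'b, 'c, T: ?Sized + 'static> {
    Borrowed(&'b T),
    Copied(&'c T),
}

// 'de is the input's lifetime; 's is the scratch space's (and &mut
// self's in the real trait). Untouched input is borrowed directly with
// the long 'de lifetime; transformed input is copied into scratch and
// borrowed from there with the shorter 's lifetime.
fn parse<'de, 's>(input: &'de str, scratch: &'s mut String) -> Reference<'de, 's, str> {
    if input.contains('\\') {
        scratch.push_str(&input.replace("\\n", "\n"));
        Reference::Copied(scratch.as_str())
    } else {
        Reference::Borrowed(input)
    }
}
```

The caller can hold a Borrowed result for as long as the input lives, but a Copied result only as long as the scratch buffer.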

Looking at the implementations for std Read and slices:

// for Read objects, R is the Read object
fn parse_str<'s>(&'s mut self, scratch: &'s mut Vec<u8>) 
    -> Result<Reference<'de, 's, str>> {
    R::parse_str(self, scratch)
}
// for slices
fn parse_str_raw<'s>(
    &'s mut self,
    scratch: &'s mut Vec<u8>,
) -> Result<Reference<'a, 's, [u8]>> {
    self.parse_str_bytes(scratch, false, |_, bytes| Ok(bytes))
}

fn parse_str_bytes<'s, T, F>(
    &'s mut self,
    scratch: &'s mut Vec<u8>,
    validate: bool,
    result: F,
) -> Result<Reference<'a, 's, T>>
where
    T: ?Sized + 's,
    F: for<'f> FnOnce(&'s Self, &'f [u8]) -> Result<&'f T>,
{
    // Index of the first byte not yet copied into the scratch space.
    let mut start = self.index;

    loop {
        while self.index < self.slice.len()
            && !ESCAPE[self.slice[self.index] as usize] {
            self.index += 1;
        }
        if self.index == self.slice.len() {
            return error(self, ErrorCode::EofWhileParsingString);
        }
        match self.slice[self.index] {
            b'"' => {
                if scratch.is_empty() {
                    // Fast path: return a slice of the raw JSON without any
                    // copying.
                    let borrowed = &self.slice[start..self.index];
                    self.index += 1;
                    return result(self, borrowed).map(Reference::Borrowed);
                } else {
                    scratch.extend_from_slice(&self.slice[start..self.index]);
                    self.index += 1;
                    return result(self, scratch).map(Reference::Copied);
                }
            }
            b'\\' => {
                // snip
            }
            _ => {
                // snip
            }
        }
    }
}

parse_str_bytes is the bulk of the complexity here. It iterates over the bytes of the potential str until it finds either a control character, the escape character \, or the closing quote of the string ". The interesting part to us is the closing quote.

The scratch space is used if the raw bytes we're operating on cannot be used directly as the str. This occurs when there are escape sequences. If the raw JSON contains a string such as hello\nworld, we want our &str to contain an actual newline character, not the two separate \ and n characters. This means we cannot treat the raw bytes of the slice as the &str, so we're forced to copy the data into scratch and return a Reference::Copied.

If there isn't an escape character, then we can simply view the subslice as a &str and avoid the copy, 'the fast path'.
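That decision can be reduced to a toy function (invented for illustration; the real code handles the full JSON escape set and control characters):

```rust
// Scan for `\`. If there are no escapes, report that the fast path
// applies and leave scratch untouched; otherwise copy bytes into
// scratch with the escapes resolved. Returns true on the fast path.
fn unescape_into(input: &[u8], scratch: &mut Vec<u8>) -> bool {
    if !input.contains(&b'\\') {
        return true; // fast path: the caller can use `input` as-is
    }
    let mut i = 0;
    while i < input.len() {
        if input[i] == b'\\' && i + 1 < input.len() {
            match input[i + 1] {
                b'n' => scratch.push(b'\n'),
                b't' => scratch.push(b'\t'),
                other => scratch.push(other),
            }
            i += 2;
        } else {
            scratch.push(input[i]);
            i += 1;
        }
    }
    false
}
```

Returning true here corresponds to Reference::Borrowed (view the input subslice directly); false corresponds to Reference::Copied (use scratch).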

For std Read there is serde-json's IoRead wrapper, which is what implements serde-json's Read trait for std Read types. Its parse_str always returns a copied string in the scratch space, since it cannot borrow from the underlying data.

Visitor

I will not cover the Visitor trait from serde here. Previous articles of mine talk about this.

The Deserializer!

The actual deserializer. The impl block for implementing the serde deserializer trait looks like this:

impl<'de, 'a, R: Read<'de>> de::Deserializer<'de> for &'a mut Deserializer<R> {...}

A key thing to recognize here is that the trait is implemented on a mutable reference to the deserializer, not the deserializer itself.

The deserializer stores the reader and the scratch space:

pub struct Deserializer<R> {
    read: R,
    scratch: Vec<u8>,
    // snip
}

Notice that nothing is stored about the layers of JSON that we're in. That is effectively stored in the call stack of deserializer and access objects that we will see later.

deserialize_any is implemented and peeks a byte from the reader. If this byte represents a 'simple' value like a number or true then it calls the appropriate method on the passed-in visitor.

If it's a complex value like an array or another object, then an 'access' struct is created. For JSON arrays we create an object implementing serde's SeqAccess trait; for JSON objects, one implementing MapAccess. These store an exclusive reference to the serde-json Deserializer and are passed by value to the visitor. Importantly, that exclusive reference is itself what implements the serde Deserializer trait, so we are not creating a reference to a reference: when passing the deserializer to these access structs we pass exactly self, not &mut self.
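The borrowing shape (with invented names and a stripped-down deserializer) is something like:

```rust
struct Deserializer {
    input: Vec<u8>,
    index: usize,
}

// The access struct borrows the deserializer exclusively. Since the
// Deserializer trait is implemented on `&mut Deserializer`, the `self`
// inside a trait method is already a `&mut`, and is moved in as-is:
// taking `&mut self` there would create a `&mut &mut`.
struct SeqAccess<'a> {
    de: &'a mut Deserializer,
}

impl Deserializer {
    fn seq_access(&mut self) -> SeqAccess<'_> {
        SeqAccess { de: self }
    }
}
```

This is only the shape; serde-json's real access types carry a little more than the bare reference.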

SeqAccess

SeqAccess is a 'simple' trait in that you only have to implement one method, next_element_seed. This gets called for each element of the sequence we're deserializing. Its main job is to call the deserialize method on the element T that is being deserialized. It also needs to eat any input that comes before the element. For JSON arrays you might have [1, 2, 3]. next_element_seed will see input either directly after the [ or after a previous element, so it needs to eat any whitespace as well as a comma if appropriate.

The call stack is deeper here. If we're in an object in an array, our call stack will contain the deserializer, a MapAccess call, and a SeqAccess call (with bits in between).

There is an important distinction between 'bounded' and 'unbounded' sequences. When something like a Vec is being deserialized, next_element_seed will be called over and over until it returns None, indicating there are no more elements. But when a fixed-size type like [i32; 3] or (u32, u32) is being deserialized, next_element_seed is called exactly the number of times required. This is why fastnbt currently fails for tuples and arrays.

This means that some bytes belonging to the end of the sequence might not have been 'eaten' by the time the SeqAccess finishes. The Deserializer that started the whole sequence therefore needs to finish off parsing any bytes at the end of the array (in JSON, the closing ]).

The SeqAccess in serde-json only ever peeks the closing ], even in the unbounded case, so the brackets are always left 'uneaten' for the deserializer itself to consume.
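The bounded/unbounded difference can be mimicked with a plain closure standing in for next_element_seed (no serde involved, purely illustrative):

```rust
// Stand-in for next_element_seed: yields elements until the sequence
// ends. An unbounded consumer (Vec) loops until it sees None.
fn collect_unbounded(mut next: impl FnMut() -> Option<i32>) -> Vec<i32> {
    let mut out = Vec::new();
    while let Some(v) = next() {
        out.push(v);
    }
    out
}

// Mirrors serde's tuple impl: exactly two calls, and a missing element
// is an error (None here, invalid_length in serde).
fn collect_pair(mut next: impl FnMut() -> Option<i32>) -> Option<(i32, i32)> {
    Some((next()?, next()?))
}
```

Note that collect_pair never observes the end of the sequence: after it returns, whatever marks the end (the ] in JSON) is still unconsumed, and dealing with it falls to the deserializer.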

MapAccess

This follows largely the same principle as SeqAccess. It handles everything within the braces of a JSON object. The main difference is that we now have two methods, next_key_seed and next_value_seed. These are just like next_element_seed, except for the keys and values of an object respectively. next_key_seed must parse anything that comes after the previous value, i.e. whitespace and the comma. next_value_seed must eat anything after the key, i.e. whitespace and the colon.

Unlike SeqAccess, it appears that next_key_seed will be called repeatedly until it returns None, at least when deserializing into Rust structs and HashMap. In theory a type could implement Deserialize and not do this. It is probably wise to do any required clean-up outside of the MapAccess, and to throw an error if there are more fields to process, similar to SeqAccess.

If next_key_seed returns a value (not None) then next_value_seed must return a value. This is why next_value_seed doesn't return an Option but next_key_seed does: next_key_seed determines whether there is another pair, so by the time you get to the value it has to exist.
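A stand-in (invented types, no serde involved) makes the asymmetry visible:

```rust
// next_key is the probe: it returns None when the entries run out.
// next_value can then assume the current entry exists, so it returns
// the value directly rather than an Option.
struct Entries {
    pairs: Vec<(String, i32)>,
    index: usize,
}

impl Entries {
    fn next_key(&self) -> Option<&str> {
        self.pairs.get(self.index).map(|(k, _)| k.as_str())
    }

    fn next_value(&mut self) -> i32 {
        let v = self.pairs[self.index].1;
        self.index += 1;
        v
    }
}
```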

tri!()

serde-json has a macro called tri!; it's essentially ? without the conversion:

// We only use our own error type; no need for From conversions provided by the
// standard library's try! macro. This reduces lines of LLVM IR by 4%.
macro_rules! tri {
    ($e:expr $(,)?) => {
        match $e {
            core::result::Result::Ok(val) => val,
            core::result::Result::Err(err) => return 
                core::result::Result::Err(err),
        }
    };
}

As the comment says, this is a perf/size optimization.
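A quick demonstration of it in use (the macro body matches the one above; the function around it is made up):

```rust
// Same expansion as serde-json's tri!: unwrap Ok, early-return Err
// without calling From::from on the error.
macro_rules! tri {
    ($e:expr $(,)?) => {
        match $e {
            Ok(val) => val,
            Err(err) => return Err(err),
        }
    };
}

fn double_even(n: i32) -> Result<i32, String> {
    let even = tri!(if n % 2 == 0 { Ok(n) } else { Err(String::from("odd")) });
    Ok(even * 2)
}
```

Here ? would behave the same, since no error conversion is needed anyway; tri! just guarantees the From::from call is never emitted.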

parse_whitespace

This is a convenience method that eats all available whitespace, and returns the first non-whitespace character without consuming it.
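In sketch form, over a byte slice with a cursor (names invented, not serde-json's implementation):

```rust
// A cursor over a byte slice: parse_whitespace advances past any
// whitespace and returns the first non-whitespace byte without
// consuming it, or None at end of input.
struct Cursor<'a> {
    slice: &'a [u8],
    index: usize,
}

impl<'a> Cursor<'a> {
    fn parse_whitespace(&mut self) -> Option<u8> {
        while let Some(&b) = self.slice.get(self.index) {
            match b {
                b' ' | b'\t' | b'\n' | b'\r' => self.index += 1,
                _ => return Some(b),
            }
        }
        None
    }
}
```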

Deserialize implementations

That covers a lot of the main machinery of serde-json that is relevant to me. Looking at the various Deserialize implementations that serde provides can also help us understand what calls we can expect to receive for our deserializer.

Here's some of the main ones.

Vec

This calls deserialize_seq on our Deserializer with a custom visitor. This visitor implements visit_seq, which in turn calls next_element in a loop until next_element returns None. This is ultimately calling the next_element_seed that SeqAccess implements.

A SeqAccess can also implement a size hint. This is the number of elements the deserializer itself expects to deserialize; it is optional, and may be wrong. So if you write a deserializer for a format that states the length of a sequence before the sequence itself, you can provide a hint. JSON does not have this, so serde-json doesn't provide one.

Tuples

These are implemented via a macro called tuple_impls!. The deserialize method calls deserialize_tuple on our Deserializer, with a custom visitor. The custom visitor implements visit_seq:

fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
where
    A: SeqAccess<'de>,
{
    $(
        let $name = match try!(seq.next_element()) {
            Some(value) => value,
            None => return Err(Error::invalid_length($n, &self)),
        };
    )+

    Ok(($($name,)+))
}

The $(...)+ syntax expands for each element of the tuple being deserialized. This means that for each element of the tuple we call seq.next_element exactly once.

Unlike the implementation for Vec, we do not call next_element until None is returned.

Arrays

These are implemented via a macro called array_impls!, which ultimately hands off to the tuple implementation. Arrays and tuples are treated pretty much identically in serde, so much so that it's actually deserialize_tuple that is called, not deserialize_seq like you might expect.

Slices

You cannot deserialize a slice of arbitrary T. A slice like &[T] needs the sequence to already be stored somewhere in memory; it has to be borrowed from somewhere. The only viable place when deserializing is the input bytes themselves. But for an arbitrary type T there's no guarantee that you could just cast that bit of memory and pretend it is a T. There are many problems here, such as the alignment requirements of the type and endianness.

We can however do this if T is something like u8. Deserialize is implemented for &[u8] and calls deserialize_bytes with a visitor that only implements the visit_borrowed_bytes and visit_borrowed_str methods. The bytes have to be borrowed from the input. This could be implemented for &[i8] as well I believe, but is not.

Structs

The serde crate doesn't actually implement Deserialize for structs. It can't. It doesn't know what your structs look like! This is what the Deserialize derive macro in serde does. My first serde article explores the code produced by this macro.

HashMap

This calls deserialize_map on a visitor that implements visit_map. The visit method calls MapAccess::next_entry until None is returned. next_entry just calls next_key followed by next_value, returning both as a tuple, or None.

End

These are mostly notes to myself, but thought they might be useful to others. Cheers.