A look at serde-json
My fastnbt
crate that deserializes NBT from Minecraft: Java Edition, has
some issues. The main ones to me are:
- Does not work with the
Read
trait. - Fails to deserialize tuples and arrays.
- The NBT array types do not work with untagged enums.
Writing this deserializer was my first real project in Rust, and it contains a
fair few 'nooby' mistakes. I want to take some time to study the serde_json
crate so that I can rewrite my crate to a better standard. Let's go!
Basic elements
There are a bunch of things to look at before we get to the meat of the
deserializer. In particular the machinery around the Read
trait, &str
and
slices that make zero-copy deserialization possible without much duplication.
from_*
There exists from_str
, from_slice
and from_reader
, all of which basically
construct a Deserializer
like so:
// from_str Deserializer::new(read::StrRead::new(s))
These are the expected 'entry points' for running the deserializer for a user.
Read
machinery
Serde-json adopts a pattern to abstract away the details of zero-copy
deserializing and deserializing from a Read
implementation. It does this
through it's own Read
trait:
pub trait Read<'de>: private::Sealed { /* ... */ }
This is a super-trait of a Sealed
trait whose only purpose is to be private and
prevent crates outside of this one from creating implementations.
The functions in the trait use Position
a lot, which is just a line and
column, presumably mostly for error reporting.
It also has a Reference
type that looks like this:
pub enum Reference<'b, 'c, T> where T: ?Sized + 'static, { Borrowed(&'b T), Copied(&'c T), }
This allows two different lifetimes for a reference. In practice the 'borrowed'
lifetime 'b
is the lifetime of the input ('de
), and the copied lifetime 'c
is that of the scratch space, which would be something like a &str
or a &[u8]
or
something implementing the standard Read
.
Looking at one of these methods specifically:
fn parse_str<'s>(&'s mut self, scratch: &'s mut Vec<u8>) -> Result<Reference<'de, 's, str>>;
This shows that the copied lifetime is shared with self
and scratch
. This
scratch space can be used to store the actual string if it has to be copied.
This seems to be the primary mechanism that allows zero-copy and deserializing
from std Read
objects. A zero-copy implementation can return a
Reference::Borrowed
with the lifetime of the input.
Looking at the implementations for std Read
and slices:
// for Read objects, R is the Read object fn parse_str<'s>(&'s mut self, scratch: &'s mut Vec<u8>) -> Result<Reference<'de, 's, str>> { R::parse_str(self, scratch) }
// for slices fn parse_str_raw<'s>( &'s mut self, scratch: &'s mut Vec<u8>, ) -> Result<Reference<'a, 's, [u8]>> { self.parse_str_bytes(scratch, false, |_, bytes| Ok(bytes)) } fn parse_str_bytes<'s, T, F>( &'s mut self, scratch: &'s mut Vec<u8>, validate: bool, result: F, ) -> Result<Reference<'a, 's, T>> where T: ?Sized + 's, F: for<'f> FnOnce(&'s Self, &'f [u8]) -> Result<&'f T>, { // Index of the first byte not yet copied into the scratch space. let mut start = self.index; loop { while self.index < self.slice.len() && !ESCAPE[self.slice[self.index] as usize] { self.index += 1; } if self.index == self.slice.len() { return error(self, ErrorCode::EofWhileParsingString); } match self.slice[self.index] { b'"' => { if scratch.is_empty() { // Fast path: return a slice of the raw JSON without any // copying. let borrowed = &self.slice[start..self.index]; self.index += 1; return result(self, borrowed).map(Reference::Borrowed); } else { scratch.extend_from_slice(&self.slice[start..self.index]); self.index += 1; return result(self, scratch).map(Reference::Copied); } } b'\\' => { // snip } _ => { // snip } } } }
The parse_str_bytes
is the bulk of the complexity here. It is iterating the
bytes of the potential str
until it finds either a control character, the
escape character \
or the closing quote of the string "
. The interesting
part to us is the quote.
The scratch space is used if the raw bytes we're operating on cannot be a str
.
This occurs when there are escaped characters. If the raw JSON contains a string
such as hello\nworld
, we want our &str
to have an actual new line character
in it, not the two separate \
and n
characters. This means we cannot treat
the raw bytes of the slice as the &str
, so we're forced to copy the data
to scratch
and return a Reference::Copied
.
If there isn't an escape character, then we can simply view the subslice as a
&str
and avoid the copy, 'the fast path'.
For the std Read
, there is the serde-json IoRead
wrapper, which is what
implements the serde-json Read
for std Read
types. The parse_str
for this
always returns a copied string in the scratch space since it cannot borrow from
the underlying data.
Visitor
I will not cover the Visitor trait from serde here. Previous articles of mine talk about this.
The Deserializer!
The actual deserializer. The impl
block for implementing the serde deserializer
trait looks like this:
impl<'de, 'a, R: Read<'de>> de::Deserializer<'de> for &'a mut Deserializer<R> {...}
A key thing to recognize here is that the trait is implemented on a mutable reference to the deserializer, not the deserializer itself.
The deserializer stores the reader and the scratch space:
pub struct Deserializer<R> { read: R, scratch: Vec<u8>, // snip }
Notice that nothing is stored about the layers of JSON that we're in. That is effectively stored in the call stack of deserializer objects that we will see later.
deserialize_any
is implemented and peeks a byte from the reader. If this
character represents a 'simple' value like a number or true
then it calls the
appropriate method on the passed in visitor.
If it's a complex value like an array or another object, then an 'access' struct
is created. For JSON arrays we create an object implementing SeqAccess
from
serde. For JSON objects we create an object implementing MapAccess
. These
store an exclusive reference to the serde-json Deserializer
, and are passed
by value to the visitor. To repeat, they store an exclusive reference to the
serde-json Deserializer
, not the serde Deserializer
trait. This
reference is what implements the deserializer trait. We are not creating a
reference to a reference here. When passing the deserializer to these access
structs we pass exactly self
, not &mut self
.
SeqAccess
SeqAccess
is a 'simple' trait in that you only have to implement one method,
next_element_seed
. This gets called for each element of the sequence we're
deserializing. It's main job is to call the deserialize
method on the element
T
that is being deserialized. It needs to eat any input that is before the
element. For JSON arrays you might have [1, 2, 3]
. The next_element_seed
will have input either directly after the [
or after a previous element. So it
needs to eat any whitespace as well as a comma if appropriate.
The call stack is larger here. If we're in an object in an array, our call-stack will contain the deserializer, a MapAccess call, and a SeqAccess call (with bits in between).
There is an important distinction between 'bounded' and 'unbounded' sequences.
When something like a Vec
is being deserialized, next_element_seed
will be
called over and over until the method itself returns a None
, indicating there
are no more elements. But when a fixed sized type like [i32;3]
or (u32,u32)
are being deserialized, the next_element_seed
is called the exact number of
times required. This is why fastnbt
currently fails for tuples and arrays.
This means that some bytes still relevant to the end of the sequence might not
have been 'eaten' by the time you finish the SeqAccess
execution. This
requires that the Deserializer that started the entire sequence deserializing
process needs to finish off parsing any bytes at the end of an array (in JSON,
this is the closing ]
).
The SeqAccess
in serde-json only peeks the closing ]
in the unbounded
sequence case, meaning it should always be 'uneaten'. This means SeqAccess
always leaves the square brackets.
MapAccess
This follows largely the same principle as SeqAccess
. It handles everything
within the braces of a JSON object. The main difference is we now have two
methods, next_key_seed
and next_value_seed
. These are just like
next_element_seed
except for the keys and values of an object respectively.
next_key_seed
must parse anything that was after the previous value i.e.
whitespace and the comma. next_value_seed
must eat anything after the key i.e.
whitespace and the colon.
Unlike SeqAccess
, it appears that next_key_seed
will be called repeatedly
until it returns None
, at least for deserializing into Rust structs and
HashMap
. In theory a type could implement Deserialize
and not do this. It is
probably wise to do any required clean-up outside of the SeqAccess
and throw
an error if there are more fields to process, similar to SeqAccess
.
If next_key_seed
returns a value (not None
) then next_value
must return a
value. This is why next_value_seed
doesn't return an option, but
next_key_seed
does. next_key_seed
determines if there is another pair. By
the time you get to the value it has to exist.
tri!()
serde-json has a macro called tri!
, it's essentially ?
without the
conversion:
// We only use our own error type; no need for From conversions provided by the // standard library's try! macro. This reduces lines of LLVM IR by 4%. macro_rules! tri { ($e:expr $(,)?) => { match $e { core::result::Result::Ok(val) => val, core::result::Result::Err(err) => return core::result::Result::Err(err), } }; }
As the comment says, this is a perf/size optimization.
parse_whitespace
This is a convenience method that eats all available whitespace, and returns the first non-whitespace character without consuming it.
Deserialize implementations
That covers a lot of the main machinery of serde-json that is relevant to me.
Looking at the various Deserialize
implementations that serde provides can
also help us understand what calls we can expect to receive for our
deserializer.
Here's some of the main ones.
Vec
This calls deserialize_seq
on our Deserializer
with a custom visitor. This
visitor implements visit_seq
, which in turn will call next_element
in a loop
until next_element
returns None
. This ultimately is calling the
next_element_seed
that SeqAccess
implemented.
A SeqAccess
can also implement a size hint. This is a (potentially wrong,
potentially not given) number of elements that are expected to be deserialized as
judged by the deserializer itself. So if you write a deserializer for a format
that tells you the length of sequences before the sequence itself, you can
provide a hint. JSON does not have this so doesn't provide a hint.
Tuples
These are implemented via a macro called tuple_impls!
. The deserialize method
calls deserialize_tuple
on our Deserializer, with a custom visitor. The custom
visitor implements visit_seq
:
fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error> where A: SeqAccess<'de>, { $( let $name = match try!(seq.next_element()) { Some(value) => value, None => return Err(Error::invalid_length($n, &self)), }; )+ Ok(($($name,)+)) }
The $(...)+
syntax expands for each element of the tuple being deserialized.
This means that for each element of the tuple we call seq.next_element
exactly
once.
Unlike the implementation for Vec
, we do not call next_element
until
None
is returned.
Arrays
These are implemented via a macro called array_impls!
, and ultimately passes
it off to the tuple implementation. So arrays and tuples are treated pretty much
identically in serde, so much so that it's actually deserialize_tuple
that is
called not deserialize_seq
like you might expect.
Slices
You cannot deserialize a slice of arbitrary T
. A slice like &[T]
needs the
sequence to be stored somewhere in memory, it has to be borrowed from
somewhere. The only viable place when deserializing would be the bytes that
are being deserialized. But for an arbitrary type T
there's no guarentee that
you could just cast that bit of memory and pretend that it is a T
. There are
many problems here such as alignment requirements of the type and endianness.
We can however do this if T
is something like u8
. Deserialize is implemented
for &[u8]
and calls deserialize_bytes
with a visitor that only implements
the visit_borrowed_bytes
and visit_borrowed_str
methods. The bytes have to
be borrowed from the input. This could be implemented for &[i8]
as well I
believe, but is not.
Structs
The serde crate doesn't actually implement Deserialize
for structs. It can't.
It doesn't know what your structs look like! This is what the Deserialize
derive macro in serde does. My first serde
article explores the
code produced by this macro.
HashMap
This calls deserialize_map
on a visitor that implements visit_map
. The visit
method calles MapAccess::next_entry
until None
is returned. next_entry
just calls next_key
followed by next_value
and returns the tuple of them
both or None
.
End
These are mostly notes to myself, but thought they might be useful to others. Cheers.