OwenGage.com: writing

Exploring serde's data model with a toy deserializer

2021-08-14

Previously I did a shallow dive into understanding serde by expanding the Deserialize macro. This time I'll go deeper and map how each type in serde's data model is treated.

This is written for people intending to write their own deserializer for serde, but might be interesting for more people. Later in the article we will implement a toy deserializer for some made up format.

There are three major categories of types to consider:

  1. Sequences of values (think Vec, slice, tuple)
  2. Maps from keys to values (think HashMap, struct)
  3. Everything else

The 'Everything else' turns out to be the easiest, so lets start there.

Table of everything else

The basic steps followed between you and serde when writing a deserializer is:

  1. Deserialize::deserialize is called on some type.
  2. This calls a deserialize_* method on your deserializer, passing you a Vistitor.
  3. You call appropriate visit_* methods for that call.

For example, calling deserialize on a String calls deserialize_string on your deserializer, you deserialize a string from your input and call visit_string on the visitor you were passed.

When writing a deserializer I wish I knew

  • what real Rust types would end up calling my deserialize_* methods.
  • which visit_* methods my deserializer should call in response.

Before answering these questions, lets look at the Visitor trait. This trait has default implementations of each method. Some return an error, and others forward the call to another visitor method.

This will be important later, since it means our deserializer can call a larger set of methods than you would naïvely expect.

Here's a summary of all the forwarded methods:

Visitor methodForwarded to
visit_i8visit_i64
visit_i16
visit_i32
visit_u8visit_u64
visit_u16
visit_u32
visit_f32visit_f64
visit_charvisit_str
visit_borrowed_strvisit_str
visit_stringvisit_str
visit_borrowed_bytesvisit_bytes
visit_byte_bufvisit_bytes

(For completeness, the methods that do not forward are visit_bool, visit_i64, visit_u64, visit_f64, visit_str, visit_bytes, visit_none, visit_some, visit_unit, visit_newtype_struct, visit_seq, visit_map, and visit_enum.)

Okay, back to who calls and expects what.

This table is mostly put together from impls.rs in serde. Each row has a Rust type that is being deserialized, the deserialize call that will be called on the deserializer, and finally the acceptable method calls on the visitor passed to the deserializer.

type being deserializedDeserializer method calledExpected Visitor method calls
struct field identifierdeserialize_identifervisit_str, visit_bytes, but also visit_u64 where the number is field number. (forwarded from: visit_char,visit_borrowed_str, visit_string,visit_borrowed_bytes, visit_byte_buf, visit_{u8,16,32})
booldeserialize_boolvisit_bool
()deserialize_unitvisit_unit
i8deserialize_i8
  • visit_i{8,16,32,64}
  • visit_u{8,16,32,64}
with fallible conversions where the value is out of range.
i16deserialize_i16
i32deserialize_i32
i64deserialize_i64
isizedeserialize_i64
u8deserialize_u8
u16deserialize_u16
u32deserialize_u32
u64deserialize_u64
usizedeserialize_u64
f32deserialize_f32
  • visit_f64
  • visit_f32
  • visit_i{8,16,32,64}
  • visit_u{8,16,32,64}
with infallible conversions.
f64deserialize_f64
chardeserialize_charvisit_char, visit_str erroring if string length not exactly 1. (forwarded from: visit_borrowed_str, visit_string)
Stringdeserialize_string
  • visit_str
  • visit_string
  • visit_bytes, erroring if not unicode
  • visit_byte_buf, erroring if not unicode, stealing Vec<u8>'s allocation
(forwarded from: visit_char,visit_borrowed_str,visit_borrowed_bytes)
&'a strdeserialize_str
  • visit_borrowed_str
  • visit_borrowed_bytes, erroring if not unicode
&'a [u8]deserialize_bytes
  • visit_borrowed_str
  • visit_borrowed_bytes
Option<T>deserialize_optionvisit_some, visit_none, visit_unit
PhantomData<T>deserialize_unit_structvisit_unit
Vec<T>deserialize_seqvisit_seq
[T; N]deserialize_tuple (!)
tuples eg (T,U)deserialize_tuple
BTreeMap<K, V>deserialize_mapvisit_map
HashMap<K, V>

Most of this is unsurprising. Some things that sticks out to me are:

  • an array causes deserialize_tuple to be called. I guess it is similar to tuples in that arrays and tuples are fixed sizes, unlike a Vec.
  • conversions to floating point use the as keyword to convert from integer types. The reference shows that this will convert to the nearest float, and that overflow to infinity can only happen for u128 to f321. So this conversion seems sensible.
  • you cannot visit_bytes for deserializing a char like you would for a String.

Sequences

Sequences are the next level up in difficulty for our category of types.

When a type like a Vec<T> is deserialized, it calls deserialize_seq on the Vec, passing in a visitor with visit_seq implemented.

Here's the actual implementation in impls.rs:

impl<'de, T> Deserialize<'de> for Vec<T>
where
    T: Deserialize<'de>,
{
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct VecVisitor<T> {
            marker: PhantomData<T>,
        }

        impl<'de, T> Visitor<'de> for VecVisitor<T>
        where
            T: Deserialize<'de>,
        {
            type Value = Vec<T>;

            fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
                formatter.write_str("a sequence")
            }

            fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
            where
                A: SeqAccess<'de>,
            {
                let mut values = Vec::with_capacity(size_hint::cautious(seq.size_hint()));

                while let Some(value) = try!(seq.next_element()) { // ❷
                    values.push(value);
                }

                Ok(values)
            }
        }

        let visitor = VecVisitor {
            marker: PhantomData,
        };
        deserializer.deserialize_seq(visitor) // ❶
    }

    // ... snipped out deserialize_in_place
}

The steps here are:

  1. deserialize_seq gets called with a visitor. ❶
  2. The deserializer sets up an object implementing SeqAccess, and calls visit_seq with it.
  3. The visitor repeatedly calls next_element on the SeqAccess object until None is returned (or it errors). ❷

This is much like with the Iterator trait, but fallible. Some other stuff happens, like the visitor asking for a hint to the size for efficiency, but a SeqAccess object doesn't have to provide this.

Let's look at the SeqAccess trait:

pub trait SeqAccess<'de> {
    type Error: Error;

    fn next_element_seed<T>(&mut self, seed: T) -> Result<Option<T::Value>, Self::Error>
    where
        T: DeserializeSeed<'de>;

    fn next_element<T>(&mut self) -> Result<Option<T>, Self::Error>
    where
        T: Deserialize<'de>,
    {
        self.next_element_seed(PhantomData)
    }

    fn size_hint(&self) -> Option<usize> {
        None
    }
}

The only method without a default implementation is next_element_seed. If you're writing a deserializer, this is what you need to implement. When it is called it is expected that you deserialize a T from your input and return it, or None if it's the end of the sequence.

In practice this usually means you call T::deserialize repeatedly (recursively depending on other bits of your deserializer). You just have to correctly figure out when there are no more elements. This might be easy or hard depending on the data format you're deserializing.

Example data format

Let's invent a very trivial data format to demonstrate sequences. Our format is simply going to be a list of integers, where each integer is 3 bytes, with a single byte at the front to tell us the length of the sequence. That's it.

1 byte3 bytes...
length of intsbig-endian integerrepeat

We're going to make the following work:

#[test]
fn three_byte_format() {
    let data = [
        3, // single byte for length
        0, 0, 1, // first 3 byte int
        0, 0, 2,
        0, 0, 3];

    let mut deserializer = Deserializer::from_bytes(&data);
    let res = Vec::<i32>::deserialize(&mut deserializer).unwrap();
    assert_eq!(&[1, 2, 3], res.as_slice());
}

Let's get the boilerplate our of the way. We need to set up our own custom error, which is just going to be a wrapped String. We aren't valuing good errors in our toy format, sorry:

use std::io::Read;

use serde::de::{Error as _, SeqAccess, Visitor};

#[derive(Debug)]
struct Error(String);

impl std::fmt::Display for Error {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.write_str(&self.0)
    }
}
impl std::error::Error for Error {}
impl serde::de::Error for Error {
    fn custom<T>(msg: T) -> Self
    where
        T: std::fmt::Display,
    {
        Error(msg.to_string())
    }
}

struct Deserializer<'de> {
    input: &'de [u8],
}

impl<'de> Deserializer<'de> {
    pub fn from_bytes(input: &'de [u8]) -> Self {
        Self { input }
    }
}

impl<'de, 'a> serde::de::Deserializer<'de> for &'a mut Deserializer<'de> {
    serde::forward_to_deserialize_any! {
        bool i8 i16 i32 i64 u8 u16 u32 u64 f32 f64 char str string
        byte_buf option unit unit_struct newtype_struct tuple tuple_struct
        seq map struct enum identifier ignored_any bytes
    }

    type Error = Error;

    fn deserialize_any<V>(self, visitor: V) -> Result<V::Value, Self::Error>
    where
        V: serde::de::Visitor<'de>,
    {
        use serde::de::Error;
        Err(Self::Error::custom("expected sequence"))
    }
}

This is pretty much the minimum required to make things compile. There's a few things to note here:

  • serde has it's own Error trait we need to implement that requires a custom method. This lets serde report deserialization errors, eg 'exepected seq'.
  • Our deserializer just takes some bytes. Most real deserializers allow you to provide anything that implements Read. The general approach is the same.
  • We use forward_to_deserialize_any! to implement all of the required methods. Our 'any' just errors.
  • We're implementing Deserializer for &'a mut Deserializer, not the value itself. This is crucial later.

First thing is to actually implement deserialize_seq since that is what will be called when deserializing a Vec<T>. We add the below implementation and remove seq from forward_to_deserialize_any!.

fn deserialize_seq<V>(mut self, visitor: V) -> Result<V::Value, Self::Error>
where
    V: Visitor<'de>,
{
    let mut length = [0u8; 1];
    self.input
        .read_exact(&mut length)
        .map_err(|_| Error::custom("read error"))?;

    let length = length[0] as usize;

    visitor.visit_seq(OurSeqAccess {
        inner: &mut self,
        remaining_length: length,
    })
}

The main thing we need to do here is get the length of the sequence that's coming, and get the input to the point where the next thing to be deserialized is the first element. In our format this is one in the same. For some formats the length won't exist ahead of time at all (like an array in JSON). As long as we can find the end it's fine (so ] in JSON).

We create a new type that implements SeqAccess that we're calling OurSeqAccess. This stores the length and the deserializer. Here's the implementation:

struct OurSeqAccess<'a, 'de> {
    inner: &'a mut Deserializer<'de>,
    remaining_length: usize,
}

impl<'a, 'de> SeqAccess<'de> for OurSeqAccess<'a, 'de> {
    type Error = Error;

    fn next_element_seed<T>(&mut self, seed: T) -> Result<Option<T::Value>, Self::Error>
    where
        T: serde::de::DeserializeSeed<'de>,
    {
        if self.remaining_length > 0 {
            self.remaining_length -= 1;
            let el = seed.deserialize(&mut *self.inner)?;
            Ok(Some(el))
        } else {
            Ok(None)
        }
    }
}

Some important details

We need the second lifetime 'a for the lifetime of the deserializer itself. The 'de lifetime is that of the input data.

This code is why implementing Deserializer on the mutable reference is crucial. If we implemented it on our deserializer directly, we would have to pass a clone to seed.deserialize. This would be bad, because when we consume input, it wouldn't be visible outside of the current function call (the slice in the original deserializer would not change). By implementing on the &mut we can continue to consume input.

Back to our implementation...

So! All we're doing in our next_element_seed checking there are more elements to read, then calling seed.deserialize with our deserializer. At this point we do not know the concrete type we are currently deserializing, we just have T. By calling deserialize on this T, we will call the relevant deserialize_* method.

Unlike the visitor, each deserialize_* method does not have a default implementation. So to support i32 we need to implement exactly deserialize_i32 on our deserializer (removing it from the forward_to_deserialize_any! macro), like so:

fn deserialize_i32<V>(self, visitor: V) -> Result<V::Value, Self::Error>
where
    V: Visitor<'de>,
{
    let mut buf = [0; 3];

    match self.input.read_exact(&mut buf) {
        Ok(_) => {
            // convert buffer to integer.
            let mut i = (buf[0] as i32) << 16;
            i += (buf[1] as i32) << 8;
            i += buf[2] as i32;

            visitor.visit_i32(i)
        }
        Err(_) => Err(Error::custom("could not read next integer")),
    }
}

Here we try to read exactly 3 bytes. If we're successful we do some bit shifting to create an i32 and call the visit_i32 method on it.

It is at this point where a visitor implementation might not have this method implemented, and it would be forwarded to visit_i64 thanks to the default implementation.

Our test now passes!

#[test]
fn three_byte_format() {
    let data = [
        3, // single byte for length
        0, 0, 1, // first 3 byte int
        0, 0, 2,
        0, 0, 3];

    let mut deserializer = Deserializer::from_bytes(&data);
    let res = Vec::<i32>::deserialize(&mut deserializer).unwrap();
    assert_eq!(&[1, 2, 3], res.as_slice());
}
$ cargo test
...
running 1 test
test tests::three_byte_format ... ok

Beyond a toy

If we changed the Vec<i32> in the test to Vec<i64>, then it would fail, because our deserializer does not implement deserialize_i64 (or rather, it does, but it forwards to deserialize_any, which errors).

In a more robust implementation, you would implement all of the numeric methods on the deserializer.

Maps

Maps are the most complex part to handle, and I'm going to leave quite a lot of it as an excerise to the reader. The structure and handling is very similar to the seq stuff we've just implemented, but with keys and values.

Here's a trimmed version of the MapAccess trait:

pub trait MapAccess<'de> {
    type Error: Error;

    fn next_key_seed<K>(&mut self, seed: K) -> Result<Option<K::Value>, Self::Error>
    where
        K: DeserializeSeed<'de>;

    fn next_value_seed<V>(&mut self, seed: V) -> Result<V::Value, Self::Error>
    where
        V: DeserializeSeed<'de>;

    //...
}

Similar to needing to implement next_element_seed for SeqAccess, we need to implement next_key_seed and next_value_seed.

We'd create one of these in the deserialize_map method of our deserializer. We'd set up the input to be ready to deserialize the first key of the map. Somewhere in these next methods you would need to get the input ready for the first value, then the second key, etc.

End

These are the major parts for creating a deserializer. I would recommend looking at the serde docs for a deserializer for more, as well as looking at existing implementations like serde_json. My own fastnbt crate also has a deserializer for the Minecraft NBT format, which might be simpler due to only working with byte slices.

Thanks for reading! Hope it was helpful.

Footnotes

  1. serde does support u128 behind a feature flag. It does not do float conversion for 128-bit integers by default)