Owen Gage: writing

Mapping into the serde data model

2022-08-05

The official Serde documentation talks about mapping into the data model and gives a demonstration using OsString. I want to walk through an example of doing this, showing the code to achieve it as we go. This assumes you already know the basics of implementing a Serde data format.

'Mapping into the data model' is when the mapping from the data format to the Serde data model is not straightforward. Usually you're doing it to capture extra information. For OsString, modelling it as an enum allows the deserializer to know which platform the value was originally created on, and to error if that platform is not compatible with the current one.

Worked example: Minecraft's arrays

The fastnbt crate for Minecraft's Named Binary Tag (NBT) format utilizes mapping into the Serde data model to accurately preserve NBT's redundant types.

NBT has a general list type for storing homogeneous NBT types, but also has byte arrays, integer arrays, and long arrays. This means that there are multiple ways of representing the same data in NBT.

Should a Vec<u8> serialize to a list of bytes, or a byte array? If a list of bytes deserializes to Vec<u8>, what does a byte array deserialize to? If they both became a Vec, you can't reliably serialize back to the same NBT data. Minecraft will not accept the wrong one, so it's important that FastNBT preserves this information.

This is a similar problem to OsString. Ultimately you want to encode more information in the Rust object. This might be to prevent errors, improve correctness, or improve performance.

fastnbt handles this by mapping the NBT array types into the serde data model as Maps containing a single key/value, where the key has a special value. If you could see the data model, an NBT long array might look like this:

Map
    key: string("__fastnbt_long_array")
    value: bytes([1,2,3,...])
end Map
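
To make that picture concrete, here is a minimal std-only sketch. The Model enum is a hypothetical stand-in for a handful of Serde data model values (it is not a Serde type), and as_long_array is an illustrative helper, not part of fastnbt. It shows that a plain list and a token-keyed map stay distinguishable, which is the whole point of the mapping:

```rust
/// Hypothetical stand-in for a few values in the Serde data model.
#[derive(Debug, PartialEq)]
enum Model {
    I8(i8),
    Str(&'static str),
    Bytes(Vec<u8>),
    Seq(Vec<Model>),
    Map(Vec<(Model, Model)>),
}

const LONG_ARRAY_TOKEN: &str = "__fastnbt_long_array";

/// Build the shape shown above: a map with one token key and a bytes value.
fn long_array(bytes: Vec<u8>) -> Model {
    Model::Map(vec![(Model::Str(LONG_ARRAY_TOKEN), Model::Bytes(bytes))])
}

/// Return the raw payload if `value` has the long-array shape, else None.
fn as_long_array(value: &Model) -> Option<&[u8]> {
    match value {
        Model::Map(entries) => match entries.as_slice() {
            [(Model::Str(key), Model::Bytes(bytes))] if *key == LONG_ARRAY_TOKEN => {
                Some(bytes.as_slice())
            }
            _ => None,
        },
        _ => None,
    }
}
```

A Model::Seq of I8 values (an NBT list of bytes) never matches as_long_array, so the two NBT representations cannot be confused once they are in the model.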

There are several pieces to make this work:

  • the deserializer calls visit methods to mimic a map structure when it sees an NBT array.
  • the serializer detects the __fastnbt_long_array key and serializes an NBT array.
  • (optional) provide a custom Serialize and Deserialize implementation.

Let's go through each of these.

Deserializer

NBT is a self-describing format. This means that for the most part we forward deserialize calls to deserialize_any. By the time we reach this call, we've already parsed the NBT tag, which tells us the type we're about to deserialize. We can match on this to call the appropriate visitor methods. Here is a cut-down version of deserialize_any:

fn deserialize_any<V>(mut self, v: V) -> Result<V::Value>
where
    V: de::Visitor<'de>,
{
    match self.tag {
        Tag::Byte => v.visit_i8(self.de.input.consume_byte()? as i8),
        Tag::Compound => v.visit_map(MapAccess::new(self.de)),
        Tag::LongArray => {
            let len = self.de.input.consume_i32()?;
            v.visit_map(ArrayWrapperAccess::longs(self.de, len)?)
        }
        // ... snip
    }
}

Deserialize methods on the deserializer call the appropriate visit methods depending on the data we find. You can see that if the tag is Byte, we call visit_i8, because this is what an NBT byte is. We consume that byte from our input and provide it to the visitor.

NBT Compound types are more complicated. Compounds are basically objects: they have named fields with values. In Serde's data model these are called maps, and they are consumed through types implementing the MapAccess trait. Types implementing this trait need to provide methods to extract the keys and values from the map.

We do the same thing for our LongArray type. Despite this logically being a series of i64, we're not calling the visit methods for that. Instead we call visit_map with a specific MapAccess implementation called ArrayWrapperAccess:

impl<'de, 'a, In: Input<'de> + 'a> de::MapAccess<'de> for ArrayWrapperAccess<'a, In> {
    type Error = Error;

    fn next_key_seed<K>(&mut self, seed: K) -> Result<Option<K::Value>>
    where
        K: de::DeserializeSeed<'de>,
    {
        if let State::Unread = self.state {
            self.state = State::Read;
            seed.deserialize(BorrowedStrDeserializer::new(self.token))
                .map(Some)
        } else {
            Ok(None)
        }
    }

    fn next_value_seed<V>(&mut self, seed: V) -> Result<V::Value>
    where
        V: de::DeserializeSeed<'de>,
    {
        let data = self
            .de
            .input
            .consume_bytes(self.bytes_size, &mut self.de.scratch)?;

        match data {
            Reference::Borrowed(bs) => seed.deserialize(BorrowedBytesDeserializer::new(bs)),
            Reference::Copied(bs) => seed.deserialize(BytesDeserializer::new(bs)),
        }
    }
}

The main trick here is to use Serde's helper types such as BorrowedStrDeserializer and BytesDeserializer. Using these we can artificially create a map. These helper types tend to just implement Deserializer in a way that calls the visit method matching the name of the helper, eg BorrowedStrDeserializer ends up calling visit_borrowed_str.

The self.token in next_key_seed is simply our __fastnbt_long_array string that we want to be the only key in the map.
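
The key/value protocol above can be hard to see through the generics, so here is a stripped-down, std-only model of it. OneEntryMap and its methods are hypothetical and deliberately do not implement Serde's MapAccess trait. They only mirror the state machine: hand out the token once as the key, hand out the payload as the value, then report the map as finished.

```rust
const LONG_ARRAY_TOKEN: &str = "__fastnbt_long_array";

enum State {
    Unread,
    Read,
}

/// Hypothetical, simplified model of a one-entry map (not Serde's trait).
struct OneEntryMap<'a> {
    state: State,
    token: &'static str,
    data: &'a [u8],
}

impl<'a> OneEntryMap<'a> {
    fn longs(data: &'a [u8]) -> Self {
        OneEntryMap {
            state: State::Unread,
            token: LONG_ARRAY_TOKEN,
            data,
        }
    }

    /// Yields the token exactly once, then reports the map as exhausted.
    fn next_key(&mut self) -> Option<&'static str> {
        match self.state {
            State::Unread => {
                self.state = State::Read;
                Some(self.token)
            }
            State::Read => None,
        }
    }

    /// Yields the raw array payload as the single value.
    fn next_value(&mut self) -> &'a [u8] {
        self.data
    }
}
```

A visitor driving this sees exactly one entry, which is what lets a one-field struct or a hand-written visit_map consume it.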

Custom Deserialize implementation

We can now create a type that matches the model, eg

struct LongArray {
    __fastnbt_long_array: serde_bytes::ByteBuf,
}

But in order to do some extra processing to these bytes, and to use a friendlier field name, we can implement Deserialize ourselves:

impl<'de> Deserialize<'de> for LongArray {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        deserializer.deserialize_map(InnerVisitor)
    }
}

struct InnerVisitor;
impl<'de> Visitor<'de> for InnerVisitor {
    type Value = LongArray;

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        formatter.write_str("long array")
    }

    fn visit_map<A>(self, mut map: A) -> Result<Self::Value, A::Error>
    where
        A: serde::de::MapAccess<'de>,
    {
        let token = map.next_key::<&str>()?.ok_or_else(|| {
            serde::de::Error::custom("expected NBT long array token, but got empty map")
        })?;
        let data = map.next_value::<ByteBuf>()?;

        match token {
            LONG_ARRAY_TOKEN => LongArray::from_bytes(&data)
                .map_err(|_| serde::de::Error::custom("could not read i64 for long array")),
            _ => Err(serde::de::Error::custom("expected NBT long array token")),
        }
    }
}

Here we have our custom visitor that extracts the key and value as the correct types. We then create our real LongArray type from those bytes. LongArray::from_bytes converts these bytes into a Vec<i64>, taking into account endianness.
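
As a rough idea of what that conversion involves, here is a minimal sketch. It assumes the payload is a sequence of big-endian 8-byte integers (NBT as used by the Java edition is big-endian). The function name and the BadLength error type are hypothetical, not fastnbt's actual API.

```rust
/// Hypothetical error: the payload length was not a multiple of 8.
#[derive(Debug, PartialEq)]
struct BadLength(usize);

/// Decode a byte payload into i64 values, assuming big-endian encoding.
fn longs_from_be_bytes(bytes: &[u8]) -> Result<Vec<i64>, BadLength> {
    if bytes.len() % 8 != 0 {
        return Err(BadLength(bytes.len()));
    }

    let mut longs = Vec::with_capacity(bytes.len() / 8);
    for chunk in bytes.chunks_exact(8) {
        // Copy into a fixed-size array so we can use i64::from_be_bytes.
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        longs.push(i64::from_be_bytes(buf));
    }
    Ok(longs)
}
```

Rejecting a length that is not a multiple of 8 is the kind of failure the "could not read i64 for long array" error above would report.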

Alternatively, we could create an inner type with the correct Serde model, derive Deserialize on it, deserialize into it and then convert to the real type. The approach above gives a better opportunity for descriptive error messages.

Serializer

Serializing NBT with Serde is a little complicated, because the order of serialize calls does not map neatly onto the serialized NBT data. NBT is structured as {tag}{name}{value}. When we serialize a map, we get the serialize key call first, followed by the serialize value call. However, we need to write the tag before the key, and we don't know the tag yet, because it depends on the type of the value we're going to serialize, which we haven't seen.

This means FastNBT has to 'delay' writing data until it has the information it needs. This is made worse by the fact we cannot even write the tag for the overall compound we're writing, because at this point we might actually be serializing an array type, not a compound. We need to see that first key!
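
Here is a small std-only sketch of that delay. PendingHeader, outer_tag_for_first_key and this write_header are simplified, hypothetical versions for illustration, not the crate's real types. The tag IDs 10 (Compound) and 12 (LongArray) are the standard NBT ones. The sketch writes the name as plain UTF-8 with a big-endian u16 length prefix, which I'm assuming is good enough for ASCII names.

```rust
use std::io::{self, Write};

// NBT tag IDs for the two things a Serde "map" can turn out to be.
const TAG_COMPOUND: u8 = 10;
const TAG_LONG_ARRAY: u8 = 12;
const LONG_ARRAY_TOKEN: &str = "__fastnbt_long_array";

/// Hypothetical delayed header: the name is known, the tag is not yet.
struct PendingHeader {
    name: String,
}

/// Decide the outer tag once the first key of the map has been seen.
fn outer_tag_for_first_key(key: &str) -> u8 {
    if key == LONG_ARRAY_TOKEN {
        TAG_LONG_ARRAY
    } else {
        TAG_COMPOUND
    }
}

/// Write `{tag}{name}`: one tag byte, a big-endian u16 length, then the name.
fn write_header<W: Write>(out: &mut W, header: PendingHeader, tag: u8) -> io::Result<()> {
    let len = header.name.len();
    if len > u16::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "name too long"));
    }
    out.write_all(&[tag])?;
    out.write_all(&(len as u16).to_be_bytes())?;
    out.write_all(header.name.as_bytes())
}
```

The header can only be flushed after outer_tag_for_first_key has run, which is why nothing is written when the map is first opened.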

Here is one of the serialize_map methods from FastNBT:

fn serialize_map(self, _len: Option<usize>) -> Result<Self::SerializeMap> {
    Ok(SerializerMap {
        ser: self.ser,
        header: self.header.take(),
        trailer: Some(Tag::End),
    })
}

We simply create our SerializeMap type. The header is the delayed information we still need to write out. The main part of this type is the serialize_entry method:

fn serialize_entry<K: ?Sized, V: ?Sized>(&mut self, key: &K, value: &V) -> Result<()>
where
    K: Serialize,
    V: Serialize,
{
    // We're expected to serialize the key first. This gets us the name.
    // We cannot write it out yet because we do not yet have the tag type.
    let mut name = Vec::new();
    key.serialize(&mut NameSerializer { name: &mut name })?;

    let outer_tag = match std::str::from_utf8(&name) {
        // ...
        Ok(LONG_ARRAY_TOKEN) => Tag::LongArray,
        _ => Tag::Compound,
    };

    if let Some(header) = self.header.take() {
        write_header(&mut self.ser.writer, header, outer_tag)?;
    }

    match std::str::from_utf8(&name) {
        // ...
        Ok(LONG_ARRAY_TOKEN) => {
            self.trailer = None;
            value.serialize(ArraySerializer {
                ser: self.ser,
                tag: Tag::LongArray,
            })
        }
        _ => value.serialize(&mut Delayed {
            ser: &mut *self.ser,
            header: Some(DelayedHeader::MapEntry { outer_name: name }),
            is_list: false,
        }),
    }
}

(NB: This should be implemented as serialize_key and serialize_value separately, I need to fix that.)

Here we can finally get the key and see if it matches our token string __fastnbt_long_array. If it does, we divert serialization to the ArraySerializer, which takes care of writing out the value (which we know will be modelled as bytes).

If it doesn't match the token, we know we're actually in a compound, and can serialize the map entry.

Custom Serialize implementation

Finally, we can write a quick custom implementation of Serialize for our LongArray:

impl Serialize for LongArray {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        #[derive(Serialize)]
        struct Inner {
            __fastnbt_long_array: ByteBuf,
        }

        Inner {
            __fastnbt_long_array: ByteBuf::from(self.to_bytes()),
        }
        .serialize(serializer)
    }
}

We just create a hidden type that matches the model, and call serialize on that. Easy.
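
The to_bytes call is not shown above. A minimal sketch of what such a conversion could look like, assuming each i64 is written as 8 big-endian bytes, is below. The function name is hypothetical rather than the crate's actual method.

```rust
/// Encode i64 values as a flat byte payload, assuming big-endian encoding.
fn longs_to_be_bytes(longs: &[i64]) -> Vec<u8> {
    let mut bytes = Vec::with_capacity(longs.len() * 8);
    for long in longs {
        bytes.extend_from_slice(&long.to_be_bytes());
    }
    bytes
}
```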

That completes the whole loop, and hopefully demonstrates a real case of mapping into the Serde data model.