Mapping into the serde data model
The official Serde documentation talks about mapping into the data model and gives a demonstration using OsString. I want to walk through an example of doing this, showing the code to achieve it as we go. This assumes you already know the basics of implementing a Serde data format.
'Mapping into the data model' is when the mapping from the data format to the Serde data model is not straightforward. Usually you're doing it to capture extra information. For OsString, modelling it as an enum lets the deserializer know which platform the value was originally created on, and return an error if the platforms are not compatible.
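Roughly speaking, the data model shape for OsString is a two-variant enum, with the variant recording the platform. Here is a minimal sketch of that idea (the names and the impl are illustrative, not serde's actual source):

```rust
use serde::{Serialize, Serializer};

// Illustrative stand-in for how an OsString can be modelled as an enum in the
// serde data model, so the originating platform travels with the data.
enum OsStringModel {
    Unix(Vec<u8>),     // raw bytes on Unix-like platforms
    Windows(Vec<u16>), // UTF-16 code units on Windows
}

impl Serialize for OsStringModel {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        match self {
            OsStringModel::Unix(bytes) => {
                serializer.serialize_newtype_variant("OsString", 0, "Unix", bytes)
            }
            OsStringModel::Windows(units) => {
                serializer.serialize_newtype_variant("OsString", 1, "Windows", units)
            }
        }
    }
}
```

A deserializer on the other platform then sees a variant it cannot represent and can report an error rather than silently mangling the value.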
Worked example: Minecraft's arrays
The fastnbt
crate for Minecraft's Named
Binary Tag (NBT) format utilizes mapping into the Serde data model to accurately
preserve NBT's redundant types.
NBT has a general list type for storing homogeneous NBT types, but also has byte arrays, integer arrays, and long arrays. This means that there are multiple ways of representing the same data in NBT.
Should a Vec<u8> serialize to a list of bytes, or to a byte array? If a list of bytes deserializes to a Vec<u8>, what does a byte array deserialize to? If they both become a Vec<u8>, you can't reliably serialize back to the same NBT data. Minecraft will not accept the wrong one, so it's important that FastNBT preserves this information.
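To make the ambiguity concrete, here is a hypothetical struct (the field names are made up) where both fields reach the serde data model as a plain sequence of integers, leaving a serializer no way to choose between the two NBT encodings:

```rust
// Both fields hit the serde data model as a sequence of i64, so a serializer
// cannot tell whether to emit an NBT list of longs or an NBT long array.
#[derive(serde::Serialize)]
struct Chunk {
    list_of_longs: Vec<i64>,
    long_array: Vec<i64>,
}
```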
This is a similar problem to OsString. Ultimately you want to encode more information in the Rust object. This might be to prevent errors, improve correctness, or improve performance.
fastnbt handles this by mapping the NBT array types into the serde data model as maps containing a single key/value pair, where the key has a special value. If you could see the data model, an NBT long array might look like this:
```
Map
    key: string("__fastnbt_long_array")
    value: bytes([1,2,3,...])
end Map
```
There are several pieces to make this work:
- the deserializer calls visit methods to mimic a map structure when it sees an NBT array.
- the serializer detects the __fastnbt_long_array key and serializes an NBT array.
- (optional) provide a custom Serialize and Deserialize implementation.
Let's go through each of these.
Deserializer
NBT is a self-describing format. This means that for the most part we forward deserialize calls to deserialize_any. By the time we reach this call, we've already parsed the NBT tag, which tells us the type we're about to deserialize.
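The forwarding itself is mostly boilerplate. Here is a rough sketch of the standard serde idiom (the type names are made up; this is not fastnbt's actual deserializer):

```rust
use serde::de::{self, Visitor};
use serde::forward_to_deserialize_any;

// Placeholder standing in for the real deserializer and its parsing state.
struct SketchNbtDeserializer;

impl<'de> de::Deserializer<'de> for SketchNbtDeserializer {
    type Error = de::value::Error;

    fn deserialize_any<V>(self, _visitor: V) -> Result<V::Value, Self::Error>
    where
        V: Visitor<'de>,
    {
        // A real implementation dispatches on the already-parsed NBT tag here.
        Err(de::Error::custom("sketch only"))
    }

    // Every typed entry point simply defers to deserialize_any.
    forward_to_deserialize_any! {
        bool i8 i16 i32 i64 u8 u16 u32 u64 f32 f64 char str string bytes
        byte_buf option unit unit_struct newtype_struct seq tuple tuple_struct
        map struct enum identifier ignored_any
    }
}
```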
We can match on that tag to call the appropriate visitor methods. Here is a cut-down version of fastnbt's deserialize_any:
```rust
fn deserialize_any<V>(mut self, v: V) -> Result<V::Value>
where
    V: de::Visitor<'de>,
{
    match self.tag {
        Tag::Byte => v.visit_i8(self.de.input.consume_byte()? as i8),
        Tag::Compound => v.visit_map(MapAccess::new(self.de)),
        Tag::LongArray => {
            let len = self.de.input.consume_i32()?;
            v.visit_map(ArrayWrapperAccess::longs(self.de, len)?)
        }
        // ... snip
    }
}
```
Deserialize methods on the deserializer call the appropriate visit methods depending on the data we find. You can see that if the type is Byte, we call visit_i8, because this is what an NBT byte is. We consume that byte from our input and provide it to the visitor.
NBT Compound types are more complicated. Compounds are basically objects: they have named fields with values. In Serde's data model these are called maps, and they are consumed by types implementing the MapAccess trait. Such types need to provide methods to extract the keys and values from the map.
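To see that flow in isolation, here is a small self-contained example, unrelated to fastnbt's real compound handling, that drives a visitor using serde's in-memory MapDeserializer helper:

```rust
use std::collections::HashMap;

use serde::de::value::{Error, MapDeserializer};
use serde::de::{Deserializer, MapAccess, Visitor};

struct CollectVisitor;

impl<'de> Visitor<'de> for CollectVisitor {
    type Value = HashMap<String, i64>;

    fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        f.write_str("a map of string keys to i64 values")
    }

    // The MapAccess hands us key/value pairs until it reports None.
    fn visit_map<A: MapAccess<'de>>(self, mut map: A) -> Result<Self::Value, A::Error> {
        let mut out = HashMap::new();
        while let Some((key, value)) = map.next_entry::<String, i64>()? {
            out.insert(key, value);
        }
        Ok(out)
    }
}

fn main() {
    // MapDeserializer exposes an in-memory iterator of pairs as a map in the
    // serde data model, much like a format's own MapAccess type would.
    let pairs = vec![("x", 1i64), ("y", 2i64)];
    let de = MapDeserializer::<_, Error>::new(pairs.into_iter());
    let parsed = de.deserialize_map(CollectVisitor).unwrap();
    assert_eq!(parsed["x"], 1);
}
```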
We do the same thing for our LongArray type. Despite this logically being a series of i64, we're not calling the visit methods for that. Instead we call visit_map with a specific MapAccess implementation called ArrayWrapperAccess:
```rust
impl<'de, 'a, In: Input<'de> + 'a> de::MapAccess<'de> for ArrayWrapperAccess<'a, In> {
    type Error = Error;

    fn next_key_seed<K>(&mut self, seed: K) -> Result<Option<K::Value>>
    where
        K: de::DeserializeSeed<'de>,
    {
        if let State::Unread = self.state {
            self.state = State::Read;
            seed.deserialize(BorrowedStrDeserializer::new(self.token))
                .map(Some)
        } else {
            Ok(None)
        }
    }

    fn next_value_seed<V>(&mut self, seed: V) -> Result<V::Value>
    where
        V: de::DeserializeSeed<'de>,
    {
        let data = self
            .de
            .input
            .consume_bytes(self.bytes_size, &mut self.de.scratch)?;

        match data {
            Reference::Borrowed(bs) => seed.deserialize(BorrowedBytesDeserializer::new(bs)),
            Reference::Copied(bs) => seed.deserialize(BytesDeserializer::new(bs)),
        }
    }
}
```
The main trick here is to use Serde's helper types such as BorrowedStrDeserializer and BytesDeserializer. Using these we can artificially create a map. These helper types tend to just implement Deserializer in a way that calls the visit method matching the name of the helper, e.g. BorrowedStrDeserializer ends up calling visit_borrowed_str.
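As a quick standalone illustration of these value deserializers (separate from fastnbt; the error type is chosen arbitrarily):

```rust
use serde::de::value::{BorrowedStrDeserializer, Error};
use serde::Deserialize;

fn main() {
    // Wraps a &str in a Deserializer; deserializing from it drives the
    // visitor through visit_borrowed_str, so we can borrow the key back out.
    let de = BorrowedStrDeserializer::<Error>::new("__fastnbt_long_array");
    let key: &str = Deserialize::deserialize(de).unwrap();
    assert_eq!(key, "__fastnbt_long_array");
}
```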
The self.token in next_key_seed is simply our __fastnbt_long_array string that we want to be the only key in the map.
Custom Deserialize implementation
We can now create a type that matches the model, e.g.

```rust
struct LongArray {
    __fastnbt_long_array: serde_bytes::ByteBuf,
}
```
But in order to do some extra processing on these bytes, and to use a friendlier field name, we can implement Deserialize ourselves:
```rust
impl<'de> Deserialize<'de> for LongArray {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        deserializer.deserialize_map(InnerVisitor)
    }
}

struct InnerVisitor;

impl<'de> Visitor<'de> for InnerVisitor {
    type Value = LongArray;

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
        formatter.write_str("long array")
    }

    fn visit_map<A>(self, mut map: A) -> Result<Self::Value, A::Error>
    where
        A: serde::de::MapAccess<'de>,
    {
        let token = map.next_key::<&str>()?.ok_or_else(|| {
            serde::de::Error::custom("expected NBT long array token, but got empty map")
        })?;

        let data = map.next_value::<ByteBuf>()?;

        match token {
            LONG_ARRAY_TOKEN => LongArray::from_bytes(&data)
                .map_err(|_| serde::de::Error::custom("could not read i64 for long array")),
            _ => Err(serde::de::Error::custom("expected NBT long array token")),
        }
    }
}
```
Here we have our custom visitor that extracts the key and value as the correct types. We then create our real LongArray type from those bytes. LongArray::from_bytes converts these bytes into a Vec<i64>, taking into account endianness.
Alternatively, we could create an inner type with the correct Serde model, derive Deserialize on that, deserialize into it, and then convert to the real type. The approach above gives a better opportunity for descriptive error messages.
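For comparison, a minimal sketch of that derive-based alternative (assuming the same LongArray::from_bytes and error message as above) could look like this:

```rust
impl<'de> Deserialize<'de> for LongArray {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        // The inner type's field name matches the token, so the derived
        // implementation consumes the single-entry map for us.
        #[derive(Deserialize)]
        struct Inner {
            __fastnbt_long_array: serde_bytes::ByteBuf,
        }

        let inner = Inner::deserialize(deserializer)?;
        LongArray::from_bytes(&inner.__fastnbt_long_array)
            .map_err(|_| serde::de::Error::custom("could not read i64 for long array"))
    }
}
```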
Serializer
Serializing NBT with Serde is a little complicated, because the order of serialize calls does not neatly map onto the serialized NBT data. NBT is structured as {tag}{name}{value}. When we serialize a map, we get the serialize-key call first, followed by the serialize-value call. But we need to write the tag before we write the key, and we don't know the tag because it depends on the type of the value we're going to serialize, which we haven't seen yet.
This means FastNBT has to 'delay' writing data until it has the information it needs. This is made worse by the fact we cannot even write the tag for the overall compound we're writing, because at this point we might actually be serializing an array type, not a compound. We need to see that first key!
Here is one of the serialize_map methods from FastNBT:
```rust
fn serialize_map(self, _len: Option<usize>) -> Result<Self::SerializeMap> {
    Ok(SerializerMap {
        ser: self.ser,
        header: self.header.take(),
        trailer: Some(Tag::End),
    })
}
```
We simply create our SerializerMap type. The header is the delayed information we still need to write out. The main part of this type is the serialize_entry method:
```rust
fn serialize_entry<K: ?Sized, V: ?Sized>(&mut self, key: &K, value: &V) -> Result<()>
where
    K: Serialize,
    V: Serialize,
{
    // We're expected to serialize the key first. This gets us the name.
    // We cannot write it out yet because we do not yet have the tag type.
    let mut name = Vec::new();
    key.serialize(&mut NameSerializer { name: &mut name })?;

    let outer_tag = match std::str::from_utf8(&name) {
        // ...
        Ok(LONG_ARRAY_TOKEN) => Tag::LongArray,
        _ => Tag::Compound,
    };

    if let Some(header) = self.header.take() {
        write_header(&mut self.ser.writer, header, outer_tag)?;
    }

    match std::str::from_utf8(&name) {
        // ...
        Ok(LONG_ARRAY_TOKEN) => {
            self.trailer = None;
            value.serialize(ArraySerializer {
                ser: self.ser,
                tag: Tag::LongArray,
            })
        }
        _ => value.serialize(&mut Delayed {
            ser: &mut *self.ser,
            header: Some(DelayedHeader::MapEntry { outer_name: name }),
            is_list: false,
        }),
    }
}
```
(NB: this should really be implemented as separate serialize_key and serialize_value methods; I need to fix that.)
Here we can finally get the key and see if it matches our token string __fastnbt_long_array. If it does, we divert serialization to the ArraySerializer, which takes care of writing out the value (which we know will be modelled as bytes).
If it doesn't match the token, we know we're actually in a compound, and can serialize the map entry.
Custom Serialize implementation
Finally, we can write a quick custom implementation of Serialize for our LongArray:
```rust
impl Serialize for LongArray {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        #[derive(Serialize)]
        struct Inner {
            __fastnbt_long_array: ByteBuf,
        }

        Inner {
            __fastnbt_long_array: ByteBuf::from(self.to_bytes()),
        }
        .serialize(serializer)
    }
}
```
We just create a hidden type that matches the model, and call serialize on that. Easy.
That completes the whole loop, and hopefully demonstrates a real case of mapping into the Serde data model.