Exploring serde's data model with a toy deserializer
Previously I did a shallow dive into understanding serde
by expanding the
Deserialize
macro. This time I'll go deeper and map how each type in serde's
data model is treated.
This is written for people intending to write their own deserializer for
serde
, but might be interesting for more people. Later in the article we will
implement a toy deserializer for some made up format.
There are three major categories of types to consider:
- Sequences of values (think
Vec
, slice, tuple) - Maps from keys to values (think
HashMap
,struct
) - Everything else
The 'Everything else' turns out to be the easiest, so lets start there.
Table of everything else
The basic steps followed between you and serde
when writing a deserializer is:
Deserialize::deserialize
is called on some type.- This calls a
deserialize_*
method on your deserializer, passing you aVistitor
. - You call appropriate
visit_*
methods for that call.
For example, calling deserialize
on a String
calls deserialize_string
on
your deserializer, you deserialize a string from your input and call
visit_string
on the visitor you were passed.
When writing a deserializer I wish I knew
- what real Rust types would end up calling my
deserialize_*
methods. - which
visit_*
methods my deserializer should call in response.
Before answering these questions, lets look at the Visitor
trait. This trait
has default implementations of each method. Some return an error, and others
forward the call to another visitor method.
This will be important later, since it means our deserializer can call a larger set of methods than you would naïvely expect.
Here's a summary of all the forwarded methods:
Visitor method
|
Forwarded to |
---|---|
visit_i8
|
visit_i64
|
visit_i16
|
|
visit_i32
|
|
visit_u8
|
visit_u64
|
visit_u16
|
|
visit_u32
|
|
visit_f32
|
visit_f64
|
visit_char
|
visit_str
|
visit_borrowed_str
|
visit_str
|
visit_string
|
visit_str
|
visit_borrowed_bytes
|
visit_bytes
|
visit_byte_buf
|
visit_bytes
|
(For completeness, the methods that do not forward are visit_bool
, visit_i64
, visit_u64
,
visit_f64
, visit_str
, visit_bytes
, visit_none
, visit_some
,
visit_unit
, visit_newtype_struct
, visit_seq
, visit_map
, and visit_enum
.)
Okay, back to who calls and expects what.
This table is mostly put together from impls.rs
in
serde
.
Each row has a Rust type that is being deserialized, the deserialize call that
will be called on the deserializer, and finally the acceptable method calls on
the visitor passed to the deserializer.
type being deserialized |
Deserializer method called
|
Expected Visitor method calls
|
---|---|---|
struct field identifier |
deserialize_identifer
|
visit_str , visit_bytes , but also{" "}
visit_u64 where the number is field number. (forwarded
from: visit_char ,visit_borrowed_str ,{" "}
visit_string ,visit_borrowed_bytes ,{" "}
visit_byte_buf , visit_{u8,16,32} )
|
bool
|
deserialize_bool
|
visit_bool
|
()
|
deserialize_unit
|
visit_unit
|
i8
|
deserialize_i8
|
|
i16
|
deserialize_i16
|
|
i32
|
deserialize_i32
|
|
i64
|
deserialize_i64
|
|
isize
|
deserialize_i64
|
|
u8
|
deserialize_u8
|
|
u16
|
deserialize_u16
|
|
u32
|
deserialize_u32
|
|
u64
|
deserialize_u64
|
|
usize
|
deserialize_u64
|
|
f32
|
deserialize_f32
|
|
f64
|
deserialize_f64
|
|
char
|
deserialize_char
|
visit_char , visit_str erroring if string
length not exactly 1. (forwarded from: visit_borrowed_str ,{" "}
visit_string )
|
String
|
deserialize_string
|
visit_char ,
visit_borrowed_str
,visit_borrowed_bytes )
|
&'a str
|
deserialize_str
|
|
&'a [u8]
|
deserialize_bytes
|
|
Option<T>
|
deserialize_option
|
visit_some , visit_none ,{" "}
visit_unit
|
PhantomData<T>
|
deserialize_unit_struct
|
visit_unit
|
Vec<T>
|
deserialize_seq
|
visit_seq
|
[T; N]
|
deserialize_tuple (!)
|
|
tuples eg (T,U)
|
deserialize_tuple
|
|
BTreeMap<K, V>
|
deserialize_map
|
visit_map
|
HashMap<K, V>
|
Most of this is unsurprising. Some things that sticks out to me are:
- an array causes
deserialize_tuple
to be called. I guess it is similar to tuples in that arrays and tuples are fixed sizes, unlike aVec
. - conversions to floating point use the
as
keyword to convert from integer types. The reference shows that this will convert to the nearest float, and that overflow to infinity can only happen foru128
tof32
1. So this conversion seems sensible. - you cannot
visit_bytes
for deserializing achar
like you would for aString
.
Sequences
Sequences are the next level up in difficulty for our category of types.
When a type like a Vec<T>
is deserialized, it calls deserialize_seq
on the
Vec
, passing in a visitor with visit_seq
implemented.
Here's the actual implementation in impls.rs
:
impl<'de, T> Deserialize<'de> for Vec<T> where T: Deserialize<'de>, { fn deserialize<D>(deserializer: D) -> Result<Self, D::Error> where D: Deserializer<'de>, { struct VecVisitor<T> { marker: PhantomData<T>, } impl<'de, T> Visitor<'de> for VecVisitor<T> where T: Deserialize<'de>, { type Value = Vec<T>; fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result { formatter.write_str("a sequence") } fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error> where A: SeqAccess<'de>, { let mut values = Vec::with_capacity(size_hint::cautious(seq.size_hint())); while let Some(value) = try!(seq.next_element()) { // ❷ values.push(value); } Ok(values) } } let visitor = VecVisitor { marker: PhantomData, }; deserializer.deserialize_seq(visitor) // ❶ } // ... snipped out deserialize_in_place }
The steps here are:
deserialize_seq
gets called with a visitor. ❶- The deserializer sets up an object implementing
SeqAccess
, and callsvisit_seq
with it. - The visitor repeatedly calls
next_element
on theSeqAccess
object untilNone
is returned (or it errors). ❷
This is much like with the Iterator
trait, but fallible. Some other stuff
happens, like the visitor asking for a hint to the size for efficiency, but a
SeqAccess
object doesn't have to provide this.
Let's look at the SeqAccess
trait:
pub trait SeqAccess<'de> { type Error: Error; fn next_element_seed<T>(&mut self, seed: T) -> Result<Option<T::Value>, Self::Error> where T: DeserializeSeed<'de>; fn next_element<T>(&mut self) -> Result<Option<T>, Self::Error> where T: Deserialize<'de>, { self.next_element_seed(PhantomData) } fn size_hint(&self) -> Option<usize> { None } }
The only method without a default implementation is next_element_seed
. If
you're writing a deserializer, this is what you need to implement. When it is
called it is expected that you deserialize a T
from your input and return it,
or None
if it's the end of the sequence.
In practice this usually means you call T::deserialize
repeatedly (recursively
depending on other bits of your deserializer). You just have to correctly figure
out when there are no more elements. This might be easy or hard depending on the
data format you're deserializing.
Example data format
Let's invent a very trivial data format to demonstrate sequences. Our format is simply going to be a list of integers, where each integer is 3 bytes, with a single byte at the front to tell us the length of the sequence. That's it.
| 1 byte | 3 bytes | ... | | -------------- | ------------------ | ------ | | length of ints | big-endian integer | repeat |
We're going to make the following work:
#[test] fn three_byte_format() { let data = [ 3, // single byte for length 0, 0, 1, // first 3 byte int 0, 0, 2, 0, 0, 3]; let mut deserializer = Deserializer::from_bytes(&data); let res = Vec::<i32>::deserialize(&mut deserializer).unwrap(); assert_eq!(&[1, 2, 3], res.as_slice()); }
Let's get the boilerplate our of the way. We need to set up our own custom error, which is just going to be a wrapped String. We aren't valuing good errors in our toy format, sorry:
use std::io::Read; use serde::de::{Error as _, SeqAccess, Visitor}; #[derive(Debug)] struct Error(String); impl std::fmt::Display for Error { fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { f.write_str(&self.0) } } impl std::error::Error for Error {} impl serde::de::Error for Error { fn custom<T>(msg: T) -> Self where T: std::fmt::Display, { Error(msg.to_string()) } } struct Deserializer<'de> { input: &'de [u8], } impl<'de> Deserializer<'de> { pub fn from_bytes(input: &'de [u8]) -> Self { Self { input } } } impl<'de, 'a> serde::de::Deserializer<'de> for &'a mut Deserializer<'de> { serde::forward_to_deserialize_any! { bool i8 i16 i32 i64 u8 u16 u32 u64 f32 f64 char str string byte_buf option unit unit_struct newtype_struct tuple tuple_struct seq map struct enum identifier ignored_any bytes } type Error = Error; fn deserialize_any<V>(self, visitor: V) -> Result<V::Value, Self::Error> where V: serde::de::Visitor<'de>, { use serde::de::Error; Err(Self::Error::custom("expected sequence")) } }
This is pretty much the minimum required to make things compile. There's a few things to note here:
serde
has it's ownError
trait we need to implement that requires acustom
method. This letsserde
report deserialization errors, eg 'exepected seq'.- Our deserializer just takes some bytes. Most real deserializers allow you to
provide anything that implements
Read
. The general approach is the same. - We use
forward_to_deserialize_any!
to implement all of the required methods. Our 'any' just errors. - We're implementing
Deserializer
for&'a mut Deserializer
, not the value itself. This is crucial later.
First thing is to actually implement deserialize_seq
since that is what will be
called when deserializing a Vec<T>
. We add the below implementation and remove
seq
from forward_to_deserialize_any!
.
fn deserialize_seq<V>(mut self, visitor: V) -> Result<V::Value, Self::Error> where V: Visitor<'de>, { let mut length = [0u8; 1]; self.input .read_exact(&mut length) .map_err(|_| Error::custom("read error"))?; let length = length[0] as usize; visitor.visit_seq(OurSeqAccess { inner: &mut self, remaining_length: length, }) }
The main thing we need to do here is get the length of the sequence that's
coming, and get the input to the point where the next thing to be deserialized
is the first element. In our format this is one in the same. For some formats
the length won't exist ahead of time at all (like an array in JSON). As long as
we can find the end it's fine (so ]
in JSON).
We create a new type that implements SeqAccess
that we're calling
OurSeqAccess
. This stores the length and the deserializer. Here's the implementation:
struct OurSeqAccess<'a, 'de> { inner: &'a mut Deserializer<'de>, remaining_length: usize, } impl<'a, 'de> SeqAccess<'de> for OurSeqAccess<'a, 'de> { type Error = Error; fn next_element_seed<T>(&mut self, seed: T) -> Result<Option<T::Value>, Self::Error> where T: serde::de::DeserializeSeed<'de>, { if self.remaining_length > 0 { self.remaining_length -= 1; let el = seed.deserialize(&mut *self.inner)?; Ok(Some(el)) } else { Ok(None) } } }
Some important details
We need the second lifetime 'a
for the lifetime of the deserializer itself.
The 'de
lifetime is that of the input data.
This code is why implementing Deserializer
on the mutable reference is
crucial. If we implemented it on our deserializer directly, we would have to
pass a clone to seed.deserialize
. This would be bad, because when we consume
input, it wouldn't be visible outside of the current function call (the slice in
the original deserializer would not change). By implementing on the &mut
we
can continue to consume input.
Back to our implementation...
So! All we're doing in our next_element_seed
checking there are more elements
to read, then calling seed.deserialize
with our deserializer. At this point we
do not know the concrete type we are currently deserializing, we just have T
.
By calling deserialize
on this T
, we will call the relevant
deserialize_*
method.
Unlike the visitor, each deserialize_*
method does not have a default
implementation. So to support i32
we need to implement exactly
deserialize_i32
on our deserializer (removing it from the
forward_to_deserialize_any!
macro), like so:
fn deserialize_i32<V>(self, visitor: V) -> Result<V::Value, Self::Error> where V: Visitor<'de>, { let mut buf = [0; 3]; match self.input.read_exact(&mut buf) { Ok(_) => { // convert buffer to integer. let mut i = (buf[0] as i32) << 16; i += (buf[1] as i32) << 8; i += buf[2] as i32; visitor.visit_i32(i) } Err(_) => Err(Error::custom("could not read next integer")), } }
Here we try to read exactly 3 bytes. If we're successful we do some bit shifting
to create an i32
and call the visit_i32
method on it.
It is at this point where a visitor implementation might not have this method
implemented, and it would be forwarded to visit_i64
thanks to the default
implementation.
Our test now passes!
#[test] fn three_byte_format() { let data = [ 3, // single byte for length 0, 0, 1, // first 3 byte int 0, 0, 2, 0, 0, 3]; let mut deserializer = Deserializer::from_bytes(&data); let res = Vec::<i32>::deserialize(&mut deserializer).unwrap(); assert_eq!(&[1, 2, 3], res.as_slice()); }
$ cargo test ... running 1 test test tests::three_byte_format ... ok
Beyond a toy
If we changed the Vec<i32>
in the test to Vec<i64>
, then it would fail,
because our deserializer does not implement deserialize_i64
(or rather, it
does, but it forwards to deserialize_any
, which errors).
In a more robust implementation, you would implement all of the numeric methods on the deserializer.
Maps
Maps are the most complex part to handle, and I'm going to leave quite a lot of
it as an excerise to the reader. The structure and handling is very similar to
the seq
stuff we've just implemented, but with keys and values.
Here's a trimmed version of the MapAccess
trait:
pub trait MapAccess<'de> { type Error: Error; fn next_key_seed<K>(&mut self, seed: K) -> Result<Option<K::Value>, Self::Error> where K: DeserializeSeed<'de>; fn next_value_seed<V>(&mut self, seed: V) -> Result<V::Value, Self::Error> where V: DeserializeSeed<'de>; //... }
Similar to needing to implement next_element_seed
for SeqAccess
, we need to
implement next_key_seed
and next_value_seed
.
We'd create one of these in the
deserialize_map
method of our deserializer. We'd set up the input to be ready
to deserialize the first key of the map. Somewhere in these next methods you
would need to get the input ready for the first value, then the second key, etc.
End
These are the major parts for creating a deserializer. I would recommend looking
at the serde docs for a deserializer
for more, as well as looking at existing implementations like
serde_json. My own
fastnbt
crate also has a
deserializer
for the Minecraft NBT format, which might be simpler due to only working with
byte slices.
Thanks for reading! Hope it was helpful.
Footnotes
-
serde
does supportu128
behind a feature flag. It does not do float conversion for 128-bit integers by default) ↩