Understanding Rust's serde using macro expansion
While I was writing fastnbt, I struggled to find an in-depth explanation of how to write a deserializer with serde. In this article I explore how serde works using cargo-expand.
This article expects familiarity with Rust, and at least a little experience using the de facto serialization/deserialization library serde.
Expansive mess
cargo expand is a custom subcommand for Cargo that lets you print the results of expanding a macro. Let's try it for a simple Deserialize macro:
    use serde::Deserialize;

    #[derive(Deserialize)]
    struct Human {
        name: String,
    }
Here we simply have a Human struct that contains a name. We derive an implementation of the Deserialize trait. If we run cargo expand...
    cargo install cargo-expand
    cargo expand
...then we get the incredibly short... (don't spend time looking at this)
#[doc(hidden)] #[allow(non_upper_case_globals, unused_attributes, unused_qualifications)] const _: () = { #[allow(unused_extern_crates, clippy::useless_attribute)] extern crate serde as _serde; #[automatically_derived] impl<'de> _serde::Deserialize<'de> for Human { fn deserialize<__D>(__deserializer: __D) -> _serde::__private::Result<Self, __D::Error> where __D: _serde::Deserializer<'de>, { #[allow(non_camel_case_types)] enum __Field { __field0, __ignore, } struct __FieldVisitor; impl<'de> _serde::de::Visitor<'de> for __FieldVisitor { type Value = __Field; fn expecting( &self, __formatter: &mut _serde::__private::Formatter, ) -> _serde::__private::fmt::Result { _serde::__private::Formatter::write_str(__formatter, "field identifier") } fn visit_u64<__E>(self, __value: u64) -> _serde::__private::Result<Self::Value, __E> where __E: _serde::de::Error, { match __value { 0u64 => _serde::__private::Ok(__Field::__field0), _ => _serde::__private::Ok(__Field::__ignore), } } fn visit_str<__E>( self, __value: &str, ) -> _serde::__private::Result<Self::Value, __E> where __E: _serde::de::Error, { match __value { "name" => _serde::__private::Ok(__Field::__field0), _ => _serde::__private::Ok(__Field::__ignore), } } fn visit_bytes<__E>( self, __value: &[u8], ) -> _serde::__private::Result<Self::Value, __E> where __E: _serde::de::Error, { match __value { b"name" => _serde::__private::Ok(__Field::__field0), _ => _serde::__private::Ok(__Field::__ignore), } } } impl<'de> _serde::Deserialize<'de> for __Field { #[inline] fn deserialize<__D>( __deserializer: __D, ) -> _serde::__private::Result<Self, __D::Error> where __D: _serde::Deserializer<'de>, { _serde::Deserializer::deserialize_identifier(__deserializer, __FieldVisitor) } } struct __Visitor<'de> { marker: _serde::__private::PhantomData<Human>, lifetime: _serde::__private::PhantomData<&'de ()>, } impl<'de> _serde::de::Visitor<'de> for __Visitor<'de> { type Value = Human; fn expecting( &self, __formatter: &mut 
_serde::__private::Formatter, ) -> _serde::__private::fmt::Result { _serde::__private::Formatter::write_str(__formatter, "struct Human") } #[inline] fn visit_seq<__A>( self, mut __seq: __A, ) -> _serde::__private::Result<Self::Value, __A::Error> where __A: _serde::de::SeqAccess<'de>, { let __field0 = match match _serde::de::SeqAccess::next_element::<String>(&mut __seq) { _serde::__private::Ok(__val) => __val, _serde::__private::Err(__err) => { return _serde::__private::Err(__err); } } { _serde::__private::Some(__value) => __value, _serde::__private::None => { return _serde::__private::Err(_serde::de::Error::invalid_length( 0usize, &"struct Human with 1 element", )); } }; _serde::__private::Ok(Human { name: __field0 }) } #[inline] fn visit_map<__A>( self, mut __map: __A, ) -> _serde::__private::Result<Self::Value, __A::Error> where __A: _serde::de::MapAccess<'de>, { let mut __field0: _serde::__private::Option<String> = _serde::__private::None; while let _serde::__private::Some(__key) = match _serde::de::MapAccess::next_key::<__Field>(&mut __map) { _serde::__private::Ok(__val) => __val, _serde::__private::Err(__err) => { return _serde::__private::Err(__err); } } { match __key { __Field::__field0 => { if _serde::__private::Option::is_some(&__field0) { return _serde::__private::Err( <__A::Error as _serde::de::Error>::duplicate_field("name"), ); } __field0 = _serde::__private::Some( match _serde::de::MapAccess::next_value::<String>(&mut __map) { _serde::__private::Ok(__val) => __val, _serde::__private::Err(__err) => { return _serde::__private::Err(__err); } }, ); } _ => { let _ = match _serde::de::MapAccess::next_value::< _serde::de::IgnoredAny, >(&mut __map) { _serde::__private::Ok(__val) => __val, _serde::__private::Err(__err) => { return _serde::__private::Err(__err); } }; } } } let __field0 = match __field0 { _serde::__private::Some(__field0) => __field0, _serde::__private::None => { match _serde::__private::de::missing_field("name") { _serde::__private::Ok(__val) 
=> __val, _serde::__private::Err(__err) => { return _serde::__private::Err(__err); } } } }; _serde::__private::Ok(Human { name: __field0 }) } } const FIELDS: &'static [&'static str] = &["name"]; _serde::Deserializer::deserialize_struct( __deserializer, "Human", FIELDS, __Visitor { marker: _serde::__private::PhantomData::<Human>, lifetime: _serde::__private::PhantomData, }, ) } } };
We can add some clarity here by:
- Replacing private aliases with the more expected form. So _serde::__private::Result is actually std::result::Result.
- Renaming type parameters to be easier on the eyes, like __D to just D.
- Removing some of the annotations like #[automatically_derived].
- Removing the wrapping scope, i.e. const _: () = {...}.
- Moving nested struct and impl blocks to the top level.
The macro generates these aliases, odd names, annotations, and scopes to isolate the expanded code from the code around it, preventing the expanded code from affecting yours and yours from affecting the expanded code.
A Reddit user pointed out that the wrapping scope was introduced because of GitHub serde issue 159.
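To see why the wrapping scope helps, here is a minimal std-only sketch of the same trick. The Describe trait and the Helper type are hypothetical stand-ins I made up for illustration; they are not part of serde.

```rust
// Hypothetical trait standing in for serde's Deserialize.
trait Describe {
    fn describe(&self) -> String;
}

struct Human {
    name: String,
}

// The anonymous const gives the "generated" code a scope of its own.
const _: () = {
    // Only visible inside this block, so it cannot clash with a
    // `Helper` that the surrounding crate might define.
    struct Helper;

    impl Helper {
        fn prefix() -> &'static str {
            "human: "
        }
    }

    // Trait implementations are not scoped to the block: this impl
    // applies to Human everywhere in the crate.
    impl Describe for Human {
        fn describe(&self) -> String {
            format!("{}{}", Helper::prefix(), self.name)
        }
    }
};
```

Outside the block, Human { .. }.describe() works as usual, while naming Helper would be a compile error. That is the same split the derive relies on: helper types stay private, the trait implementation escapes.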
There are quite a few types and implementations created by this expansion. Below is a quick summary:
| Thing | Description |
| --- | --- |
| impl Deserialize for Human | This is exactly what we wanted to derive. |
| struct HumanVisitor | A visitor that gets called by the deserializer. Its job is to produce the Human value. |
| enum Field | This enum represents the fields of our struct Human. In our case it simply contains field0 for 'name', and an ignore variant. |
| struct FieldVisitor | A visitor purely to check that identifier-like values produced by the deserializer match our fields. |
Deserialize implementation
After all that clean up for human eyes, here's our Deserialize implementation:
    impl<'de> serde::Deserialize<'de> for Human {
        fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
        where
            D: Deserializer<'de>,
        {
            const FIELDS: &'static [&'static str] = &["name"];
            serde::Deserializer::deserialize_struct(
                deserializer,
                "Human",
                FIELDS,
                HumanVisitor {
                    marker: PhantomData::<Human>,
                    lifetime: PhantomData,
                },
            )
        }
    }
We can see here that this just delegates to the deserialize_struct method, passing some extra information like the names of our fields, and a visitor that was also generated by the macro. Nothing too complicated here. What's that HumanVisitor?
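The delegation can feel backwards at first, so here is a rough std-only miniature of the hand-off. The two traits below are heavily simplified assumptions of mine (no lifetimes, String errors, a Vec instead of map access), and KeyValueText is a made-up format; none of this is serde's real API.

```rust
// Toy stand-ins for serde's traits; simplified assumptions, not the real API.
trait Visitor {
    type Value;
    fn visit_map(self, entries: Vec<(String, String)>) -> Result<Self::Value, String>;
}

trait Deserializer {
    fn deserialize_struct<V: Visitor>(
        self,
        name: &'static str,
        fields: &'static [&'static str],
        visitor: V,
    ) -> Result<V::Value, String>;
}

struct Human {
    name: String,
}

struct HumanVisitor;

impl Visitor for HumanVisitor {
    type Value = Human;

    fn visit_map(self, entries: Vec<(String, String)>) -> Result<Human, String> {
        entries
            .into_iter()
            .find(|(key, _)| key == "name")
            .map(|(_, value)| Human { name: value })
            .ok_or_else(|| "missing field `name`".to_string())
    }
}

impl Human {
    // The type only states what it expects and hands over a visitor.
    fn deserialize<D: Deserializer>(deserializer: D) -> Result<Human, String> {
        const FIELDS: &[&str] = &["name"];
        deserializer.deserialize_struct("Human", FIELDS, HumanVisitor)
    }
}

// A hypothetical format: one `key=value` pair per line.
struct KeyValueText<'a>(&'a str);

impl<'a> Deserializer for KeyValueText<'a> {
    fn deserialize_struct<V: Visitor>(
        self,
        _name: &'static str,
        _fields: &'static [&'static str],
        visitor: V,
    ) -> Result<V::Value, String> {
        // The format decides how to read the input, then calls back
        // into the visitor with what it found.
        let mut entries = Vec::new();
        for line in self.0.lines() {
            let (key, value) = line
                .split_once('=')
                .ok_or_else(|| format!("expected `key=value`, got `{line}`"))?;
            entries.push((key.to_string(), value.to_string()));
        }
        visitor.visit_map(entries)
    }
}
```

With these definitions, Human::deserialize(KeyValueText("age=36\nname=Ada")) yields a Human named Ada. The type never parses anything itself, and the format never learns what a Human is.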
Our visitor
Here's our visitor with some code snipped out for brevity:
    struct HumanVisitor<'de> {
        marker: PhantomData<Human>,
        lifetime: PhantomData<&'de ()>,
    }

    impl<'de> serde::de::Visitor<'de> for HumanVisitor<'de> {
        type Value = Human;

        fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
            fmt::Formatter::write_str(formatter, "struct Human")
        }

        fn visit_seq<A>(self, mut seq: A) -> Result<Self::Value, A::Error>
        where
            A: SeqAccess<'de>,
        {
            // ...
        }

        fn visit_map<A>(self, mut map: A) -> Result<Self::Value, A::Error>
        where
            A: MapAccess<'de>,
        {
            // ...
        }
    }
The purpose of a visitor is to be driven by the deserializer, constructing the values as it goes. This visitor in particular expects the deserializer to call methods whose data can be converted to our Human structure. It would make no sense if the visitor received an integer type, because that does not represent a structure.
The visitor only implements the methods that make sense for the type it is expecting. For structs, you would generally expect a map of key-value pairs. This is why our visitor implements visit_map. It also implements visit_seq (seq for sequence); this supports formats that encode the values in order, skipping keys.
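The visit_seq body was snipped above, but in the raw expansion it asks for the next element and reports an invalid length if the sequence ends early. Here is a std-only sketch of that shape. The SeqAccess trait below is a simplified stand-in I wrote for illustration (only String elements, String errors), and Positional is a hypothetical keyless format.

```rust
struct Human {
    name: String,
}

// Simplified stand-in for serde's SeqAccess: yields the next value, if any.
trait SeqAccess {
    fn next_element(&mut self) -> Result<Option<String>, String>;
}

// Same shape as the generated visit_seq: take fields in declaration order.
fn visit_seq<A: SeqAccess>(mut seq: A) -> Result<Human, String> {
    let name = match seq.next_element()? {
        Some(value) => value,
        None => {
            // Modelled on the invalid_length call in the expansion.
            return Err("invalid length 0, expected struct Human with 1 element".to_string());
        }
    };
    Ok(Human { name })
}

// A toy positional format: values are stored in order, without keys.
struct Positional(std::vec::IntoIter<String>);

impl SeqAccess for Positional {
    fn next_element(&mut self) -> Result<Option<String>, String> {
        Ok(self.0.next())
    }
}
```

Because nothing here looks at keys, the order of the struct's fields becomes part of the encoding for formats that take this path.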
Default method implementations
The serde::de::Visitor trait has default implementations of each method, which simply raise an error or forward the call on to another method. Here's the default implementation of visit_bool:
    fn visit_bool<E>(self, v: bool) -> Result<Self::Value, E>
    where
        E: Error,
    {
        Err(Error::invalid_type(Unexpected::Bool(v), &self))
    }
So if the deserializer calls the visitor with a bool, but it does not implement the method, by default you will get an error. Some method implementations add convenience for the common cases, like visit_u8, which forwards on to visit_u64:
    fn visit_u8<E>(self, v: u8) -> Result<Self::Value, E>
    where
        E: Error,
    {
        self.visit_u64(v as u64)
    }
This makes some sense. If you are making a visitor that accepts unsigned integer types, you can implement visit_u64 and handle unsigned integers of smaller types like u32 for free. This can make untagged enums containing these forwarded types difficult, however, forcing us to create stricter deserializers.
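Here is a small std-only model of how those defaults play out. The trait is my own cut-down assumption (three methods, String errors) and AgeVisitor is a hypothetical visitor; serde's real trait has many more methods and a generic error type.

```rust
// A miniature visitor trait; a simplified assumption, not serde's definition.
trait Visitor: Sized {
    type Value;

    // What this visitor expects, used in error messages.
    fn expecting(&self) -> String;

    // Default: reject booleans.
    fn visit_bool(self, v: bool) -> Result<Self::Value, String> {
        Err(format!("invalid type: boolean `{}`, expected {}", v, self.expecting()))
    }

    // Default: widen the value and forward it.
    fn visit_u8(self, v: u8) -> Result<Self::Value, String> {
        self.visit_u64(u64::from(v))
    }

    // Default: reject integers.
    fn visit_u64(self, v: u64) -> Result<Self::Value, String> {
        Err(format!("invalid type: integer `{}`, expected {}", v, self.expecting()))
    }
}

// Hypothetical visitor that only cares about unsigned integers.
struct AgeVisitor;

impl Visitor for AgeVisitor {
    type Value = u64;

    fn expecting(&self) -> String {
        "an age in years".to_string()
    }

    // Overriding only visit_u64 also covers visit_u8 through forwarding.
    fn visit_u64(self, v: u64) -> Result<u64, String> {
        Ok(v)
    }
}
```

AgeVisitor.visit_u8(36) succeeds even though AgeVisitor never mentions u8, while AgeVisitor.visit_bool(true) falls through to the default error. It also hints at the untagged enum problem: by the time visit_u64 runs, the visitor can no longer tell that the value started life as a u8.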
Map access
Let's take a look at the code for visit_map:
    fn visit_map<A>(self, mut map: A) -> Result<Self::Value, A::Error>
    where
        A: MapAccess<'de>,
    {
        let mut field0: Option<String> = None;
        while let Some(key) = MapAccess::next_key::<Field>(&mut map)? {
            match key {
                Field::field0 => {
                    if Option::is_some(&field0) {
                        return Err(<A::Error as Error>::duplicate_field("name"));
                    }
                    field0 = Some(MapAccess::next_value::<String>(&mut map)?);
                }
                _ => {
                    let _ = MapAccess::next_value::<serde::de::IgnoredAny>(&mut map)?;
                }
            }
        }
        let field0 = match field0 {
            Some(field0) => field0,
            None => serde::__private::de::missing_field("name")?,
        };
        Ok(Human { name: field0 })
    }
This function iterates through the given map's keys and values, looking for any of the fields of the structure. In our case we have the single field 'name', which is encoded in the Field type that we will look at in a second.
For each key, it checks if it is one of our fields. If it is, it tries to get the value as the expected String type. It ignores fields that are not in our structure. Finally it checks it has all the required fields and returns our Human.
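Stripped of serde's traits, the control flow is small enough to try out with nothing but std. This is only a sketch: human_from_entries is a name I invented, the entries are assumed to be already-decoded string pairs, and the error strings are my own rather than serde's.

```rust
#[derive(Debug, PartialEq)]
struct Human {
    name: String,
}

// Std-only rendition of the visit_map logic over already-decoded pairs.
fn human_from_entries<I>(entries: I) -> Result<Human, String>
where
    I: IntoIterator<Item = (String, String)>,
{
    let mut name: Option<String> = None;
    for (key, value) in entries {
        match key.as_str() {
            "name" => {
                // Seeing the same field twice is an error.
                if name.is_some() {
                    return Err("duplicate field `name`".to_string());
                }
                name = Some(value);
            }
            // Unknown keys are skipped, like the IgnoredAny branch.
            _ => {}
        }
    }
    // After the loop, every required field must have been seen.
    let name = name.ok_or_else(|| "missing field `name`".to_string())?;
    Ok(Human { name })
}
```

Feeding it an extra "age" entry alongside "name" still produces a Human, two "name" entries produce the duplicate error, and an empty input produces the missing field error.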
The Field types
The Field and related types look like this:
    enum Field {
        field0,
        ignore,
    }

    impl<'de> serde::Deserialize<'de> for Field {
        fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
        where
            D: serde::Deserializer<'de>,
        {
            Deserializer::deserialize_identifier(deserializer, FieldVisitor)
        }
    }

    struct FieldVisitor;

    impl<'de> serde::de::Visitor<'de> for FieldVisitor {
        type Value = Field;

        fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
            fmt::Formatter::write_str(formatter, "field identifier")
        }

        fn visit_u64<E>(self, value: u64) -> Result<Self::Value, E>
        where
            E: Error,
        {
            match value {
                0u64 => Ok(Field::field0),
                _ => Ok(Field::ignore),
            }
        }

        fn visit_str<E>(self, value: &str) -> Result<Self::Value, E>
        where
            E: Error,
        {
            match value {
                "name" => Ok(Field::field0),
                _ => Ok(Field::ignore),
            }
        }

        fn visit_bytes<E>(self, value: &[u8]) -> Result<Self::Value, E>
        where
            E: Error,
        {
            match value {
                b"name" => Ok(Field::field0),
                _ => Ok(Field::ignore),
            }
        }
    }
We have a very similar situation to our Human visitor here. Deserialize is implemented for the Field enum, and it just passes on to deserialize_identifier, passing the FieldVisitor along.
This FieldVisitor expects the deserializer to call methods on it that 'look like' field identifiers. So it implements visit_str and visit_bytes, both of which see if the deserialized value looks like one of our fields. If it doesn't, the field gets deserialized to the special Field::ignore variant.
There is also the visit_u64 method, which allows the field name to be the number of the field; zero in our case.
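The three visit methods boil down to one question: does this identifier name one of our fields? As a std-only sketch, we can fold them into a single function. The Identifier enum and classify are hypothetical names of mine that stand in for the three calls a format could make; serde itself keeps them as separate methods.

```rust
#[derive(Debug, PartialEq)]
enum Field {
    Name,
    Ignore,
}

// How a format might identify a field; a stand-in for the three visit_* calls.
enum Identifier<'a> {
    Index(u64),
    Str(&'a str),
    Bytes(&'a [u8]),
}

fn classify(id: Identifier<'_>) -> Field {
    match id {
        // Position of the field within the struct.
        Identifier::Index(0) => Field::Name,
        // Field name as text.
        Identifier::Str("name") => Field::Name,
        // Field name as raw bytes.
        Identifier::Bytes(b"name") => Field::Name,
        // Anything else is not an error, it is just skipped later.
        _ => Field::Ignore,
    }
}
```

So classify(Identifier::Str("name")) and classify(Identifier::Index(0)) both land on Field::Name, while classify(Identifier::Str("age")) gives Field::Ignore. Which of the three forms actually turns up depends on the data format.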
Conclusion
This was a brief look into how serde deserializes data into values. If you would like more detailed information about this, let me know via whatever medium you found this on.
Some things I think I would like to expand upon are:
- How optional types, nested structures, enums, bytes, and strings are handled.
- How to write a deserializer for a data format.
- How borrowing from underlying data is provided.
Thanks for reading!