One of the first techniques taught in a programming languages course is how to inductively express a language using context-free grammars. Although this technique is familiar to many of us with a computer science background, I have found this technique useful for enumerating all of the possible types in a type system, e.g. Avro.
Avro is a serialization system that is common in the Hadoop ecosystem. If you have written MapReduce jobs or Spark jobs, there is a high likelihood that your data is serialized according to the Avro specification. Avro specifies how data should be serialized as binary data. In other words, it specifies a type system. With the existence of user-defined types, the set of all concrete Avro types is infinite in size! Is there a more tangible, finite representation of this set?
We explore the answer to this question by attempting to represent this set in a finite form using context-free grammars.
Avro specification
To begin our quest, we refer to the current [Schema Declaration][1] section of the Avro Specification for version 1.8.1.
A schema is a specification of how the data is formatted.
The schema declaration section is straightforward to read; however, it is not explicit about the the exhaustive set of types that are possible in Avro. To construct an inductive representation of types, we distill this specification into a context-free grammar that describes the types as they are specified.
EBNF grammar notation
For notation, we use a variant of EBNF specified by W3. To summarize our use of the notation:
NONTERMINAL
symbols are uppercase.terminal
symbols are lowercase.=
declares a production.|
declares a choice in a production.*
declares that the preceding symbol may appear zero or more times.(x)
declares a grouping of the metavariablex
.
As a simple example, the following grammar describes numeric values with a
single literal 0
.
NUM_VALUE
= 0
| succ NUM_VALUE
Using this notation, we may now proceed to describe our context-free grammar for Avro types.
Grammar for Avro types
Our goal is to recognize every possible concrete Avro type with this context-free grammar.
First, let AVRO
be the start symbol that describes any Avro type. There are
two classes of types according to the specification: primitive types and
complex types. Let PRIMITIVE
be the nonterminal that describes the set of
primitive types and let COMPLEX
be the nonterminal that describes the set of
complex types. Our current grammar is as follows.
AVRO
= PRIMITIVE
| COMPLEX
From here, we enumerate the primitive and complex type classes.
Primitive types
Starting with the primitive type class, we know that a primitive type must be
one of null
, boolean
, int
, long
, float
, double
, bytes
, or
string
. We represent these concrete types as terminals.
PRIMITIVE
= null
| boolean
| int
| long
| float
| double
| bytes
| string
The class of primitive types is trivial because it only contains conrete types. The class of complex types, however, contains parametric types, i.e. they refer to other Avro types.
Complex types
There are six complex types: record
, enum
, array
, map
, union
, and
fixed
. Of these, all but the fixed
type are parametric. To represent the
parametric types, we use a nonterminal in place of the type parameter. For
example, we use the notation enum of AVRO
where enum
is a terminal
and AVRO
is a nonterminal representing any Avro type. We also use curly brace
literals to avoid ambiguity in our grammar.
COMPLEX
= record of { (AVRO,)* AVRO }
| enum of AVRO
| array of AVRO
| map of AVRO
| union of { (AVRO,)* AVRO }
| fixed
Note that only the record
and union
types may be parameterized by multiple
Avro types.
From here, we have inductively defined all conrete Avro types. The set of concrete Avro types is infinite, but the grammar is finite.
Example
As an example, consider a union of a string and an array of integers. This would be represented as the following expression in our grammar:
union of { string, array of int }
Consequences
There are a few interesting consequences of describing the set of Avro types with a context-free grammar. The immediate consequence is that any Avro type can be recognized by a Push Down Automata. We can also use the pumping lemma to show that the set of Avro types cannot be described by a regular grammar, but this is beyond the scope of this post.
Applications
You may be curious why we would even bother to represent all Avro types with a context-free grammar. If you design serialization libraries, e.g. Avro.jl, then this may be useful to designing schema parsers. After all, context-free grammars are typically used in the context of parsing.