What is Avro?

Avro is a data serialisation and RPC system, like Protocol Buffers (protobuf) and Thrift. It relies on schemas, which are written in JSON; this is an advantage because most languages already have JSON libraries. Avro was originally developed by Doug Cutting to provide data serialisation and data exchange services for Hadoop. It has since evolved to serve technologies beyond Hadoop.

In principle, Avro can be used from any language, and it already has APIs for PHP, Java, Perl, Python, C, C++, C#, Go, Haskell and Ruby.

Where Avro shines

Compared with similar systems (protobuf, Thrift, MessagePack), these are the areas where Avro differs:

  • Dynamic typing:

    Serialisation and deserialisation can be done without code generation. Code generation is also available for statically typed languages as an optional optimisation.

  • Schema evolution:

    Avro requires a schema whenever data is written or read. Most interestingly, you can use different schemas for serialisation and deserialisation, and Avro will handle the missing, extra or modified fields.

  • Untagged data:

    Other serialisation systems, like protobuf, tag each field in the encoded data. Because Avro always has the schema available alongside the binary data, each datum can be written without that per-field overhead. The result is a more compact encoding and faster data processing.
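
The last two points can be made concrete with a small sketch. This is not the Avro library: it is a hand-rolled toy, assuming records whose fields are all ints, that writes values in schema order using the zigzag variable-length encoding Avro uses for int and long, and then resolves the decoded fields by name against a different reader schema. The names POINT_V1, POINT_V2, encode_record and decode_record are made up for illustration.

```python
def encode_int(n):
    """Zigzag + variable-length encoding, as Avro uses for int and long."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_int(buf, pos):
    """Read one encoded int starting at pos; return (value, next position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:  # continuation bit clear: last byte of this value
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos  # undo the zigzag step


def encode_record(schema, datum):
    # Fields are written in schema order, with no names or tags in between.
    return b"".join(encode_int(datum[f["name"]]) for f in schema["fields"])


def decode_record(writer, reader, buf):
    # Decode using the writer's field order...
    pos = 0
    written = {}
    for f in writer["fields"]:
        written[f["name"]], pos = decode_int(buf, pos)
    # ...then resolve by name against the reader's schema.
    result = {}
    for f in reader["fields"]:
        if f["name"] in written:
            result[f["name"]] = written[f["name"]]
        else:
            result[f["name"]] = f["default"]  # field added after the data was written
    return result


# Hypothetical schemas: V2 drops "y" and adds "z" with a default.
POINT_V1 = {"type": "record", "name": "Point", "fields": [
    {"name": "x", "type": "int"},
    {"name": "y", "type": "int"},
]}
POINT_V2 = {"type": "record", "name": "Point", "fields": [
    {"name": "x", "type": "int"},
    {"name": "z", "type": "int", "default": 0},
]}

payload = encode_record(POINT_V1, {"x": 3, "y": -2})
print(payload)                                     # b'\x06\x03'
print(decode_record(POINT_V1, POINT_V2, payload))  # {'x': 3, 'z': 0}
```

The payload is two bytes, with no field names or tags in it. The reader drops y, which it does not know, and fills z from its default. A real implementation would report an error when a reader field has no default and is missing from the writer's schema; this sketch would simply fail with a KeyError.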

Why you need Avro (or serialisation libraries generally):

  • You save storage space. Uber drastically reduced their storage needs by implementing data serialisation and compression.
  • It significantly reduces the amount of time spent processing data.
  • This translates to less money spent on provisioning hardware.

Data types in Avro

Avro supports a range of data types, which are either primitive, complex or logical.

Primitive types:

  1. null : no value
  2. boolean : a binary value
  3. int : 32-bit signed integer
  4. long : 64-bit signed integer
  5. float : single precision (32-bit) IEEE 754 floating-point number
  6. double : double precision (64-bit) IEEE 754 floating-point number
  7. bytes : sequence of 8-bit unsigned bytes
  8. string : Unicode character sequence

Complex Types:

Avro also provides complex data types, which are mostly built from primitive types. The complex types are:

  1. Records:

    A record is a collection of named fields, each with its own type. A record is comparable to a JSON object or a dictionary in Python.

    {
        "type":"record",
        "name":"Point",
        "fields":[
            {
                "name":"x",
                "type":"int"
            },
            {
                "name":"y",
                "type":"int"
            }
        ]
    }
  2. Enum:

    An enum is a fixed set of named symbols.

    {
        "type":"enum",
        "name":"Suit",
        "symbols":[
            "SPADES",
            "HEARTS",
            "DIAMONDS",
            "CLUBS"
        ]
    }
  3. Arrays:

    {"type": "array", "items": "int"}
  4. Maps:

    Map keys are assumed to be strings. For example:

    {"type": "map", "values": "long"}
  5. Unions:

    Unions are represented using JSON arrays.

    For example, ["null", "string"] declares a schema whose value may be either null or a string.

  6. Fixed:

    This data type is used to declare a fixed-size field that can be used for storing binary data. It has name and size as attributes: name holds the name of the type, and size holds the number of bytes in each value.
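
To see how these types constrain actual values, here is a rough sketch of a checker that tests whether a Python value fits a schema. It is a hand-written illustration using only plain dicts and lists, not the API of any Avro library; the function name matches and the MD5 schema are assumptions made for the example, and references to named types are not handled.

```python
def matches(schema, datum):
    """Return True if datum fits the schema (toy check, not the Avro library)."""
    if isinstance(schema, list):  # union: any one branch may match
        return any(matches(branch, datum) for branch in schema)
    if isinstance(schema, str):  # primitive type name
        if schema == "null":
            return datum is None
        if schema == "boolean":
            return isinstance(datum, bool)
        if schema in ("int", "long"):
            bits = 32 if schema == "int" else 64
            # bool is a subclass of int in Python, so exclude it explicitly
            return (isinstance(datum, int) and not isinstance(datum, bool)
                    and -(1 << (bits - 1)) <= datum < (1 << (bits - 1)))
        if schema in ("float", "double"):
            return isinstance(datum, float)
        if schema == "bytes":
            return isinstance(datum, bytes)
        if schema == "string":
            return isinstance(datum, str)
        return False  # unknown name (named type references are not supported here)
    kind = schema["type"]
    if kind == "record":
        return isinstance(datum, dict) and all(
            f["name"] in datum and matches(f["type"], datum[f["name"]])
            for f in schema["fields"])
    if kind == "enum":
        return datum in schema["symbols"]
    if kind == "array":
        return isinstance(datum, list) and all(
            matches(schema["items"], d) for d in datum)
    if kind == "map":
        return isinstance(datum, dict) and all(
            isinstance(k, str) and matches(schema["values"], v)
            for k, v in datum.items())
    if kind == "fixed":
        return isinstance(datum, bytes) and len(datum) == schema["size"]
    return matches(kind, datum)  # e.g. {"type": "string"}


# Hypothetical fixed schema: 16 bytes, enough for an MD5 digest.
MD5 = {"type": "fixed", "name": "md5", "size": 16}

print(matches(["null", "string"], None))                  # True
print(matches(["null", "string"], 5))                     # False
print(matches({"type": "map", "values": "long"}, {"a": 1}))  # True
print(matches(MD5, b"\x00" * 16))                         # True
```

The union case is the one worth noting: a value is accepted as soon as any branch of the JSON array accepts it, which is why ["null", "string"] is the usual way to express an optional string.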

Defining a schema

As indicated, Avro schemas are defined with JSON. The schema represents the data to be serialised using Avro data types.

Creating a schema for this object:

{
    "products":[
        {
            "id":1,
            "name":"A green door",
            "price":12.50,
            "tags":[
                "home",
                "green"
            ]
        },
        {
            "id":1,
            "name":"A green door",
            "price":12.50,
            "tags":[
                "home",
                "green"
            ]
        },
        {
            "id":1,
            "name":"A green door",
            "price":12.50,
            "tags":[
                "home",
                "green"
            ]
        }
    ]
}

would result in :

{
    "namespace":"com.idarlington.avro",
    "type":"record",
    "name":"stream",
    "fields":[
        {
            "name":"products",
            "type":{
                "type":"array",
                "items":{
                    "name":"product",
                    "type":"record",
                    "fields":[
                        {
                            "name":"id",
                            "type":"int"
                        },
                        {
                            "name":"name",
                            "type":"string"
                        },
                        {
                            "name":"price",
                            "type":"float"
                        },
                        {
                            "name":"tags",
                            "type":{
                                "type":"array",
                                "items":"string"
                            }
                        }
                    ]
                }
            }
        }
    ]
}
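
Since the schema is plain JSON, any JSON library can read it. As a quick sanity check, the sketch below loads a compact copy of the schema with Python's standard json module and lists every leaf field with its type. The describe function and its path notation (with [] marking array items) are made up for this example and are not part of Avro.

```python
import json

# Compact copy of the schema above, with "int" as the type of id.
SCHEMA_JSON = """
{
  "namespace": "com.idarlington.avro", "type": "record", "name": "stream",
  "fields": [{
    "name": "products",
    "type": {"type": "array", "items": {
      "name": "product", "type": "record",
      "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "price", "type": "float"},
        {"name": "tags", "type": {"type": "array", "items": "string"}}
      ]}}
  }]
}
"""


def describe(schema, path=""):
    """Yield (path, type name) pairs for every leaf field in a nested schema."""
    if isinstance(schema, str):  # primitive type name
        yield path, schema
    elif isinstance(schema, list):  # union: reported as a single leaf
        yield path, "union"
    elif schema["type"] == "record":
        for field in schema["fields"]:
            child = path + "." + field["name"] if path else field["name"]
            yield from describe(field["type"], child)
    elif schema["type"] == "array":
        yield from describe(schema["items"], path + "[]")
    else:  # enum, map, fixed: reported by their type name
        yield path, schema["type"]


for field_path, type_name in describe(json.loads(SCHEMA_JSON)):
    print(field_path, ":", type_name)
# products[].id : int
# products[].name : string
# products[].price : float
# products[].tags[] : string
```

Comparing this listing with the sample object is a cheap way to spot a field that was forgotten or given the wrong type.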

To validate and test your Avro schema, you can check this out.

My next post will be on serialising and deserialising with Avro. If you like this post, please share ♥.