Apache Avro

Developer(s)	Apache Software Foundation
Stable release	1.8.2 / May 20, 2017
Repository	https://github.com/apache/avro
Type	remote procedure call framework
License	Apache License 2.0
Website	avro.apache.org

Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to the Hadoop services.

It is similar to Thrift and Protocol Buffers, but does not require running a code-generation program when a schema changes (unless desired for statically typed languages).

Apache Spark SQL can access Avro as a data source.^[1]

Avro Object Container File^[2]

An Avro Object Container File consists of:

A file header, followed by
one or more file data blocks.

A file header consists of:

Four bytes: ASCII 'O', 'b', 'j', followed by the version byte 1.
File metadata, including the schema definition.
The 16-byte, randomly generated sync marker for this file.
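
The header layout above can be read with a few lines of Python 3. This is a minimal sketch, not the Avro library's own code. It assumes the metadata is stored as an Avro map from string keys to byte values, with counts and lengths written as zig-zag variable-length longs, as the Avro specification describes. The helper names (read_long, read_bytes, read_header) and the hand-built sample header are hypothetical, chosen only for illustration.

```python
import io

MAGIC = b"Obj\x01"  # ASCII 'O', 'b', 'j', then the byte 1


def read_long(stream):
    """Decode one zig-zag, variable-length long from a binary stream."""
    shift = 0
    accum = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("truncated variable-length integer")
        byte = chunk[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this integer
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo the zig-zag mapping


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    return stream.read(read_long(stream))


def read_header(stream):
    """Return (metadata dict, 16-byte sync marker) from a container file header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a byte size follows, which we skip
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta, stream.read(16)


# Hand-built sample header (an assumption for illustration, with a dummy schema).
sample = (
    MAGIC
    + b"\x04"                       # map block with 2 entries
    + b"\x16avro.schema" + b"\x04{}"
    + b"\x14avro.codec" + b"\x08null"
    + b"\x00"                       # end of map
    + bytes(range(16))              # stand-in sync marker
)
meta, sync = read_header(io.BytesIO(sample))
print(sorted(meta), len(sync))
```

The same function could be pointed at a real file opened in binary mode; the "avro.schema" entry then holds the JSON schema text.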

For data blocks, Avro specifies two serialization encodings:^[3] binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.

Schema definition^[4]

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).

Simple schema example:

 {
   "namespace": "example.avro",
   "type": "record",
   "name": "User",
   "fields": [
      {"name": "name", "type": "string"},
      {"name": "favorite_number",  "type": ["int", "null"]},
      {"name": "favorite_color", "type": ["string", "null"]}
   ] 
 }
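
Because a schema is ordinary JSON, it can be inspected with any JSON parser before it is handed to an Avro library. The sketch below uses only Python's standard json module on the example schema above. The helper is_nullable is a hypothetical name introduced here, and treating a list-valued "type" as a union follows the union notation used in the example.

```python
import json

SCHEMA_JSON = """
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)


def is_nullable(field):
    """A field is nullable if its type is a union (a JSON list) containing "null"."""
    field_type = field["type"]
    return isinstance(field_type, list) and "null" in field_type


# The full name of a record is its namespace joined to its name.
full_name = schema["namespace"] + "." + schema["name"]
nullable_fields = [f["name"] for f in schema["fields"] if is_nullable(f)]

print(full_name)
print(nullable_fields)
```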

Serializing and deserializing

Data in Avro can be stored together with its schema, so a serialized item can be read without knowing the schema ahead of time.

Example serialization and deserialization code in Python^[5]

Serialization:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(open("user.avsc").read())  # need to know the schema to write

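# Open in binary mode: an Avro container file is not text.
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
# favorite_color is omitted for this record, so it is written as null.
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})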
writer.close()

File "users.avro" will contain the schema in JSON and a compact binary representation^[6] of the data:

$ od -c users.avro
0000000    O   b   j 001 004 026   a   v   r   o   .   s   c   h   e   m
0000020    a 272 003   {   "   t   y   p   e   "   :       "   r   e   c
0000040    o   r   d   "   ,       "   n   a   m   e   s   p   a   c   e
0000060    "   :       "   e   x   a   m   p   l   e   .   a   v   r   o
0000100    "   ,       "   n   a   m   e   "   :       "   U   s   e   r
0000120    "   ,       "   f   i   e   l   d   s   "   :       [   {   "
0000140    t   y   p   e   "   :       "   s   t   r   i   n   g   "   ,
0000160        "   n   a   m   e   "   :       "   n   a   m   e   "   }
0000200    ,       {   "   t   y   p   e   "   :       [   "   i   n   t
0000220    "   ,       "   n   u   l   l   "   ]   ,       "   n   a   m
0000240    e   "   :       "   f   a   v   o   r   i   t   e   _   n   u
0000260    m   b   e   r   "   }   ,       {   "   t   y   p   e   "   :
0000300        [   "   s   t   r   i   n   g   "   ,       "   n   u   l
0000320    l   "   ]   ,       "   n   a   m   e   "   :       "   f   a
0000340    v   o   r   i   t   e   _   c   o   l   o   r   "   }   ]   }
0000360  024   a   v   r   o   .   c   o   d   e   c  \b   n   u   l   l
0000400   \0 211 266   / 030 334   ˪  **   P 314 341 267 234 310   5 213
0000420    6 004   ,  \f   A   l   y   s   s   a  \0 200 004 002 006   B
0000440    e   n  \0 016  \0 006   r   e   d 211 266   / 030 334   ˪  **
0000460    P 314 341 267 234 310   5 213   6
0000471
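
The last lines of the dump can be decoded by hand. After the header's sync marker come the object count (004, i.e. 2), the block size (the "," byte, i.e. 22) and then the 22 bytes of record data. The Python 3 sketch below decodes those 22 bytes, copied from the dump, under the assumptions that longs and ints are zig-zag variable-length integers, strings are length-prefixed UTF-8, and a union is written as a branch index followed by the value. The function names are hypothetical and the field order is hard-coded from the User schema, so this illustrates the encoding rather than replacing DatumReader.

```python
import io


def read_long(stream):
    """Decode one zig-zag, variable-length integer."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_string(stream):
    """Decode a length-prefixed UTF-8 string."""
    length = read_long(stream)
    return stream.read(length).decode("utf-8")


def read_user(stream):
    """Decode one User record; fields appear in schema order."""
    name = read_string(stream)
    # Union ["int", "null"]: branch 0 is int, branch 1 is null.
    favorite_number = read_long(stream) if read_long(stream) == 0 else None
    # Union ["string", "null"]: branch 0 is string, branch 1 is null.
    favorite_color = read_string(stream) if read_long(stream) == 0 else None
    return {
        "name": name,
        "favorite_number": favorite_number,
        "favorite_color": favorite_color,
    }


# The 22 data bytes from the dump: \f A l y s s a \0 200 004 002, then
# 006 B e n \0 016 \0 006 r e d.
block = b"\x0cAlyssa\x00\x80\x04\x02" + b"\x06Ben\x00\x0e\x00\x06red"

stream = io.BytesIO(block)
users = [read_user(stream) for _ in range(2)]
for user in users:
    print(user)
```

Each record occupies 11 bytes here, which is where the compactness of the binary encoding comes from: field names are never repeated in the data, only in the schema.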

Deserialization:

from avro.datafile import DataFileReader
from avro.io import DatumReader

reader = DataFileReader(open("users.avro", "rb"), DatumReader())  # no need to know the schema to read
for user in reader:
    print(user)
reader.close()

Under Python 2 (hence the u'' string prefixes), this outputs:

{u'favorite_color': None, u'favorite_number': 256, u'name': u'Alyssa'}
{u'favorite_color': u'red', u'favorite_number': 7, u'name': u'Ben'}

Languages with APIs

Although any language could in principle use Avro, implementations with dedicated APIs exist for several languages.^[7]^[8]

Avro IDL

In addition to supporting JSON for type and protocol definitions, Avro includes experimental^[13] support for an alternative interface description language (IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++, Protocol Buffers and others.

References

[1] "3 Reasons Why In-Hadoop Analytics are a Big Deal - Dataconomy". dataconomy.com. April 21, 2016.
[2] "Apache Avro™ Specification: Object Container Files". avro.apache.org. Retrieved 2016-09-27.
[3] "Apache Avro™ Specification: Encodings". avro.apache.org. Retrieved 2016-09-27.
[4] "Apache Avro™ Getting Started (Python)". avro.apache.org. Retrieved 2016-06-16.
[5] "Apache Avro™ Getting Started (Python)". avro.apache.org. Retrieved 2016-06-16.
[6] "Apache Avro™ Specification: Data Serialization". avro.apache.org. Retrieved 2016-06-16.
[7] phunt. "GitHub - phunt/avro-rpc-quickstart: Apache Avro RPC Quick Start. Avro is a subproject of Apache Hadoop". GitHub. Retrieved 2016-04-13.
[8] "Supported Languages - Apache Avro - Apache Software Foundation". Retrieved 2016-04-21.
[9] "Avro: 1.5.1 - ASF JIRA". Retrieved 2016-04-13.
[10] "[AVRO-533] .NET implementation of Avro - ASF JIRA". Retrieved 2016-04-13.
[11] "Supported Languages". Retrieved 2016-04-13.
[12] "Native Haskell implementation of Avro". Thomas M. DuBuisson, Galois, Inc. Retrieved 2016-08-08.
[13] "Apache Avro 1.8.0 IDL". Retrieved 2016-04-13.

