Data Collection and Processing Made Easy

Data Data Everywhere

Application servers generate many different types of data:

  • App performance metrics
  • App business metrics
  • Server operations information
  • Log file messages
  • App error reports / tracebacks
  • App security / anomalous activity event reports

Data Data Everywhere

These various pieces of data are generally intended to be delivered to a number of different destinations and audiences.

  • App performance metrics: ops / dev dashboards
  • App business metrics: biz dashboards / query engines
  • Server operations information: ops dashboards / notifications
  • Log file messages: ops / dev parsing tools
  • App error reports: dev debugging tools
  • App security: ops / opsec / infrasec

Data Data Everywhere

Getting the data to each of these separate destinations usually requires a specific set of tools and technologies.

  • App performance metrics: statsd / graphite / etc.
  • App business metrics: analytics engines / map reduce tools / JS dashboard toolkits
  • Server operations information: nagios / zenoss
  • Log file messages: syslog / file systems / logstash
  • App error reports: arecibo / sentry
  • App security: CEF / CEP / arcsight

Hodge Podge

We've spoken to a number of folks, both inside Mozilla and at other well-known tech companies, and everyone is doing pretty much the same thing: wiring up a number of different open source and paid service provider options into one pseudo-integrated system held together with duct tape and baling wire.

And they're all spending more time, energy, and money on it than they'd like to be.

But Wait!

If everyone is putting more effort than they'd like into managing and integrating all of these separate services, then maybe it's a sign that the tools themselves should be more tightly integrated.

Core problem

There are many domain specific differences, but most of these requirements can be described as a need to perform the following steps:

  1. Acquire some generated data
  2. Process / collate said data for audience consumption
  3. Deliver processed data to appropriate destination


Heka is a tool that aims to help solve this problem in the general case. It is a framework for building systems that perform these tasks:

  1. Accept data from outside sources
  2. Convert data into a standard internal representation (i.e. `Message` objects)
  3. Determine the data's intended destination(s) and route it accordingly
  4. Perform any required "in flight" processing
  5. Deliver collated data to intended destination(s)
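The `Message` objects mentioned in step 2 can be sketched roughly as follows. This is an illustrative struct only, not Heka's actual definition; the field names are assumptions drawn from the matcher examples later in this document (Type, Logger, Severity, Payload, Fields).

```go
package main

import "fmt"

// Message is a hypothetical sketch of Heka's standard internal
// representation. Field names mirror those used by the message
// matcher examples below; the real struct differs.
type Message struct {
	Type     string
	Logger   string
	Severity int
	Payload  string
	Fields   map[string]string
}

func main() {
	m := Message{
		Type:    "counter",
		Logger:  "myapp",
		Payload: "1",
		Fields:  map[string]string{"name": "successes"},
	}
	fmt.Println(m.Type, m.Fields["name"])
}
```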


Heka achieves its goals (hopefully!) through a plugin architecture very much inspired by Logstash. There are four different types of plugins:

  1. Inputs
  2. Decoders
  3. Filters
  4. Outputs
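One way to picture the division of responsibility is as a set of Go interfaces, one per plugin type. These signatures are a sketch of the concept only; Heka's actual plugin interfaces are different.

```go
package main

import "fmt"

// Message stands in for Heka's internal representation.
type Message struct {
	Type    string
	Payload string
}

// Hypothetical interfaces illustrating the four plugin roles.
type Input interface {
	Run(out chan<- []byte) error // push raw data into the pipeline
}

type Decoder interface {
	Decode(raw []byte) (*Message, error) // raw bytes -> Message
}

type Filter interface {
	Process(m *Message) []*Message // crunch, collate, or drop
}

type Output interface {
	Deliver(m *Message) error // write to the outside world
}

// passThrough is a trivial Filter that forwards every message unchanged.
type passThrough struct{}

func (passThrough) Process(m *Message) []*Message { return []*Message{m} }

func main() {
	var f Filter = passThrough{}
	out := f.Process(&Message{Type: "applog", Payload: "hi"})
	fmt.Println(len(out), out[0].Type)
}
```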


Heka inputs receive data from the outside world, in any number of arbitrary ways:

  • Listening on a network (syslog, statsd, heka, http, ...)
  • Log file parsing
  • Periodically pulling data in from other systems


Heka decoders convert raw data received by an input into Message structs that can be processed by filters and outputs. Decoders can convert from various formats:

  • JSON
  • Protocol Buffers
  • MessagePack
  • Log messages (unstructured string data)
  • Log messages (structured data)
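For the JSON case, a decoder is conceptually just a function from raw bytes to a `Message`. The sketch below uses Go's standard `encoding/json` package; the JSON field names are illustrative, not Heka's actual wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message with illustrative JSON tags; not Heka's real schema.
type Message struct {
	Type    string `json:"type"`
	Logger  string `json:"logger"`
	Payload string `json:"payload"`
}

// decodeJSON converts a raw JSON payload into a Message,
// the job a JSON decoder plugin performs.
func decodeJSON(raw []byte) (*Message, error) {
	var m Message
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}
	return &m, nil
}

func main() {
	m, err := decodeJSON([]byte(`{"type":"applog","logger":"marketplace","payload":"hi"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Type, m.Logger)
}
```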


Heka filters process decoded messages. They can do any required crunching and collating:

  • Aggregation / roll-ups
  • Simple event processing / windowed stream operations
  • Parsing / extracting structured information from unstructured data


After a filter has processed a message, it can perform any of three steps:

  • Pass the original message through to an output unchanged
  • Generate a new message of any arbitrary type to be re-routed through the system or delivered directly to a specific output
  • Nothing (in cases where the filter watching the stream is to generate output messages only under specific circumstances)
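The three outcomes above can be illustrated with a toy filter. This is not a real Heka plugin; the types and routing rules are made up to show the pass-through / new-message / nothing behaviors.

```go
package main

import "fmt"

type Message struct {
	Type    string
	Payload string
}

// alertFilter demonstrates the three possible filter outcomes:
// pass "applog" messages through unchanged, roll "counter"
// messages up into a new summary message, and emit nothing
// for everything else.
func alertFilter(m Message) []Message {
	switch m.Type {
	case "applog":
		return []Message{m} // pass through unchanged
	case "counter":
		return []Message{{Type: "summary", Payload: "count+=" + m.Payload}} // new message
	default:
		return nil // nothing
	}
}

func main() {
	fmt.Println(len(alertFilter(Message{Type: "applog"})))
	fmt.Println(alertFilter(Message{Type: "counter", Payload: "1"})[0].Type)
	fmt.Println(len(alertFilter(Message{Type: "noise"})))
}
```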


Heka outputs write data to or trigger activity in the outside world. They perform tasks such as:

  • Writing to log files
  • Writing to databases
  • Writing to time series databases
  • Signaling an external notification system
  • Relaying data to upstream Heka aggregators
  • Sending data to external services


Routing of messages through the Heka system is handled using "message matchers". Filters and outputs use a simple, high performance grammar to specify what messages they are interested in processing or handling.

Type == "counter" && Payload == "1"
Type == "applog" && Logger == "marketplace"
Type == "alert" && (Severity==7 || Payload=="emergency")
Type == "myapp.metric" && Fields[name] == "successes"


The matching grammar supports regular expression matching on message contents, including match group extraction, i.e. filters and outputs can get access to the string data that caused the message routing match.

Type == "applog" && Payload =~ /^MARKER:/
Type == "applog" && Payload =~ /(?P<pl>.*)/

But you don't pay a performance penalty for regex or group matching unless it's actually used.

Writing plugins

Heka is written in Go, and all Heka plugins can be written in Go. Filter plugins, however, can also be written in a sandboxed scripting language, allowing dynamic loading and security / resource utilization protection. Heka currently provides a working Lua sandbox.

Road map / wish list

While the basic infrastructure for Heka is in place, the project is just getting started. Here are some features that are either on the road map or are under consideration:

  • Message delivery guarantees
  • Heka self-monitoring & reporting
  • Javascript sandbox
  • Pre-configured Heka "flavors"
  • HTTP interface for time series queries
  • App instrumentation
  • Moar plugins!!