Papertrail Knowledge Base

Log analytics with Hadoop and Hive

Get interactive SQL access to months of Papertrail log archives (using Hadoop and Hive), in 5-10 minutes, without any new hardware or software.

This quick start assumes basic familiarity with AWS. For step-by-step instructions or to customize, see Intro to Hadoop and Hive.



Start cluster and load logs

Use the elastic-mapreduce Ruby gem to start a 1-node cluster and run a Papertrail Hive script to load archives. Change the ELASTIC_MAPREDUCE_* variables to your AWS credentials. Change YOUR-ARCHIVE-BUCKET to your S3 archive bucket, like acmeinc-aws-logs.

IMPORTANT: This will start billing for an m1.small instance immediately and continue charging until you manually terminate the job. As of this writing, the fee is $0.075 (7.5 cents) per hour. See Terminate.

$ sudo gem install elastic-mapreduce
$ export ELASTIC_MAPREDUCE_ACCESS_ID="abc123"
$ export ELASTIC_MAPREDUCE_PRIVATE_KEY="def456ghi789"
$ export ELASTIC_MAPREDUCE_KEY_PAIR="my-ec2-key"
$ export ELASTIC_MAPREDUCE_KEY_PAIR_FILE="/path/to/my-ec2-key.pem"
$ elastic-mapreduce --create --alive --name "Papertrail log archives" \
    --num-instances 1 --instance-type m1.small \
    --hive-script s3n:// \
    --args "-d,INPUT=s3n://YOUR-ARCHIVE-BUCKET/papertrail/logs"


You'll receive a job ID like j-1234567890123. In about 5 minutes, obtain the EMR node's hostname and check its state. When the state is WAITING, SSH to it as hadoop with your AWS key:

bob@laptop $ elastic-mapreduce -j j-1234567890123 --describe
bob@laptop $ ssh -i ~/my-ec2-key.pem

Query logs

After logging in via SSH, run the hive query tool to perform queries:

hadoop@domU-12-34-56-78-90:~$ hive

hive> DESCRIBE events;
hive> SELECT message FROM events LIMIT 5;

Web interface

To access the Hadoop Web interface, start an SSH session which tunnels your port 9100 to the EMR node, then visit http://localhost:9100/ in a browser:

bob@laptop $ ssh -L 9100:localhost:9100 -i ~/my-ec2-key.pem


When finished, terminate the job via the AWS console or CLI:

$ elastic-mapreduce -j j-1234567890123 --terminate

Manipulate data

The script above loads the archives into a table called events, which has one column per archive field. This is best for most logs and works well for full-text ad-hoc search, such as:

SELECT * FROM events WHERE message LIKE '%some string%';

If some or all of your logs are in a key=value or JSON format, here are a few ways to denormalize them into columns.

key=value data

For log messages in key=value key2=value2 format, like those from scrolls, individual keys can be referenced as columns. To do this, run the elastic-mapreduce command from Start cluster and load logs, but change the --hive-script URL to s3n://

Papertrail will create an events_kv table where the message is a hash instead of a string. The log message path=/abc/def status=200 appcode=abc size=5310 could be queried with:

SELECT message['status'], message['appcode'] FROM events_kv;

The create_events_kv.q script assumes that = delimits keys from values and a space delimits key=value pairs from one another. These can be changed.
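As a sketch of how those delimiters appear in Hive DDL (the exact table definition lives in create_events_kv.q; the column names here are illustrative, not the script's actual contents):

```sql
-- Hypothetical excerpt: a map-typed message column where key=value pairs
-- are split on spaces and keys are split from values on "=".
CREATE TABLE events_kv (
    received_at string,
    message map<string,string>)
  ROW FORMAT DELIMITED
    COLLECTION ITEMS TERMINATED BY ' '   -- separates key=value pairs
    MAP KEYS TERMINATED BY '=';          -- separates a key from its value
```

To use different delimiters, change the two TERMINATED BY strings.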


JSON data

Hive has two popular ways of working with JSON:

More complex JSON: Hive-JSON-Serde

Here's an example of using this serializer/deserializer ("SerDe") to make an array, and a hash of arrays, queryable as native Hive columns:

CREATE TABLE json_nested_events (
    country string,
    languages array<string>,
    religions map<string,array<int>>)
  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

SELECT religions['catholic'][0] FROM json_nested_events;

For more, see README on GitHub, including Nested structures.

Simpler JSON: Lateral view

Hive supports a LATERAL VIEW which can act on a string column (like events.message from Start cluster and load logs). See Hive and JSON made simple for user-defined functions (UDFs) and examples.

Combine with json_tuple() to analyze arbitrary JSON attributes. For example, given the JSON log message:

`{ "status":200, "appcode":"abc" }`


SELECT received_at, message_view.* FROM events 
  LATERAL VIEW json_tuple(events.message, 'status', 'appcode') message_view AS status, appcode;

This uses an alias ("view") message_view, which is the parsed JSON in events.message. The query returns events.received_at alongside two JSON hash keys, exposed as result set columns status and appcode (named by the AS clause). Messages which are not valid JSON, or which are valid JSON but do not contain the requested keys, return NULL for those columns. They can be filtered out in a WHERE clause.
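For example, to keep only rows where the message parsed and contained a status key, the same query can filter on the view's column (a sketch built on the query above):

```sql
-- Keep only events whose message is valid JSON with a "status" key.
SELECT received_at, message_view.status, message_view.appcode FROM events
  LATERAL VIEW json_tuple(events.message, 'status', 'appcode') message_view AS status, appcode
WHERE message_view.status IS NOT NULL;
```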

More complex LATERAL VIEW examples

Nested JSON elements can be accessed the same way. For example, consider changeset (a hash) and ids (an array):

`{ "status":200, "appcode":"abc", "changeset":{ "name":"Sally", "address":"123 Home Row" }, "ids":[3,7,10] }`

Turn values from the nested changeset hash into columns:

SELECT received_at, message_view.status, changeset_view.* FROM events 
  LATERAL VIEW json_tuple(events.message, 'status', 'changeset') message_view AS status, changeset
  LATERAL VIEW json_tuple(message_view.changeset, 'name', 'address') changeset_view AS name, address;

Also explode the ids values into a column:

SELECT received_at, message_view.status, message_view.ids, changeset_view.*, ids_view.exploded_id FROM events 
  LATERAL VIEW json_tuple(events.message, 'status', 'changeset', 'ids') message_view AS status, changeset, ids
  LATERAL VIEW json_tuple(message_view.changeset, 'name', 'address') changeset_view AS name, address
  LATERAL VIEW explode(split(regexp_replace(message_view.ids, '^\\[|\\]$', ''), ',')) ids_view AS exploded_id;

Although the Hive-JSON-Serde might be a better fit for JSON this complex, it's still possible using LATERAL VIEW. For additional examples, see Hive plays well with JSON.

Everything else

Hive supports outputting results from one query into a second table, where it can be persisted for additional reporting. Use SELECT to generate useful results, then CREATE TABLE my_second_table .. followed by INSERT OVERWRITE TABLE my_second_table SELECT .. to persist them.
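A minimal sketch of that pattern, assuming a hypothetical summary table status_counts and the events_kv table from key=value data:

```sql
-- Hypothetical: persist per-status event counts for later reporting.
CREATE TABLE status_counts (status string, total bigint);

INSERT OVERWRITE TABLE status_counts
SELECT message['status'], count(*)
FROM events_kv
GROUP BY message['status'];
```

Once populated, status_counts can be queried repeatedly without re-scanning the raw archives.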

This can also be used to extract values from within strings. See Hive's functions, such as regexp_replace(), sum(), and parse_url_tuple.
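For instance, a hedged sketch that combines two of those functions, assuming a hypothetical duration field whose values look like "250ms":

```sql
-- Hypothetical: strip a trailing "ms" suffix with regexp_replace(),
-- cast to a number, and total the durations per appcode.
SELECT message['appcode'],
       sum(cast(regexp_replace(message['duration'], 'ms$', '') as bigint))
FROM events_kv
GROUP BY message['appcode'];
```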

Next steps

Something you're trying to accomplish? Just ask.