Log analytics with Hadoop and Hive
Get interactive SQL access to months of Papertrail log archives (using Hadoop and Hive), in 5-10 minutes, without any new hardware or software.
This quick start assumes basic familiarity with AWS. For step-by-step instructions or to customize, see Intro to Hadoop and Hive.
Requirements
- An Amazon Web Services account with an EC2 key pair and the Elastic MapReduce service activated
- Provide Papertrail with an S3 bucket for archives
Setup
Load logs
Use the
elastic-mapreduce Ruby gem to start a 1-node cluster. Change
the ELASTIC_MAPREDUCE_* variables to your AWS
credentials. Change YOUR-ARCHIVE-BUCKET to your S3
archive bucket, like
acmeinc-aws-logs.
IMPORTANT: This will start billing for an
m1.small instance
immediately and continue charging until you manually
terminate the job. As of this writing, the fee is $0.075
(7.5 cents) per hour. See Terminate.
$ sudo gem install elastic-mapreduce # see https://github.com/tc/elastic-mapreduce-ruby
$ export ELASTIC_MAPREDUCE_ACCESS_ID="abc123"
$ export ELASTIC_MAPREDUCE_PRIVATE_KEY="def456ghi789"
$ export ELASTIC_MAPREDUCE_KEY_PAIR="my-ec2-key"
$ export ELASTIC_MAPREDUCE_KEY_PAIR_FILE="/path/to/my-ec2-key.pem"
$ elastic-mapreduce --create --alive --name "Papertrail log archives" \
--num-instances 1 --instance-type m1.small \
--hive-script s3n://hive-emr.papertrailapp.com/create_events.q \
--args "-d,INPUT=s3n://YOUR-ARCHIVE-BUCKET/papertrail/logs"
You'll receive a job ID like j-1234567890123. In
about 5 minutes, obtain the EMR node's hostname and check its
state. When the state is WAITING, SSH to it as
hadoop with your AWS key:
bob@laptop $ elastic-mapreduce -j j-1234567890123 --describe
bob@laptop $ ssh -i ~/my-ec2-key.pem hadoop@ec2-1-2-3-4.compute-1.amazonaws.com
Query logs
After logging in via SSH, run the hive query tool
to perform queries:
hadoop@domU-12-34-56-78-90:~$ hive
hive> DESCRIBE events
hive> SELECT message FROM events LIMIT 5
Web interface
To access the Hadoop Web
interface, start an SSH session which tunnels your port
9100 to the EMR node, then visit
http://localhost:9100/ in a browser:
bob@laptop $ ssh -L 9100:localhost:9100 -i ~/my-ec2-key.pem hadoop@ec2-1-2-3-4.compute-1.amazonaws.com
Terminate
When finished, terminate the job via the AWS console or CLI:
$ elastic-mapreduce -j j-1234567890123 terminate
Manipulate data
The script above loads the archives in a table called
events, which has one column per archive field. This is
best for most logs and works well for full-text ad-hoc search, such
as:
SELECT * FROM events WHERE message LIKE '%some string%';
If some or all of your logs are in a key=value or
JSON format, here's a few ways to
denormalize columns.
key=value data
For log messages in key=value key2=value2 format,
like from scrolls,
individual keys can be referenced as columns. To do this, run the
elastic-mapreduce command in Setup, but change the --hive-script URL
to
s3n://hive-emr.papertrailapp.com/create_events_kv.q.
Papertrail will create an events_kv table where the
message is a hash instead of a string. The log message
path=/abc/def status=200 appcode=abc size=5310 could
be queried with:
select message['status'], message['appcode'] from events_kv;
The create_events_kv.q script assumes that
= delimits keys from values and (space) delimits
key=value pairs from one another. These can be changed.
JSON
Hive supports a LATERAL VIEW which can act on a
string column (like events.message from Setup). Combine with
json_tuple() to analyze arbitrary JSON attributes. For example,
given the JSON log message:
`{ "status":200, "appcode":"abc" }`
Run:
SELECT received_at, message_view.* FROM events
LATERAL VIEW json_tuple(events.message, 'status', 'appcode') message_view AS status, appcode;
This uses an alias ("view") message_view, which is
the parsed JSON in events.message. The query returns
the event.received_at alongside two JSON hash keys:
status, as result set column f1, and
appcode, as f2. Messages which are not
valid JSON or are valid JSON but do not contain the requested keys
will return NULL. They can be filtered in a
WHERE clause.
Nested JSON elements can be accessed the same way. For example,
consider changeset (a hash) and ids (an
array):
`{ "status":200, "appcode":"abc", "changeset":{ "name":"Sally", "address":"123 Home Row" }, "ids":[3,7,10] }`
Obtain one :
SELECT received_at, message_view.status, changeset_view.* FROM events_mixed
LATERAL VIEW json_tuple(events_mixed.message, 'status', 'changeset') message_view AS status, changeset
LATERAL VIEW json_tuple(message_view.changeset, 'name', 'address') changeset_view AS name, address;
For additional examples, see Hive plays well with JSON.
Hive can also import JSON using a special serializer/deserializer,
but using LATERAL VIEW is easier and often more
flexible.
Everything else
Hive supports outputting results from one query into a second
table, where it can be persisted for additional reporting. Use
SELECT to generate useful results, then CREATE TABLE
my_second_table .. followed by INSERT OVERWRITE TABLE
my_second_table SELECT .. to persist them.
This can also be used to extract values from within strings. See
Hive's
functions, such as regexp_replace(),
sum(), and parse_url_tuple.
Next steps
Something you're trying to accomplish?