Permanent log archives

Introduction

Each night, Papertrail automatically uploads log messages and metadata to Amazon’s cloud storage service, S3. Papertrail stores one copy in our S3 bucket, and optionally, also stores a copy in a bucket that you provide. You have full control of the optional archive in your own bucket, since it’s tied to your AWS account.

Want to set up S3? Jump to Automatic S3 Archive Export.

Format

Check the account’s archives to see whether it has hourly or daily archives for a particular date. Archive frequency may change when account log volume changes.

Each line contains one message. The fields appear in this order:

id
generated_at
received_at
source_id
source_name
source_ip
facility_name
severity_name
program
message

For a longer description of each column, see Log Search API: Responses.

Here are the fields for an example message:

50342052
2011-02-10 00:19:36 -0800
2011-02-10 00:19:36 -0800
42424
mysystem
208.122.34.202
User
Info
testprogram
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor

Archives are in tab-separated values (.tsv) format, so a line actually looks like this:

50342052\t2011-02-10 00:19:36 -0800\t2011-02-10 00:19:36 -0800\t42424\tmysystem\t208.122.34.202\tUser\tInfo\ttestprogram\tLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor

The TSV files are gzip-compressed (.gz) to reduce size. Gzip compression is supported by standard UNIX tools and by third-party Windows tools such as 7-Zip and WinZip.
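To take a quick look inside an archive without writing the uncompressed file to disk, stream it through gzip. For example, this shows the first three lines of the daily archive used in the examples below:

$ gzip -cd 2016-10-31.tsv.gz | head -3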

Usage example

Show identical messages

Here’s how to extract the message (field 10) from the archive file 2016-10-31.tsv.gz, then show the messages sorted by the number of identical occurrences (duplicates).

$ gzip -cd 2016-10-31.tsv.gz | cut -f10 | sort | uniq -c | sort -n

Windows PowerShell can do the same thing, with 7-Zip’s help. In this example, [9] still selects the message (field 10), due to zero-based indexing.

> 7z x -so 2016-10-31.tsv.gz | %{($_ -split '\t')[9]} | group | sort count,name | ft count,name -wrap

Show similar messages

The most common messages often differ only by a random number, IP address, or message suffix. These near-duplicates can be discovered with a bit more work.

Here’s how to extract the sender, program, and message (fields 5, 9, and 10) from all archive files, squeeze whitespace and digits, truncate after eight words, and sort the result by the number of identical occurrences (duplicates).

$ gzip -cd *.tsv.gz  | # extract all archives
    cut -f 5,9-      | # sender, program, message
    tr -s '\t' ' '   | # squeeze whitespace
    tr -s 0-9 0      | # squeeze digits
    cut -d' ' -f 1-8 | # truncate after eight words
    sort | uniq -c | sort -n

or, as a one-liner:

$ gzip -cd *.tsv.gz | cut -f 5,9- | tr -s '\t' ' ' | tr -s 0-9 0 | cut -d' ' -f 1-8 | sort | uniq -c | sort -n

Once again, Windows PowerShell can do the same thing, with 7-Zip’s help.

> 7z x -so *.tsv.gz                      | # extract all archives
    %{($_ -split '\t')[4,8,9] -join ' '} | # sender, program, message
    %{$_ -replace ' +',' '}              | # squeeze whitespace
    %{$_ -replace '[0-9]+','0'}          | # squeeze digits
    %{($_ -split ' ')[0..7] -join ' '}   | # truncate after eight words
    group | sort count,name | ft count,name -wrap

or, as a one-liner:

> 7z x -so *.tsv.gz | %{($_ -split '\t')[4,8,9] -join ' '} | %{$_ -replace ' +',' '} | %{$_ -replace '[0-9]+','0'} | %{($_ -split ' ')[0..7] -join ' '} | group | sort count,name | ft count,name -wrap

Downloading logs

In addition to being downloadable from the Archives page, archive files can be retrieved with your Papertrail HTTP API key, as part of the HTTP API. The URL format is simple and predictable:

  • Daily: https://papertrailapp.com/api/v1/archives/YYYY-MM-DD/download
  • Hourly: https://papertrailapp.com/api/v1/archives/YYYY-MM-DD-HH/download

Check the account’s archives to see whether it has hourly or daily archives for a particular date.
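To check from the command line, list the account's archives with the same archive-listing endpoint used by the bulk-download script below; each entry's filename shows whether it covers a day or an hour:

$ curl -sH "X-Papertrail-Token: YOUR-HTTP-API-KEY" https://papertrailapp.com/api/v1/archives.json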

Download a specific archive

Daily

Download the archive for 2016-09-24 (UTC) with:

$ curl --no-include -o 2016-09-24.tsv.gz -L -H "X-Papertrail-Token: YOUR-HTTP-API-KEY" \
    https://papertrailapp.com/api/v1/archives/2016-09-24/download

Hourly

Download the archive for 2016-09-24 at 14:00 UTC with:

$ curl --no-include -o 2016-09-24-14.tsv.gz -L -H "X-Papertrail-Token: YOUR-HTTP-API-KEY" \
    https://papertrailapp.com/api/v1/archives/2016-09-24-14/download

Download a large number of archives

Occasionally, it may be useful to pull down a large number of archives without needing to check the archive frequency. This cURL script will run on a variety of UNIX platforms and download archives between two dates:

curl -sH 'X-Papertrail-Token: YOUR-HTTP-API-KEY' https://papertrailapp.com/api/v1/archives.json |
  grep -o '"filename":"[^"]*"' | egrep -o '[0-9-]+' |
  awk '$0 >= "YYYY-MM-DD" && $0 < "YYYY-MM-DD" {
    print "output " $0 ".tsv.gz"
    print "url https://papertrailapp.com/api/v1/archives/" $0 "/download"
  }' | curl -#fLH 'X-Papertrail-Token: YOUR-HTTP-API-KEY' -K-

Enter the start and end dates in the third line of the script. For example, to download archives for September and October 2017, that line would read:

awk '$0 >= "2017-09-01" && $0 < "2017-11-01" {
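The awk command emits a curl configuration file on standard output (one output/url pair per archive), which the final curl reads via -K-. For a single date, the generated configuration looks like:

output 2017-09-01.tsv.gz
url https://papertrailapp.com/api/v1/archives/2017-09-01/download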

Download a single archive using date

It’s also possible to use the date tool to run regular automated requests or one-off bulk downloads from a short time period.

Daily

For example, to download yesterday’s daily archive on a Linux host, run:

$ curl --silent --no-include -o `date -u --date='1 day ago' +%Y-%m-%d`.tsv.gz -L \
    -H "X-Papertrail-Token: YOUR-HTTP-API-KEY" \
    https://papertrailapp.com/api/v1/archives/`date -u --date='1 day ago' +%Y-%m-%d`/download

This could be run daily as a cron job to fetch local copies of the archives.
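For example, a crontab entry along these lines (script path hypothetical) would fetch yesterday's archive at 01:30 UTC each day, leaving time for archive processing to finish. Wrapping the curl command in a script also sidesteps cron's special treatment of % characters, which would otherwise need escaping in the date format string:

30 1 * * * /usr/local/bin/fetch-papertrail-archive.sh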

Hourly

If Papertrail generates hourly archives for your account, download the archive for 16 hours ago with:

$ curl --silent --no-include -o `date -u --date='16 hours ago' +%Y-%m-%d-%H`.tsv.gz -L \
    -H "X-Papertrail-Token: YOUR-HTTP-API-KEY" \
    https://papertrailapp.com/api/v1/archives/`date -u --date='16 hours ago' +%Y-%m-%d-%H`/download

Command syntax

As you can see, there’s a lot going on in those cURL one-liners. The main parts are:

  • -o `date -u --date='1 day ago' +%Y-%m-%d`.tsv.gz: Downloads the archive to a file with yesterday’s date (UTC) in the format YYYY-MM-DD.tsv.gz
  • -H "X-Papertrail-Token: YOUR-HTTP-API-KEY": Authenticates the request via your API token, found under your profile.

Download multiple archives using date

Daily

To download multiple daily archives in one command, use:

$ seq 1 X | xargs -I {} date -u --date='{} day ago' +%Y-%m-%d | \
    xargs -I {} curl --progress-bar -f --no-include -o {}.tsv.gz \
    -L -H "X-Papertrail-Token: YOUR-HTTP-API-KEY" https://papertrailapp.com/api/v1/archives/{}/download

where X is the number of days + 1 that you want to download. For example, to guarantee 2 days, change X to 3; see the note under Command syntax below for details. To specify a start date (for example, 10 August 2017), combine the {} day ago specification with the start date:

$ seq 1 X | xargs -I {} date -u --date='2017-08-10 {} day ago' +%Y-%m-%d | \
    xargs -I {} curl --progress-bar -f --no-include -o {}.tsv.gz \
    -L -H "X-Papertrail-Token: YOUR-HTTP-API-KEY" https://papertrailapp.com/api/v1/archives/{}/download

Hourly

To download multiple hourly archives in one command, use:

$ seq 1 X | xargs -I {} date -u --date='{} hours ago' +%Y-%m-%d-%H | \
    xargs -I {} curl --progress-bar -f --no-include -o {}.tsv.gz \
    -L -H "X-Papertrail-Token: YOUR-HTTP-API-KEY" https://papertrailapp.com/api/v1/archives/{}/download

where X is the number of hours + 1 that you want to download. For example, to guarantee 8 hours, change X to 9.

To specify a start date (for example, 1 November 2017), combine the {} hours ago specification with the start date:

$ seq 1 X | xargs -I {} date -u --date='2017-11-01 {} hours ago' +%Y-%m-%d-%H | \
    xargs -I {} curl --progress-bar -f --no-include -o {}.tsv.gz \
    -L -H "X-Papertrail-Token: YOUR-HTTP-API-KEY" https://papertrailapp.com/api/v1/archives/{}/download

Command syntax

The seq 1 X command generates day or hour offsets, starting with 1 (1 day or hour ago) because the current day or hour will not yet have an archive. Since archive processing takes time, near the beginning of the hour or UTC day the previous day or hour may not have an archive yet either (and will return 404 when requested). To guarantee that you get at least X days/hours, replace X with the number of days/hours + 1.
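As a quick illustration of what the date pipeline alone produces, running the daily version on 2017-11-04 (UTC) with X set to 3 prints:

$ seq 1 3 | xargs -I {} date -u --date='{} day ago' +%Y-%m-%d
2017-11-03
2017-11-02
2017-11-01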

macOS

Using macOS and seeing date: illegal option -- -? macOS ships BSD date, which doesn't support the GNU-style --date option. In the examples above, change:

  • Daily archives: change --date='{} day ago' to -v-{}d
  • Hourly archives: change --date='{} hours ago' to -v-{}H

This option format can't take a simple YYYY-MM-DD start date the way the GNU date arguments can, but slightly more complex uses of date achieve the same result without requiring too much time math:

Start N days ago, go back by hours

seq 1 X | xargs -I {} date -v-Nd -v-{}H +%Y-%m-%d-%H | \
    xargs -I {} curl --progress-bar -f --no-include -o {}.tsv.gz \
    -L -H "X-Papertrail-Token: YOUR-HTTP-API-KEY" https://papertrailapp.com/api/v1/archives/{}/download

Replace X with the number of hours to go back, and N with the number of days ago to start.

Start at a past date, go back by hours

seq 1 X | xargs -I {} date -ur `date -ju MMDDHHmm +%s` -v-{}H +%Y-%m-%d-%H | \
    xargs -I {} curl --progress-bar -f --no-include -o {}.tsv.gz \
    -L -H "X-Papertrail-Token: YOUR-HTTP-API-KEY" https://papertrailapp.com/api/v1/archives/{}/download

Replace X with the number of hours to go back, and MMDDHHmm with the desired start date (mm will always be 00).
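For example, with MMDDHHmm replaced by 11010000 (1 November, 00:00, in the current year) and X by 3, the date pipeline alone generates the three hours leading up to that start time (output shown assuming the year is 2017):

$ seq 1 3 | xargs -I {} date -ur `date -ju 11010000 +%s` -v-{}H +%Y-%m-%d-%H
2017-10-31-23
2017-10-31-22
2017-10-31-21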

Searching

To find an entry in a particular archive, use commands such as:

$ gzip -cd 2016-02-25.tsv.gz | grep Something

$ gzip -cd 2016-02-25.tsv.gz | grep Something | cut -f5,9,10 | tr '\t' ' '

The files are generic gzipped TSV files, so after un-gzipping them, anything capable of working with a text file can work with them.

If the downloaded files have file names such as 2013-08-18.tsv.gz (the default), multiple archives can be searched using:

$ gzip -cd 2013-08-* | grep SEARCH_TERM
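Where zgrep is available (it ships alongside gzip on most platforms), the decompression and search can be combined into one step:

$ zgrep SEARCH_TERM 2013-08-*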

Syncing

To transfer multiple archives from Papertrail’s S3 bucket to a custom bucket, use the relevant download command mentioned above, and then upload them to another bucket using:

$ s3cmd put --recursive path/to/archives/ s3://bucket.name/the/path/

where path/to/archives/ is the local directory where all the archives are stored, and bucket.name/the/path/ is the bucket and path of the target S3 storage location.
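If you use the AWS CLI rather than s3cmd, a sketch of the equivalent upload (same local path and bucket placeholders) is:

$ aws s3 sync path/to/archives/ s3://bucket.name/the/path/

aws s3 sync only uploads files that are missing or changed in the target, which is convenient when the same local directory is re-uploaded after each download run.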

S3 Bucket Setup

See Automatic S3 Archive Export.