Moving Logs from Acquia into Elasticsearch

Written by Mike, May 9, 2017

We started offloading web logs from Acquia into Elasticsearch, and let me tell you, it's been amazing!

Imagine being able to query your Apache access log for HTTP status 500 errors that occurred in the last 4 hours, or retrieving the most common 404 paths. Maybe you're more visual and want to see pie charts of bytes downloaded per country, or histograms of status codes per half hour. Perhaps you want to trace the footprints a certain IP address made on a particular domain, then across multiple domains.

With your logs in Elasticsearch, the ways you can consume the data are limited only by your imagination.

Marji Cermak's (@cermakm) 2016 DrupalCon session was the catalyst that inspired us to get our hands dirty with the Elastic Stack. It's a great session to watch for context around what I'm about to show you.

Elastic Stack (AKA ELK Stack)

Elasticsearch: A RESTful distributed document store and analytics engine.
Logstash: A processing pipeline which, in our case, ingests data from access logs and sends it to Elasticsearch.
Kibana: A GUI for running API calls, queries, and visualizing data from Elasticsearch.

Requirements

A server to run cron for rsyncing logs and running Logstash
An Elasticsearch instance (this could be on the same server; we use a hosted Elastic Cloud instance, which offers a free trial)
SSH access to the web server for rsync

We opted to host our Elasticsearch instance with elastic.co, although you can spin up your own instance for free. Elasticsearch setup is beyond the scope of this article. Kibana is included with a hosted instance; if you spin up your own, you will need to install and run Kibana as well.

For the Logstash/rsync server, we're running an AWS instance of Ubuntu 16.04 LTS.

Logstash Installation

1. Download Logstash and extract it to /opt/logstash.

2. Set up Logstash to run as a service with Supervisord.

Create a file: /etc/supervisor/conf.d/logstash.conf

Add the contents below to this file:

[program:logstash]
command=/opt/logstash/bin/logstash -f /etc/logstash/conf.d
autorestart=true
environment=HOME="/root"

Don't start it yet, but Logstash can now be managed through Supervisor, which is itself controlled with systemctl, e.g. systemctl start supervisor

Log Directory, Rsync, and Cron

In order to rsync logs to your server, you will need SSH access to your host. Typically you will generate SSH keys for this server, then add the key to your host. The rest of this article assumes you have this configured.

1. Create a directory to store your logs:

$ mkdir -p /opt/log/mysite

2. Create a file for rsync commands:

$ mkdir -p /opt/bin && touch /opt/bin/sync-logs.sh && chmod +x /opt/bin/sync-logs.sh

This file will be executed via cron to pull down the latest log file from the web host.

The contents of this file will vary depending on your host. The following is an example of an rsync command formatted with Acquia's path structure.

rsync -avz [email protected]:/mnt/log/sites/mysite.prod/logs/srv-1234/access.log /opt/log/mysite/access.log
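Since cron will fire this every minute, a slow transfer could overlap with the next run. Here is a rough sketch of how the script could guard against that. It assumes flock (from util-linux, normally present on Ubuntu) is available, and the sync_log function name is just something I made up for illustration; the first argument stands in for your own source, i.e. the user, host, and path portion of the command above.

```shell
#!/usr/bin/env bash
# Sketch of what /opt/bin/sync-logs.sh could contain (not Acquia's tooling).
# sync_log REMOTE DEST
#   REMOTE: rsync source for your site (user, host, and log path)
#   DEST:   local file that Logstash will read

sync_log() {
  local remote="$1" dest="$2"
  mkdir -p "$(dirname "$dest")"
  (
    # Take a non-blocking lock on fd 9 so a slow transfer is never
    # overlapped by the next cron run.
    if ! flock -n 9; then
      echo "previous sync still running, skipping" >&2
      exit 0
    fi
    # -a preserves attributes, -z compresses in transit,
    # --timeout gives up on a stalled connection after 60 seconds.
    rsync -az --timeout=60 "$remote" "$dest"
  ) 9>"${dest}.lock"
}

# Example call (replace the first argument with your own source):
# sync_log "$REMOTE_LOG" /opt/log/mysite/access.log
```

The lock file sits next to the log as access.log.lock, which Logstash ignores as long as its input path names the log file exactly.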

3. Test the executable

$ /opt/bin/sync-logs.sh

You should now see the log file in /opt/log/mysite/access.log

4. Set up cron to execute this file every minute

*/1 * * * * /opt/bin/sync-logs.sh > /dev/null 2>&1

Great! Now we have the latest log file available for Logstash to parse and push to Elasticsearch.

Configure Logstash

Now we can create the Logstash configuration file.

$ mkdir -p /etc/logstash/conf.d
$ touch /etc/logstash/conf.d/mysite.conf

The example config below works with Acquia's log format. Acquia adds custom elements to each log line, so this pattern will not match a standard Apache log as-is. You can adapt the filter section to parse other formats.

Be sure to replace the Elasticsearch credentials with your own in the output section below.

input {
  file {
    path => "/opt/log/mysite/access.log"
    start_position => "beginning"
    type => "access"
    add_field => {
      "project" => "mysite"
      "env" => "prod"
      "xhost" => "srv-1234.devcloud.hosting.acquia.com"
    }
  }
}

filter {
  if [type] == "access" {
    grok {
      match => [
        "message", "(?:%{IPORHOST:ip}|-) - (?:%{USER:auth}|-) \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\" %{NUMBER:response} (?:%{NUMBER:bytes}|-) \"%{DATA:referrer}\" \"%{DATA:agent}\" vhost=%{IPORHOST:vhost} host=%{IPORHOST:domain} hosting_site=%{WORD} pid=%{NUMBER} request_time=%{NUMBER:request_time} forwarded_for=\"(?:%{IPORHOST:forwarded_for}|)(?:, %{IPORHOST}|)(?:, %{IPORHOST}|)\" request_id=\"%{NOTSPACE:request_id}\""
      ]
    }
    mutate {
      update => { "host" => "%{xhost}" }
      replace => { "path" => "%{request}" }
    }
    if (![ip]) {
      mutate {
        add_field => {
          "ip" => "%{forwarded_for}"
        }
      }
    }
    geoip { source => "ip" }
    date {
      locale => "en"
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
      target => "@timestamp"
    }
    mutate {
      remove_field => [ "forwarded_for", "message", "request", "timestamp", "xhost" ]
    }
  }
}

output {
  if [type] == "access" {
    elasticsearch {
      hosts => [ "https://example.com:9200" ]
      user => "elastic"
      password => "secret"
      index => "logstash-access"
    }
  }
}

For multiple sites, simply add another input block for each. You can see I've added project and env fields. Those allow us to target queries to a specific environment, as exact domain and subdomain names may vary even within a single environment.
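Before starting the service, it can be worth checking that the config actually parses. A minimal sketch, assuming Logstash 5.x or later (where the --config.test_and_exit flag exists) and the install paths used in this article; validate_logstash_config is a helper name I invented for the example.

```shell
#!/usr/bin/env bash
# Ask Logstash to parse the pipeline config and exit without running it.
# Defaults match the paths used in this article; pass your own if they differ.

validate_logstash_config() {
  local logstash_bin="${1:-/opt/logstash/bin/logstash}"
  local conf_dir="${2:-/etc/logstash/conf.d}"
  if [ ! -x "$logstash_bin" ]; then
    echo "Logstash binary not found at $logstash_bin" >&2
    return 1
  fi
  # --config.test_and_exit reports syntax errors and returns non-zero on failure.
  "$logstash_bin" -f "$conf_dir" --config.test_and_exit
}

# Example call with the default paths:
# validate_logstash_config
```

A typo in the grok pattern's quoting is the most likely thing to trip this up, so it is a cheap check to run after every edit.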

Start Logstash

$ sudo systemctl start supervisor
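If Supervisor was already running when you created logstash.conf, starting it again will not pick up the new program. A sketch of how you might load the config and check on the process, assuming the supervisorctl client that ships with Supervisor and the program name logstash from the file created earlier; check_logstash is a made-up helper name.

```shell
#!/usr/bin/env bash
# Load new Supervisor config and report on the Logstash process.
# Assumes the [program:logstash] entry in /etc/supervisor/conf.d/logstash.conf.

check_logstash() {
  if ! command -v supervisorctl >/dev/null 2>&1; then
    echo "supervisorctl not found; is Supervisor installed?" >&2
    return 1
  fi
  # Re-read files in /etc/supervisor/conf.d and apply any changes.
  sudo supervisorctl reread
  sudo supervisorctl update
  # Print the process state (for example RUNNING or FATAL).
  sudo supervisorctl status logstash
}

# Example call:
# check_logstash
```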

Shortly after starting Logstash, you should be able to log in to Kibana and see logs.
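Once documents are arriving, the "status 500 errors in the last 4 hours" idea from the introduction can also be tried from the command line. This is only a sketch: the URL, user, and password are the placeholders from the output section above, the field names come from the grok pattern, and status_query and count_status are names I picked for the example. It assumes curl is installed and that the response field is searchable with a match query under Elasticsearch's automatic mapping.

```shell
#!/usr/bin/env bash
# Placeholders copied from the Logstash output section; replace with your own.
ES_URL="${ES_URL:-https://example.com:9200}"
ES_USER="${ES_USER:-elastic}"
ES_PASS="${ES_PASS:-secret}"
ES_INDEX="${ES_INDEX:-logstash-access}"

# Print a query body matching documents whose "response" field is STATUS
# and whose @timestamp falls within the last HOURS hours.
status_query() {
  local status="$1" hours="$2"
  cat <<EOF
{
  "query": {
    "bool": {
      "must":   [ { "match": { "response": "${status}" } } ],
      "filter": [ { "range": { "@timestamp": { "gte": "now-${hours}h" } } } ]
    }
  }
}
EOF
}

# Send the query to the _count API and print Elasticsearch's JSON reply.
count_status() {
  curl -sS -u "${ES_USER}:${ES_PASS}" \
    -H 'Content-Type: application/json' \
    -X POST "${ES_URL}/${ES_INDEX}/_count" \
    -d "$(status_query "$1" "$2")"
}

# Example call: how many 500s in the last 4 hours?
# count_status 500 4
```

Swapping _count for _search in the URL should return the matching documents themselves rather than just a total.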

Conclusion

This setup is relatively easy for what we're achieving. This article, however, is just meant to get you up and running; it is not optimized for production use.

It's important to note that we have not defined a data model, and Elasticsearch will do its best to automatically assign types to the document properties being sent in. In many cases this is just fine, but it may also be inefficient and restrictive depending on how you plan to use the data.
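To see which types Elasticsearch actually picked, you can ask for the index mapping. A small sketch, assuming curl and the same placeholder URL, credentials, and index name used in the Logstash output section; mapping_url and show_mapping are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Placeholders from the Logstash output section; replace with your own.
ES_URL="${ES_URL:-https://example.com:9200}"
ES_USER="${ES_USER:-elastic}"
ES_PASS="${ES_PASS:-secret}"
ES_INDEX="${ES_INDEX:-logstash-access}"

# Build the URL of the index's _mapping endpoint (tolerates a trailing slash).
mapping_url() {
  printf '%s/%s/_mapping' "${ES_URL%/}" "${ES_INDEX}"
}

# Fetch the mapping as pretty-printed JSON.
show_mapping() {
  curl -sS -u "${ES_USER}:${ES_PASS}" "$(mapping_url)?pretty"
}

# Example call:
# show_mapping
```

Fields such as response, bytes, and request_time are worth a look here, since grok captures them as strings unless told otherwise.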

If Elasticsearch seems like a good fit, it is worth dedicating some time to learning its inner workings, so you can define templates and indexes explicitly to optimize storage and add flexibility in how data is retrieved.

Elasticsearch information on the web can be confusing because much of it describes older versions. Elastic offers developer and operations training either live or over the web. We attended the two-day developer training in Los Angeles. It was jam-packed with detail and provided clarity on best practices and where the project is going in the future. We highly recommend this training if you are going to integrate Elastic products into your infrastructure.
