Analyzing Apache Logs on the Cloud

With Treasure Data, you can analyze apache logs on the cloud. You can forget about the servers, storage, and infrastructure needed to store your billions of records and focus on analyzing your data instead.

This article explains how to import Apache logs using td-agent. The basic idea is that td-agent (1) continuously accepts new Apache logs, (2) parses the logs, and (3) periodically uploads the logs into the cloud. td-agent’s tail source type is used.

Table of Contents

Prerequisites

  • Basic knowledge of Treasure Data, including the toolbelt.
  • Basic knowledge of td-agent.
  • Ruby 1.8 or higher (for local testing).

Architecture

td-agent is a versatile daemon program that can process various kinds of streamed log data. td-agent is developed and maintained by Treasure Data, Inc..

When a user accesses an Apache web server, the web server appends a corresponding entry to the access log file (typically located at /var/log/apache2/access_log). td-agent continuously “tails” the access log, identifying the newly created entries.

td-agent automatically parses the new log entries to create meaningful data fields. The data is compressed and uploaded to the cloud periodically.

Installing td-agent

You must first set up td-agent on your application servers. Please refer to the following articles.

If you have... Please refer to...
Debian / Ubuntu Systems Installing td-agent for Debian and Ubuntu
Redhat / CentOS Systems Installing td-agent for Redhat and CentOS

For Linux systems, we provide deb/rpm packages.

td-agent is fully open-sourced under the fluentd project. td-agent extends fluentd with custom plugins for Treasure Data.

Modifying /etc/td-agent/td-agent.conf

In order to import Apache logs with td-agent, first add the following lines to your td-agent.conf file. Please note that the database and table names must be specified using the tag parameter.

# tail apache access_log
<source>
  type tail
  path /var/log/apache2/access_log
  format apache
  tag td.YOUR_DATABASE_NAME.YOUR_TABLE_NAME
</source>
On Redhat based systems, you might need to change the directory permission with `chmod g+rx /var/log/httpd` or `chmod o+rx /var/log/httpd`.

For example, if you set ‘td.testdb.apache_logs’ as the tag, the logs are sent to the apache_logs table within the testdb database.

Next, specify your authentication key by setting the apikey option. You can view your api key from the console. Next, set the apikey option in your td-agent.conf file.

*YOUR_API_KEY* should be your actual API key string.
# Treasure Data Output
<match td.*.*>
  type tdlog
  apikey YOUR_API_KEY
  auto_create_table
  buffer_type file
  buffer_path /var/log/td-agent/buffer/td
  use_ssl true
</match>

Now, restart td-agent to reflect the above changes.

$ sudo /etc/init.d/td-agent restart

td-agent now tails Apache logs, parses the new log entries, and uploads the parsed data into the cloud every 5 minutes.

Confirming Import

td-agent continuously uploads logs every 5 minutes. You can force td-agent to flush the buffered logs into the cloud by sending a SIGUSR1 signal.

$ curl http://127.0.0.1/
$ kill -USR1 `cat /var/run/td-agent/td-agent.pid`

You can confirm that data is getting uploaded successfully with the td tables command, which shows you how many rows have been saved on the cloud (the “Count” column).

$ td tables
+------------+------------+------+-----------+
| Database   | Table      | Type | Count     |
+------------+------------+------+-----------+
| test_db    | test_table | log  | 21321     |
+------------+------------+------+-----------+

Sample Queries

Here are a few sample queries that calculate some interesting results.

Top User Agents

$ td query -w -d testdb \
  "SELECT v['agent'] AS agent, COUNT(1) AS cnt \
   FROM www_access \
   GROUP BY v['agent'] ORDER BY cnt DESC LIMIT 3"
...
Result:
+---------------------------------------------------------------------------+-----+
| agent                                                                     | cnt |
+---------------------------------------------------------------------------+-----+
| Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1     | 630 |
| Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)    | 497 |
| Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)  | 341 |
+---------------------------------------------------------------------------+-----+

Top Paths

$ td query -w -d testdb \
  "SELECT v['path'] AS path, COUNT(1) AS cnt \
   FROM www_access \
   GROUP BY v['path'] ORDER BY cnt DESC LIMIT 3"
...
Result:
+-----------------------+-----+
| path                  | cnt |
+-----------------------+-----+
| /category/electronics | 639 |
| /category/software    | 416 |
| /category/books       | 285 |
+-----------------------+-----+

Top Referers within a Specific Day

Using TD_TIME_RANGE UDF is efficient and simple to use. Please refer to the Performance Tuning article for more information.

$ td query -w -d testdb \
  "SELECT v['referer'] AS referer, COUNT(1) AS cnt \
   FROM www_access6 \
   WHERE \
     TD_TIME_RANGE(time, '2012-05-26', '2012-05-27', 'PDT') \
   GROUP BY v['referer'] ORDER BY cnt DESC LIMIT 3"
...
Result:
+-----------------------+------+
| referer               | cnt  |
+-----------------------+------+
| -                     | 1939 |
| /category/electronics | 327  |
| /category/software    | 235  |
+-----------------------+------+

Access Count by Day

Using TD_TIME_FORMAT UDF is useful for daily aggregation.

$ td query -w -d testdb \
  "SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'PDT') AS day, COUNT(1) AS cnt \
   FROM www_access \
   GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'PDT')"
....
Result:
+------------+------+
| day        | cnt  |
+------------+------+
| 2012-05-26 | 5000 |
| 2012-05-27 | 6000 |
+------------+------+

Status Code Distribution by Day

Using TD_TIME_FORMAT UDF is useful for daily aggregation.

$ td query -w -d testdb \
  "SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'PDT') AS day, v['code'] AS code, COUNT(1) AS cnt \
   FROM www_access \
   GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'PDT'), v['code']"
....
Result:
+------------+------+------+
| day        | code | cnt  |
+------------+------+------+
| 2012-05-26 | 200  | 4981 |
| 2012-05-26 | 404  | 17   |
| 2012-05-26 | 500  | 2    |
| 2012-05-27 | 200  | 4481 |
| 2012-05-27 | 404  | 517  |
| 2012-05-27 | 500  | 2    |
+------------+------+------+

Min/Max/Avg Size by Day

Using TD_TIME_FORMAT UDF is useful for daily aggregation.

$ td query -w -d testdb \
  "SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'PDT') AS day, \
     max(cast(v['size'] AS INT)) AS max, \
     min(cast(v['size'] AS INT)) AS min, \
     avg(cast(v['size'] AS INT)) AS avg  \
   FROM www_access \
   GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'PDT')"
...
Result:
+------------+-----+-----+---------+
| day        | max | min | avg     |
+------------+-----+-----+---------+
| 2012-05-26 | 139 | 40  | 89.0056 |
| 2012-05-27 | 100 | 30  | 95.0056 |
+------------+-----+-----+---------+

If this article is incorrect or outdated, or omits critical information, please let us know. For all other issues, please see our support channels. Live chat with our staffs also work well.