AWS has made great strides in the past year to make CloudTrail far more useful. Recently AWS added a point-and-click wizard in CloudTrail to set up Athena, validating the strengths of this approach, but they stop short of giving great guidance on how to use and scale it.
I want to thank Corcoran Smith @corcoranCI for reminding me to update this article.
Setting Up the Tables
AWS released the CloudTrail SerDe sometime after my last post, and I have been using it for the past 6 to 9 months. If you look at the last article you will notice a very complicated CREATE TABLE statement; luckily, that has changed to this:
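This is a sketch of the SerDe-based table, following the DDL AWS documents for CloudTrail, with the partition columns discussed below added. The table name, bucket, and account ID are placeholders to adjust for your environment:

```sql
CREATE EXTERNAL TABLE cloudtrail_logs (
    eventversion STRING,
    useridentity STRUCT<
        type: STRING,
        principalid: STRING,
        arn: STRING,
        accountid: STRING,
        invokedby: STRING,
        accesskeyid: STRING,
        username: STRING,
        sessioncontext: STRUCT<
            attributes: STRUCT<
                mfaauthenticated: STRING,
                creationdate: STRING>,
            sessionissuer: STRUCT<
                type: STRING,
                principalid: STRING,
                arn: STRING,
                accountid: STRING,
                username: STRING>>>,
    eventtime STRING,
    eventsource STRING,
    eventname STRING,
    awsregion STRING,
    sourceipaddress STRING,
    useragent STRING,
    errorcode STRING,
    errormessage STRING,
    requestparameters STRING,
    responseelements STRING,
    additionaleventdata STRING,
    requestid STRING,
    eventid STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountid: STRING,
        type: STRING>>,
    eventtype STRING,
    apiversion STRING,
    readonly STRING,
    recipientaccountid STRING,
    serviceeventdetails STRING,
    sharedeventid STRING,
    vpcendpointid STRING
)
-- Partition on region/year/month; see the next section for why
PARTITIONED BY (region STRING, year STRING, month STRING)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-cloudtrail-bucket/AWSLogs/999999999999/CloudTrail/';
```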
Some of the key changes are in how the data is parsed: there is less de-nesting of JSON, and ultimately it is much easier to query now. Also, make sure you add partitions to the tables. I'll explain why that is important later, but do it now and I'll show you how to automate it next.
Getting Started With Partitions
What I discovered when querying data going back over time was inefficiency and, more importantly, increased cost. Partitions in Athena are the right way to solve this, but to do that you have to add them individually to each table.
Given the amount of logs I have and how infrequently I look at some regions, I decided to partition on region, year, and month. To do that I first looked at Boto3, but unfortunately, as of this writing, there is still no waiter function for Athena queries. It is very easy to overrun the limits on concurrent DDL queries, so I went looking and found athena-cli, a fantastic overlay on Boto3/the CLI, which I cannot recommend highly enough.
To add the partitions, I wrote a script and used the waiters native to athena-cli to ensure I didn't overrun the limits. I added some concurrency to speed things up while staying under my concurrent DDL limit.
For example, here is a query to add a partition for us-east-1 for April 2018 for account 999999999999:
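A statement along these lines does it; the table and bucket names here are placeholders, and the S3 path assumes the standard CloudTrail layout of `AWSLogs/<account>/CloudTrail/<region>/<year>/<month>/`:

```sql
ALTER TABLE cloudtrail_logs ADD IF NOT EXISTS
    PARTITION (region = 'us-east-1', year = '2018', month = '04')
    LOCATION 's3://your-cloudtrail-bucket/AWSLogs/999999999999/CloudTrail/us-east-1/2018/04/';
```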
You can also pre-partition your data, so I generally load up a year's worth of partitions at once; Athena does not care whether the folder exists when you set up the partition. It is important to note that if your schema is partitioned, you will never see data in your queries from partitions you have not created.
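A year's worth of partition statements is easy to pre-generate with a small script. This is a minimal sketch, not the exact script I used; the table, bucket, and account names are placeholders, and the submission step (athena-cli or Boto3's `start_query_execution`) is left as a comment:

```python
from itertools import product

# Hypothetical names: adjust these to match your setup.
TABLE = "cloudtrail_logs"
BUCKET = "your-cloudtrail-bucket"
ACCOUNT = "999999999999"

def partition_ddl(region: str, year: int, month: int) -> str:
    """Build one ALTER TABLE statement for a region/year/month partition."""
    location = (
        f"s3://{BUCKET}/AWSLogs/{ACCOUNT}/CloudTrail/"
        f"{region}/{year}/{month:02d}/"
    )
    return (
        f"ALTER TABLE {TABLE} ADD IF NOT EXISTS "
        f"PARTITION (region = '{region}', year = '{year}', month = '{month:02d}') "
        f"LOCATION '{location}';"
    )

def year_of_partitions(regions, year):
    """Pre-generate all twelve monthly partitions for each region."""
    return [partition_ddl(r, m_y[1], m_y[0]) if False else partition_ddl(r, year, m)
            for r, m in product(regions, range(1, 13))]

statements = year_of_partitions(["us-east-1", "us-west-2"], 2018)
# Submit each statement via athena-cli (or boto3 start_query_execution),
# waiting for each DDL query to finish so you stay under the concurrency limit.
```

Because Athena tolerates partitions pointing at folders that do not yet exist, you can run this once in January and forget about it for the rest of the year.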
Note: the DDL AWS presents on the CloudTrail screen does not contain partitions; I strongly encourage you to add them.
Working with the data
Let's look at the structure of a few records as they appear now.
With these data sets, you get simpler queries. Looking at the first data set, here is a query which would include that record in its output, along with others:
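A representative query of this shape, assuming a table named `cloudtrail_logs` partitioned on region, year, and month as described above (the filter values are illustrative):

```sql
SELECT eventtime,
       eventname,
       useridentity.arn,
       sourceipaddress
FROM cloudtrail_logs
WHERE region = 'us-east-1'
  AND year = '2018'
  AND month = '04'
  AND useridentity.type = 'IAMUser'
LIMIT 10;
```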
In this query you can see that useridentity supports dotted-notation addressing of its sub-fields, which allows for very powerful queries using the Presto engine, including regular expressions. The other columns can be addressed normally by column name, which again is much simpler.
Query and Performance Comparison
Now that the data is a bit easier to comprehend, how much easier is it to write queries? Just as important, how much faster and more efficient are they to run? To test this, I ran the following two queries against my largest account.
I am looking for the 20 highest counts of the tuple of event name/ARN/source IP for March 2018 in us-east-1 only.
The old table and query format:
Scanned 13.97GB of data
The new table and query format:
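In the new format, the query described above looks roughly like this, using the partition columns to prune the scan (table name and partition values match the examples above):

```sql
SELECT eventname,
       useridentity.arn,
       sourceipaddress,
       count(*) AS event_count
FROM cloudtrail_logs
WHERE region = 'us-east-1'
  AND year = '2018'
  AND month = '03'
GROUP BY eventname, useridentity.arn, sourceipaddress
ORDER BY event_count DESC
LIMIT 20;
```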
Scanned 1.2GB of data
In summary, the new table structure and queries are much faster, cheaper, and easier. In fact, on average, they are 500x cheaper and 400% faster. There really is little disadvantage to changing to the new schema.
Do you have any cool queries you wrote to summarize your data? I would love to hear from you.