Using AWS Athena and CloudTrail Revisited

By Thomas Vachon

What’s new with Athena and CloudTrail

AWS has made great strides in making CloudTrail far more useful over the past year. Recently, AWS added a point-and-click wizard in CloudTrail to set up Athena, which validates the strength of this approach, but it stops short of giving much guidance on how to use and scale it.

Acknowledgements

I want to thank Corcoran Smith @corcoranCI for reminding me to update this article.

Setting Up the Tables

AWS released the CloudTrail SerDe sometime after my last post, and I have been using it for the past 6 to 9 months. If you look at the last article, you will notice a very complicated CREATE TABLE statement; luckily, that has been reduced to this:

CREATE EXTERNAL TABLE my_table_name (
    eventversion STRING,
    userIdentity STRUCT<
        type:STRING,
        principalid:STRING,
        arn:STRING,
        accountid:STRING,
        invokedby:STRING,
        accesskeyid:STRING,
        userName:STRING,
        sessioncontext:STRUCT<
            attributes:STRUCT<
                mfaauthenticated:STRING,
                creationdate:STRING>,
            sessionIssuer:STRUCT<
                type:STRING,
                principalId:STRING,
                arn:STRING,
                accountId:STRING,
                userName:STRING>>>,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestParameters STRING,
    responseElements STRING,
    additionalEventData STRING,
    requestId STRING,
    eventId STRING,
    resources ARRAY<STRUCT<
        ARN:STRING,
        accountId:STRING,
        type:STRING>>,
    eventType STRING,
    apiVersion STRING,
    readOnly STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcEndpointId STRING
)
PARTITIONED BY (
    region STRING,
    year STRING,
    month STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my_consolidated_bucket/my-cross-account-prefix/AWSLogs/';

Some of the key changes are in how the data is parsed: there is less de-nesting of JSON, but ultimately it is much easier to query now. Also, I made sure to add partitions to the table. I'll explain why that is important later, but add them now and I'll show you how to automate it next.
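
For example, because requestParameters is kept as a plain JSON string, you can reach into it with Presto's JSON functions when you need a nested value. Here is a minimal sketch; the alarmNames path is just an illustration based on the sample record shown later and will vary by event type:

SELECT eventname,
       -- pull a nested value out of the requestparameters JSON string
       json_extract_scalar(requestparameters, '$.alarmNames[0]') AS alarm_name
FROM my_table_name
WHERE eventsource = 'monitoring.amazonaws.com';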

Getting Started With Partitions

What I discovered when querying data going back over long periods was inefficiency and, more importantly, increased cost. Partitions in Athena are the right way to solve this, but to use them you have to add each partition to the table individually.

Given the amount of logs I have and how infrequently I look at some regions, I decided to partition on region, year, and month. To automate this I first looked at Boto3, but unfortunately, as of this writing, there is still no Waiter for Athena queries. It is very easy to overrun the concurrent query limits on DDL statements, so I went looking and found a fantastic overlay on Boto3/CLI called athena-cli, which I cannot recommend highly enough.

To add the partitions, I wrote a script that used the waiters built into athena-cli to make sure I didn't overrun the limits, adding just enough concurrency to stay under my DDL limit while still picking up some speed.

For example, here is a query to add a partition for us-east-1 for April 2018 for account "999999999999":

ALTER TABLE my_table_name ADD PARTITION (region='us-east-1',year='2018',month='04') 
location 's3://my_consolidated_bucket/my-cross-account-prefix/AWSLogs/999999999999/CloudTrail/us-east-1/2018/04/';

Also, you can pre-partition your data, so I generally load up a year's worth of partitions at once; Athena does not care whether the folder exists yet when you set up the partition. It is important to note that if you declare partitions in your schema but never create them, you will never see the data when you query.
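
For example, here is a sketch (using the same placeholder bucket, prefix, and account as above) that loads three months of us-east-1 partitions in a single ALTER TABLE statement; batching like this also means fewer DDL statements to keep under the concurrency limit:

ALTER TABLE my_table_name ADD
  PARTITION (region='us-east-1', year='2018', month='05')
    LOCATION 's3://my_consolidated_bucket/my-cross-account-prefix/AWSLogs/999999999999/CloudTrail/us-east-1/2018/05/'
  PARTITION (region='us-east-1', year='2018', month='06')
    LOCATION 's3://my_consolidated_bucket/my-cross-account-prefix/AWSLogs/999999999999/CloudTrail/us-east-1/2018/06/'
  PARTITION (region='us-east-1', year='2018', month='07')
    LOCATION 's3://my_consolidated_bucket/my-cross-account-prefix/AWSLogs/999999999999/CloudTrail/us-east-1/2018/07/';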

Note: The DDL that AWS presents on the CloudTrail screen does not contain partitions; I strongly encourage you to add them.
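
Once the partitions are loaded, a quick way to double-check which ones Athena actually knows about is:

SHOW PARTITIONS my_table_name;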

Working with the data

Let's look at the structure of a few records as they appear now.

eventversion        | 1.04
useridentity        | {type=IAMUser, principalid=LKUWHE3545KJ34534L65U, 
                      arn=arn:aws:iam::999999999999:user/iam-username, 
                      accountid=999999999999, invokedby=null, 
                      accesskeyid=FAKEACCESSID, username=iam-username, 
                      sessioncontext=null}
eventtime           | 2018-04-04T23:55:30Z
eventsource         | monitoring.amazonaws.com
eventname           | DescribeAlarms
awsregion           | us-east-1
sourceipaddress     | 198.51.100.144
useragent           | aws-cli/1.11.132
errorcode           | NULL
errormessage        | NULL
requestparameters   | {"alarmNames":["alarmname"]}
responseelements    | null
additionaleventdata | NULL
requestid           | 09a5f980-13a2-48af-94d7-f27a2affbdbe
eventid             | 55979b8b-494f-4c8f-9cf9-3edaadefe142
resources           | NULL
eventtype           | AwsApiCall
apiversion          | NULL
readonly            | NULL
recipientaccountid  | 999999999999
serviceeventdetails | NULL
sharedeventid       | NULL
vpcendpointid       | NULL
region              | us-east-1
year                | 2018
month               | 04
eventversion        | 1.05
useridentity        | {type=AssumedRole, principalid=LKUWHE3545KJ34534L65U:user@example.com,
                      arn=arn:aws:sts::999999999999:assumed-role/rolename/user@example.com, 
                      accountid=999999999999, invokedby=null, accesskeyid=null, 
                      username=null, sessioncontext=null}
eventtime           | 2018-03-12T12:00:37Z
eventsource         | signin.amazonaws.com
eventname           | ConsoleLogin
awsregion           | us-east-1
sourceipaddress     | 198.51.100.144
useragent           | Chrome/65.0.3325.146
errorcode           | NULL
errormessage        | NULL
requestparameters   | null
responseelements    | {"ConsoleLogin":"Success"}
additionaleventdata | {"LoginTo":"https://console.aws.amazon.com/console/home?region=us-east-1",
                      "MobileVersion":"No","MFAUsed":"No",
                      "SamlProviderArn":"arn:aws:iam::999999999999:saml-provider/MySamlIdp"}
requestid           | NULL
eventid             | 96b00be0-6600-4489-8f94-3f70b04c4a66
resources           | NULL
eventtype           | AwsConsoleSignIn
apiversion          | NULL
readonly            | NULL
recipientaccountid  | 999999999999
serviceeventdetails | NULL
sharedeventid       | NULL
vpcendpointid       | NULL
region              | us-east-1
year                | 2018
month               | 03

With data in this shape, the queries become simpler. For the first record above, here is a query which would include it in its output, along with any similar records:

SELECT * FROM my_table_name 
  WHERE useridentity.username = 'iam-username' 
  AND year = '2018' 
  AND month = '04';

In this query you can see that useridentity allows dotted-notation addressing of sub-fields, which enables very powerful queries using Presto functions, including regular expressions. The other columns can be addressed normally by name, which again is much simpler.
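
For example, here is a hypothetical query (the role name is a placeholder) that uses a regular expression against the nested ARN to find all assumed-role activity for a month:

SELECT eventtime, eventname, sourceipaddress
FROM my_table_name
WHERE regexp_like(useridentity.arn, 'assumed-role/rolename/')
  -- partition predicates keep the scan small and cheap
  AND year = '2018'
  AND month = '03'
  AND region = 'us-east-1';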

Query and Performance Comparison

Now that we have seen the data is a bit easier to comprehend, how much easier are the queries to write? Just as important, how much faster and more efficient are they to run? To test this, I ran the following two queries against my largest account.

I am looking to find the 20 highest counts of a tuple of Event Names/ARN/SourceIP for March 2018 in us-east-1 only.

Previous Style

SELECT record.eventName, record.userIdentity.arn, record.sourceIPAddress, COUNT(*)
FROM
(SELECT record
FROM my_table_name
CROSS JOIN UNNEST(records) AS t (record)) AS records
WHERE record.eventtime LIKE '2018-03-%' and record.awsregion = 'us-east-1'
GROUP BY record.eventName, record.userIdentity.arn, record.sourceIPAddress
ORDER BY COUNT(*) DESC
LIMIT 20;

Current Style

select eventname, useridentity.arn, sourceipaddress, count(*) 
  from my_table_name 
  where year = '2018' 
  and month = '03' 
  and region = 'us-east-1' 
  group by eventname, useridentity.arn, sourceipaddress 
  order by count(*) DESC
  LIMIT 20;

The old table and query format:

  • 449.72 seconds
  • Scanned 13.97GB of data
  • Cost $0.06985

The new table and query format:

  • 15.03 seconds
  • Scanned 1.2GB of data
  • Cost $0.006

Conclusion

In summary, the new table structure and queries are much faster, cheaper, and easier. In the test above, the new format was roughly 30x faster and more than 10x cheaper to run. There really is little disadvantage to switching to the new schema.

Do you have any cool queries you wrote to summarize your data? I would love to hear from you.