Using AWS Athena to Query CloudTrail Logs

By Thomas Vachon

Athena and CloudTrail: A Marriage made in the Cloud

One of the first things that came to mind when AWS announced Athena at re:Invent 2016 was querying CloudTrail logs. Over the past month I had intended to set this up, but current needs dictated that I do it quickly. When I went looking at JSON imports for Hive/Presto, I was quite confused. Of course, as a trusty technologist, I went to Google. Much to my surprise, no one had published an article about using Athena to do this; I was only able to locate EMR-based posts that used a custom SerDe to support the nested CloudTrail format.

I had mild success at first, but thanks to some Athena gurus, I was able to get the magic piece in place.

I have to provide credit to AWS for their help with a few issues and amazing documentation on the event types.

I have provided references at the end of each field section and the end of the post with specific and broader details for the event fields and their uses.

To set all of this up, you must first have your CloudTrail logs in a single S3 bucket. This works with a single account or many. I purposely set up delivery to a single bucket and created a table per source in Athena under a common database.

This is an example CREATE TABLE statement that shows the table and field syntax I used for the tables below. The ROW FORMAT SERDE line is an assumption (the OpenX JSON SerDe, which accepts the ignore.malformed.json property); substitute the JSON SerDe you actually use.

-- Assumption: the SerDe class below is the OpenX JSON SerDe; swap in your own if it differs.
CREATE EXTERNAL TABLE IF NOT EXISTS my_table_name (
         Records ARRAY< STRUCT< eventName: STRING,
         requestParameters: STRUCT< instancesSet: STRUCT< items: ARRAY< STRUCT< instanceId: STRING >>>,
         volumeSet: STRUCT< items: ARRAY< STRUCT< volumeId: STRING > > > >,
         eventType: STRING,
         eventSource: STRING,
         sourceIPAddress: STRING,
         userIdentity: STRUCT< arn: STRING,
         principalId: STRING,
         accountId: STRING,
         invokedBy: STRING,
         TYPE: STRING,
         sessionContext: STRUCT< sessionIssuer: STRUCT< arn: STRING,
         principalId: STRING,
         accountId: STRING,
         TYPE: STRING,
         userName: STRING >,
         attributes: STRUCT< creationDate: STRING,
         mfaAuthenticated: STRING > > >,
         eventVersion: STRING,
         responseElements: STRUCT< credentials: STRUCT< accessKeyId: STRING,
         expiration: STRING,
         sessionToken: STRING >,
         assumedRoleUser: STRUCT< arn: STRING,
         assumedRoleId: STRING > >,
         userAgent: STRING,
         eventID: STRING,
         awsRegion: STRING,
         sharedEventID: STRING,
         eventTime: STRING,
         resources: ARRAY< STRUCT< accountId: STRING,
         TYPE: STRING,
         ARN: STRING > >,
         requestID: STRING,
         recipientAccountId: STRING >>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH serdeproperties( 'ignore.malformed.json' = 'true' )
LOCATION 's3://my_consolidated_bucket/my-cross-account-prefix/AWSLogs/'
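
Once the table exists, a quick sanity check helps confirm that the nested Records array is readable. The following is a minimal sketch, not a definitive query: it reuses the my_table_name placeholder and the same CROSS JOIN UNNEST pattern as the example queries later in this post, and the LIMIT value is arbitrary.

```sql
-- Sanity check: flatten the Records array and peek at a few events.
-- my_table_name is the placeholder table name used throughout this post.
SELECT record.eventTime,
       record.eventSource,
       record.eventName,
       record.awsRegion
FROM my_table_name
CROSS JOIN UNNEST(records) AS t (record)
LIMIT 10
```

Each log file holds one JSON object with a single Records array, so every row in the table is a whole file; the UNNEST step is what turns it into one row per event.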

CloudTrail Record Query Columns

These are the columns you can reference in your queries; I have grouped them by purpose. This is not a full list of all CloudTrail fields, so if you need others, such as VpcEndpoint, you should add them to the schema.

Event ID Fields

record.eventID GUID generated by CloudTrail to uniquely identify each event
record.sharedEventID GUID generated by CloudTrail to uniquely identify CloudTrail events from the same AWS action that is sent to different AWS accounts

Event Details

record.eventName The requested action, which is one of the actions in the API for that service. (example: DescribeLoadBalancers)
record.eventSource The service that the request was made to (e.g.
record.eventTime The date and time the request was made, in coordinated universal time (UTC)
record.eventType Identifies the type of event that generated the event record, one of AwsApiCall, ConsoleSignin, AwsServiceEvent (related to the trail itself, this can occur when another account made a call with a resource that you own)
record.eventVersion The version of the log event format
record.sourceIPAddress The IP address that the request was made from, when console is used, it will report
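
As an illustration of filtering on the event detail fields, the sketch below lists recent console sign-ins. It assumes console logins are recorded with eventSource 'signin.amazonaws.com' and eventName 'ConsoleLogin', and it reuses the my_table_name placeholder; the LIMIT value is arbitrary.

```sql
-- Recent console sign-ins, newest first.
-- Assumption: console logins appear with eventSource 'signin.amazonaws.com'
-- and eventName 'ConsoleLogin'.
SELECT record.eventTime,
       record.userIdentity.arn,
       record.sourceIPAddress,
       record.userAgent
FROM
(SELECT record
FROM my_table_name
CROSS JOIN UNNEST(records) AS t (record)) AS records
WHERE record.eventSource = 'signin.amazonaws.com'
  AND record.eventName = 'ConsoleLogin'
ORDER BY record.eventTime DESC
LIMIT 50
```

Because eventTime is stored as a STRING in ISO 8601 UTC form, sorting it as text also sorts it chronologically.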

Request Details

record.requestID The value that identifies the request, generated by the service being called
record.requestParameters The parameters, if any, that were sent with the request

Resource Details

record.resources An array of the resources accessed in the event, used most often by STS or KMS
record.resources.accountId The account ID of the impacted element

Response Details

record.responseElements.assumedRoleUser.arn The ARN of the assumed role for the unique session
record.responseElements.assumedRoleUser.assumedRoleId The ID of the assumed role for the unique session
record.responseElements.credentials.accessKeyId The access key of the caller
record.responseElements.credentials.expiration The expiration of the current session
record.responseElements.credentials.sessionToken The active token for the session
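
The response fields above are mostly populated by STS calls, so one way to use them is to list who assumed which role. This is a hedged sketch that assumes the events of interest have eventSource 'sts.amazonaws.com' and eventName 'AssumeRole'; the column aliases are my own, and it deliberately leaves sessionToken out of the result.

```sql
-- Who assumed which role, and when the resulting session expires.
-- Assumption: role assumption is logged as eventName 'AssumeRole'
-- from eventSource 'sts.amazonaws.com'.
SELECT record.eventTime,
       record.userIdentity.arn AS caller_arn,
       record.responseElements.assumedRoleUser.arn AS assumed_role_arn,
       record.responseElements.credentials.expiration AS session_expiration
FROM
(SELECT record
FROM my_table_name
CROSS JOIN UNNEST(records) AS t (record)) AS records
WHERE record.eventSource = 'sts.amazonaws.com'
  AND record.eventName = 'AssumeRole'
ORDER BY record.eventTime DESC
```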



record.userAgent The agent through which the request was made
record.recipientAccountId Represents the account ID that received this event, may differ from the calling account if cross-account access occurred and will differ on the "remote" end

User Identity

record.userIdentity.accountId The account that owns the entity that granted permissions for the request
record.userIdentity.arn The Amazon Resource Name (ARN) of the principal that made the call
record.userIdentity.invokedBy The name of the AWS service that made the request, if any
record.userIdentity.principalId A unique identifier for the entity that made the call. For requests made with temporary security credentials, this value includes the session name that is passed to the AssumeRole, AssumeRoleWithWebIdentity, or GetFederationToken API call
record.userIdentity.sessionContext.attributes.creationDate The date and time when the temporary security credentials were issued
record.userIdentity.sessionContext.attributes.mfaAuthenticated The value is true if the root user or IAM user whose credentials were used for the request also was authenticated with an MFA device; otherwise, false
record.userIdentity.sessionContext.sessionIssuer.accountId The account that owns the entity that was used to get credentials
record.userIdentity.sessionContext.sessionIssuer.arn The ARN of the source (account, IAM user, or role) that was used to get temporary security credentials
record.userIdentity.sessionContext.sessionIssuer.type The source of the temporary security credentials, such as Root, IAMUser, or Role
record.userIdentity.sessionContext.sessionIssuer.userName The friendly name of the user or role that issued the session. The value that appears depends on the sessionIssuer identity type. See reference material for more information
record.userIdentity.type The type of the identity, which is one of: Root, IAMUser, AssumedRole, FederatedUser, AWSAccount (cross-account access), AWSService (access performed by an AWS service such as Elastic Beanstalk)
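
The identity fields lend themselves to simple audit questions. As one possible sketch, the query below counts what the root user has been doing, grouped by service and action. It reuses the my_table_name placeholder, and the total alias is my own.

```sql
-- Activity performed by the account root user, most frequent first.
SELECT record.eventSource,
       record.eventName,
       COUNT(*) AS total
FROM
(SELECT record
FROM my_table_name
CROSS JOIN UNNEST(records) AS t (record)) AS records
WHERE record.userIdentity.type = 'Root'
GROUP BY record.eventSource, record.eventName
ORDER BY total DESC
```

Swapping 'Root' for another identity type from the list above, such as 'AssumedRole', narrows the same report to that kind of caller.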


Example Queries

Count events by event name, ARN, and source IP address, with the highest totals first

SELECT record.eventName, record.userIdentity.arn, record.sourceIPAddress, COUNT(*) AS total
FROM
(SELECT record
FROM my_table_name
CROSS JOIN UNNEST(records) AS t (record)) AS records
GROUP BY record.eventName, record.userIdentity.arn, record.sourceIPAddress
ORDER BY total DESC

Find all events where cross-account access occurred, group them by the source and the ARN and count the totals

SELECT record.eventName, record.eventSource, record.userIdentity.arn, COUNT(*)
FROM
(SELECT record
FROM my_table_name
CROSS JOIN UNNEST(records) AS t (record)) AS records
WHERE record.recipientAccountId <> record.userIdentity.accountId
GROUP BY record.eventName, record.eventSource, record.userIdentity.arn

Document Reference: