    Integrating Call Log Data into Amazon Redshift


    Deepesh Tiwari


    Introduction

    The TIBCO Mashery Enriched Call Log Feature is a great new way to get API performance and usage data associated with the traffic flowing through the Mashery API Platform.

    An overview of the feature is available here, and the full documentation for setting up the export of the data to S3 is here.

    One key aspect of the feature is that you need to have an Amazon S3 bucket set up in order to receive data from Mashery. Once set up in the Mashery Control Center (see the documentation link above), you'll start to receive zipped CSV files containing your API transaction data.

    Now that you have your data in an S3 bucket, you have a number of choices as to what you can do with it. One of the great things about Amazon S3 is that it serves as a data foundation for many other Amazon services. Amazon Redshift is a great data warehousing technology which can be used as the data layer for more advanced analytical tools like TIBCO Spotfire and TIBCO Jaspersoft, among other independent 3rd-party technologies.

    This article shows how easy it is to get the data from S3 and load it into Redshift.

     

    Set up Amazon Redshift

    The first step is to configure your Amazon account for Redshift access and set up a cluster to hold the data.

    It is recommended that this is done in the same Amazon region that contains the S3 bucket storing the exported Mashery data.

    Step 1 - Create an IAM Role

    The IAM role will allow Redshift to access Amazon services, like S3, on your behalf.

    Sign on to your Amazon management console and select IAM from the Services menu. This should display the IAM dashboard.


    Select the Roles menu


    Click Create new Role


    Select AWS Service for the trusted entity type. Select Redshift as the service and then select the use case of Redshift - Customizable.


    Use the Filter to find S3 policies. Select the default AmazonS3FullAccess policy. [Note - this policy may be too restrictive or too open depending on your use cases and security requirements. You can create your own policies if required, to tailor the security requirements.]
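    As an illustration of a more tightly scoped alternative, the following boto3 sketch attaches a read-only inline policy limited to a single bucket. It is a minimal sketch only; the bucket and role names are placeholders to be replaced with your own.

        import json
        import boto3

        # Placeholder names - substitute your own export bucket and role.
        BUCKET = 'mashcallexport'
        ROLE_NAME = 'RedshiftRole'

        # Read-only access to the single bucket that receives the Mashery export.
        policy_document = {
            'Version': '2012-10-17',
            'Statement': [{
                'Effect': 'Allow',
                'Action': ['s3:GetObject', 's3:ListBucket'],
                'Resource': [
                    'arn:aws:s3:::' + BUCKET,
                    'arn:aws:s3:::' + BUCKET + '/*',
                ],
            }],
        }

        iam = boto3.client('iam')
        iam.put_role_policy(
            RoleName=ROLE_NAME,
            PolicyName='MasheryExportReadOnly',
            PolicyDocument=json.dumps(policy_document),
        )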


    Lastly name the Role and save it.

    Step 2 - Create the RedShift cluster

    Navigate to the Redshift Dashboard from the Services menu


    At this stage you could use an existing cluster to add the data to, but we'll create a new cluster by clicking Launch Cluster.

    It is recommended that you use the same Amazon Region that contains your S3 bucket. This will enable the data load to perform better as it doesn't have to move data across regions.


    Name the cluster and fill in any required parameters. Click Continue


    Select the Node type and Cluster type. [ds2.xlarge is probably the best balance of performance vs. cost for testing out this feature.]


    Validate the settings for VPC and VPC Security Group. Select the Role you defined in step 1.


    Review the settings and launch the cluster.

    Step 3 - Creating the database

    Once the cluster has launched you can view all the relevant settings on the Cluster page. Selecting the cluster name will display the cluster configuration page; under Cluster Database Properties you will find the key parameters you need to connect to the database.

    To manage the database you will need access to a SQL client. You can use any client that supports the Postgres database. These instructions show the use of SQL Workbench/J, which is a popular open source client. There are complete install instructions here in the AWS documentation on how to install the client and the required drivers for Redshift.

    Once you have installed the client and/or drivers you can connect to the Redshift database using the connection properties shown on the cluster configuration page.


    Once connected you can use the following SQL to add the Mashery schema into the Amazon Redshift cluster.

        create table api_transactions(
            request_host_name varchar(100),
            src_ip varchar(100),
            request_uuid varchar(100),
            http_method varchar(100),
            uri varchar(2500),
            http_version varchar(100),
            bytes bigint,
            http_status_code varchar(100),
            referrer varchar(256),
            user_agent varchar(256),
            request_id varchar(256),
            request_time timestamp,
            api_key varchar(256),
            service_id varchar(256),
            traffic_manager varchar(256),
            api_method_name varchar(256),
            cache_hit boolean,
            traffic_manager_error_code varchar(100),
            total_request_exec_time numeric(10,4),
            remote_total_time numeric(10,4),
            connect_time numeric(10,4),
            pre_transfer_time numeric(10,4),
            oauth_access_token varchar(256),
            ssl_enabled boolean,
            quota_value varchar(100),
            qps_throttle_value varchar(100),
            client_transfer_time numeric(10,4),
            service_name varchar(256),
            response_string varchar(256),
            plan_name varchar(256),
            plan_uuid varchar(256),
            endpoint_name varchar(256),
            package_name varchar(256),
            package_uuid varchar(256),
            service_definition_endpoint_uuid varchar(256)
        );


    Step 4 - Load Data - Scripted

    Attached to this article is a Python 3 script that will load the data into Redshift. Any language that has libraries supporting Postgres and the Amazon APIs could be used if you would prefer.

    The script requires that a couple of Python packages are installed. The first is psycopg, a database driver for PostgreSQL. The second is boto3, the Python library for AWS. See here for information on how to install psycopg, and here for the quickstart guide to boto3.

    Note - you MUST create/configure the AWS credentials file. In the script, the profile name from that file is required to establish the boto3 session.
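    As a rough sketch of the setup the script performs at startup, assuming a credentials profile named mashery and illustrative connection values (none of these names come from the attachment - substitute your own):

        import boto3
        import psycopg2  # the psycopg driver mentioned above, imported as psycopg2

        # The profile name must match a section in your AWS credentials file
        # (~/.aws/credentials); 'mashery' is just an illustrative name.
        session = boto3.Session(profile_name='mashery')
        s3 = session.client('s3')

        # Redshift connection details, taken from the Cluster Database Properties
        # page. All values here are placeholders.
        conn = psycopg2.connect(
            host='mycluster.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com',
            port=5439,
            dbname='data',
            user='awsuser',
            password='YourPassword1',
        )
        cursor = conn.cursor()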

    The script is fairly basic. It takes a set of input parameters that identify the input data source and the Redshift target. Key to the script is the COPY statement.

    The key line is:

        cursor.execute(
            'copy ' + redshift_table_name + ' from \'' + file_key + '\''
            ' credentials \'' + s3_role + '\''
            ' IGNOREHEADER 1 timeformat \'YYYY-MM-DDTHH:MI:SSZ\' gzip'
            ' delimiter \',\' region \'' + s3_region + '\''
        )

    Here we are simply executing a standard Redshift COPY command to copy data into the Redshift table from an S3 bucket. A couple of important points (illustrated in the sketch after the list):

    1. The data in your S3 bucket, as received from Mashery, will be gzip compressed, hence the "gzip" in the above code snippet.
    2. The data includes a header row, hence IGNOREHEADER 1 to skip the first line.
    3. The data is CSV formatted, hence the delimiter choice.
    4. Getting the timeformat right is key to making sure that the request_time field, which is a timestamp, gets loaded into Redshift as a usable timestamp type.
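    For reference, here is roughly what the statement looks like once the variables are filled in, reusing the cursor and conn from the earlier sketch; the bucket, file name and role ARN are illustrative only:

        # The COPY statement the script builds, written out with placeholder
        # values so the options are easier to read. Redshift accepts '--' comments.
        copy_sql = """
            copy api_transactions
            from 's3://mashcallexport/2018/01/example_calllog.csv.gz'
            credentials 'aws_iam_role=arn:aws:iam::999999999999:role/RedshiftRole'
            IGNOREHEADER 1                        -- skip the header row (point 2)
            timeformat 'YYYY-MM-DDTHH:MI:SSZ'     -- parse request_time correctly (point 4)
            gzip                                  -- export files are gzip compressed (point 1)
            delimiter ','                         -- the payload is CSV (point 3)
            region 'us-east-1'
        """
        cursor.execute(copy_sql)
        conn.commit()  # make the loaded rows visible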

    A sample invocation of the script might look like this:

    python sample_redshift_data_loaderP3.py mashcallexport us-east-1 aws_iam_role=arn:aws:iam::999999999999:role/RedshiftRole mashery-ecle.cyltgr9csnft.us-east-1.redshift.amazonaws.com 5439 data <username> <password> api_transactions

    Step 5 - Load Data - Automation via Lambda

    Perhaps a more interesting scenario would be to automate the upload to Redshift. An easy way to do this would be to use an AWS Lambda function to respond to the Amazon S3 object-creation event.

    AWS Lambda allows for the execution of code in response to an event. In the case of S3 this can be the creation of a file in a bucket, so the function will trigger whenever a file is created by Mashery and automatically load it into Redshift so that the data is available as soon as possible.

    To do this navigate to the Lambda dashboard from the AWS Services menu.

    Click on Create function


    Lambda supports functions written in a number of languages, but as we have used Python 3 for our script, let's use Python 3 for our function. On the Select Blueprint page, select the definition you want to base the function on. You can narrow the choices by selecting Python 3, and there is an S3 example that you can use to hand-code your function. [You can use the Blank Function blueprint to upload the supplied sample. Selecting the S3 example will ease the trigger definition, so this would be the preferred choice.]


    First we enter the Basic information. Enter a Name and a Role name. Scroll down and set the trigger values.


    Configure the trigger by selecting the Bucket, the event type and, optionally, the prefix and suffix of the file. At this stage you can enable the trigger, or choose to enable it once we have completed the process. Scroll to the bottom and click Create Function.

    Our function has been created with default code. The Python script we are going to use needs two packages. boto3 is available automatically, but the database driver is not. We can supply it by uploading a zip file that contains our script and a directory containing the correct Python 3 driver files. Attached to this post is a sample that can be uploaded to create the correct files.


    In the Function code section, click on the Code entry type drop-down and select Upload a .ZIP file. Click the Upload button under Function package to select the file you downloaded. You also need to change the Handler value to copy.handler. Also check that the role is correctly set to the role created when the function was created.

    Click on Save at the top of the page and it will upload the code and display the function

    Once the function has been created you will see the uploaded code. This code will need to be amended to specify your values for the variables in the script. Note - it is possible to migrate these variables to environment variables rather than specifying them in the function script.
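    The attached sample contains the full handler; the following is only a rough sketch of the shape such a function takes, not the attached code. The environment variable names are hypothetical, and if the module is saved as copy.py the function name matches the copy.handler setting above.

        import os
        import urllib.parse
        import psycopg2  # packaged in the uploaded .zip; Lambda does not bundle it


        def handler(event, context):
            """Triggered by an S3 ObjectCreated event; COPYs the new file into Redshift."""
            record = event['Records'][0]
            bucket = record['s3']['bucket']['name']
            # S3 event keys are URL-encoded, so decode before building the COPY path.
            key = urllib.parse.unquote_plus(record['s3']['object']['key'])

            # Connection and COPY settings supplied as Lambda environment variables.
            # All of these names are placeholders - align them with your own setup.
            conn = psycopg2.connect(
                host=os.environ['REDSHIFT_HOST'],
                port=int(os.environ.get('REDSHIFT_PORT', '5439')),
                dbname=os.environ['REDSHIFT_DB'],
                user=os.environ['REDSHIFT_USER'],
                password=os.environ['REDSHIFT_PASSWORD'],
            )
            try:
                copy_sql = (
                    "copy api_transactions from 's3://" + bucket + "/" + key + "'"
                    " credentials '" + os.environ['REDSHIFT_ROLE'] + "'"
                    " IGNOREHEADER 1 timeformat 'YYYY-MM-DDTHH:MI:SSZ' gzip"
                    " delimiter ',' region '" + os.environ['S3_REGION'] + "'"
                )
                with conn.cursor() as cursor:
                    cursor.execute(copy_sql)
                conn.commit()
            finally:
                conn.close()
            return {'loaded': key}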

    Lastly the function wizard will have created a Role called RedShiftLambdaRole. Go to the IAM dashboard from the Services Menu and select Roles in the left hand sidebar. Now select the role just created.


    We now need to allow this role to delete the S3 data in the trigger. Click on Attach Policy


    Now find and select the AmazonS3FullAccess policy.
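    For reference, the same attachment can be made with boto3 rather than the console; the role name below is the one the wizard created, so adjust it if yours differs.

        import boto3

        iam = boto3.client('iam')

        # Attach the AWS-managed AmazonS3FullAccess policy to the role that the
        # Lambda creation wizard generated.
        iam.attach_role_policy(
            RoleName='RedShiftLambdaRole',
            PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
        )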

    At this stage your function can be tested with a sample file. It should read the file and load the data into Redshift.
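    A simple way to exercise the trigger end to end is to upload a copy of an existing export file back into the bucket under a new key; the file and bucket names below are placeholders.

        import boto3

        s3 = boto3.client('s3')

        # Re-upload an existing gzipped export under a new key so that the
        # ObjectCreated event fires and the Lambda function runs.
        s3.upload_file(
            Filename='sample_calllog.csv.gz',   # a local copy of one export file
            Bucket='mashcallexport',            # placeholder bucket name
            Key='test/sample_calllog.csv.gz',
        )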

    Step 6 - Connecting to Spotfire

    For the purposes of this tutorial we will use the Spotfire Cloud edition, for which you can get access to a trial version here.

    Sign in to Spotfire Cloud here, and navigate to the Account settings page. In the left hand menu select Downloads. Under connectors click on Show list of drivers.


    Select and download the Redshift Connector. 

    This will download a zip file containing the latest Amazon Redshift ODBC driver. Install the appropriate driver for your system (32-bit or 64-bit).

    Open the Spotfire Cloud desktop client.


    On the Add Data page, you can now select Amazon Redshift under the Add Data Connection driver list.

    This will bring up a dialog requesting your connection parameters.


    Fill in the data as required. The Amazon Redshift dashboard will provide you with the information you require (as above with the SQL Workbench/J tool).

    The essential parts are the server, the port and the database name.


    Take the server value and add the port number, joined with a colon:

       server.addr.amazonaws.com:5439

    Use the database name in the database field.

    Once you have your data opened in Spotfire you can start to create reports based on the captured data.


    A sample report is attached to this post as a starting point for your own custom reports.

    Additional resources:

    http://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html

    http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html

    sample_redshift_data_loaderp3.zip

    sample_0.zip

