Accessing data from any REST API in AWS Glue using JDBC

Introduction

AWS Glue is an Extract, Transform, Load (ETL) service available as part of Amazon’s hosted web services. Glue is intended to make it easy for users to connect their data in a variety of data stores, edit and clean the data as needed, and load the data into an AWS-provisioned store for a unified view. 

A lot of organizations now use REST APIs to expose and consume data.  We often see from that they also want to store this data coming from the REST APIs to provide real time business intelligence or analytics. The problem with this approach is that each of these REST APIs are built differently.  Their authentication schemes differ, their response structures differ and when you want to bring in this data into AWS RedShift, S3 or EMR Hive using AWS Glue, you end up writing a lot of code for each of these services.  This can mean a lot of unnecessary effort.

With Progress DataDirect Autonomous REST Connector, you can connect to any REST API without you having to write a single line of code and run SQL queries to access the data via a JDBC interface. In this tutorial we will show how you can use Autonomous REST Connector with AWS Glue to ingest data from any REST API into AWS Redshift, S3, EMR Hive, RDS etc., We will be using the Yelp API for this tutorial and we’ll use AWS Glue to read the API data using Autonomous REST Connector.  Finally, we’ll write it to S3.

Download and Install Autonomous REST Connector

  1. Download the JDBC Autonomous REST Connector.
  2. Install the connector by running the setup executable file on your machine and following the instructions on the installer.
  3. After the installation has completed, you should find Autonomous REST Connector at the following default path, unless you have chosen to install it at a different location:

  4. C:\Program Files\Progress\DataDirect\JDBC_60\lib\autorest.jar

Get your Yelp API Key

  1. To use the Yelp Fusion API, you’ll need to register as a developer and create an app on the Yelp developer website.

  2. Provide the App Name, Industry, Contact Email and Description to create your App. You should now see the ClientID and API Key on your screen, allowing you to authenticate with Yelp’s API.

Configuring Autonomous REST Connector for Yelp

  1. To connect to the Yelp API using Autonomous REST Connector, point it at the endpoint using default values. 

  2.  For the tutorial, I will connect to the Business Search endpoint offered by Yelp. This allows me to get all businesses in my area or to search specific business categories.

  3. To configure the driver to connect to the this endpoint, use the following JDBC URL:


    jdbc:datadirect:autorest:sample=http://api.yelp.com/v3/businesses/search?location=27617;AuthenticationMethod=HttpHeader;AuthHeader=Authorization;SecurityToken='Bearer <Your API Key>'

  4. When you connect to a REST API using Autonomous REST Connector, it will automatically sample the API and create a configuration, which you can access by querying the _CONFIGURATION table. You can get this configuration by using Autonomous REST Connector in any SQL querying tool like Dbeaver, Squirrel SQL etc.,

  5. For this tutorial, download this config file from GitHub and save it as yelp.rest.

  6. If you review the configuration, you notice that Autonomous REST Connector has detected all the objects and their data types.

  7. To learn more about Autonomous REST Connector and how you can configure it to connect to multiple endpoints, we recommend you go through these other tutorials after you have finished this one.

Getting Started

Connecting to Multiple Endpoints

Editing Auto generated Schema

Upload dependencies to S3

  1. Before we start writing the Glue ETL job script, you will need to upload the Autonomous REST Connector autorest.jar file (from the install location) and the yelp.rest file to S3.

  2. You can find the autorest.jar in the lib folder of the install location you chose in the previous section.

Create Amazon Glue Job

  1. Open the AWS Glue Console in your browser.  Under ETL -> Jobs, click the Add Job button to create a new job. You should see an interface as shown below:
    glue configuration

  2. Fill in the name of the job, and choose/create an IAM role that gives permissions to your Amazon S3 sources, targets, temporary directory, scripts, and any libraries used by the job. For this tutorial, we just need access to Amazon S3, as I have my JDBC driver and the destination will also be S3.
    1. Choose “A new script to be authored by you” under This Job Runs options.
    2. Name for your script and choose a temporary directory for Glue Job in S3.
    3. Under Security Configuration, Script Libraries and job parameters(optional), for Dependent Jars path, choose the autorest.jar file in your S3.
    4. For Referenced files path choose yelp.rest file, which you uploaded to S3. Your configuration should like below after you are done.

      full glue configuration

     

  3. Click the Next button. Glue will ask if you want to add any connections that might be required by the job. For this tutorial we don’t need any connections, but if you plan to use another destination such as RedShift, SQL Server, Oracle etc., you can create the connections to these data sources in Glue and those connections will show up here.
  4. Click Next, review your configuration and click Finish to create the job.
  5. You should now see an editor to write a python script for the job. Write your custom Python code to extract data from the Yelp API using DataDirect Autonomous REST Connector and write it to S3 or any other destination.
  6. Download this code sample from GitHub and paste it in to your code editor.

Note: Don’t forget to provide valid API Key in JDBC connection URL.

Run Glue Job

  1. Click on the Run Job button to start the job. You can see the status by going back and selecting the job that you have created.
  2. After the Job has run successfully, you should have a csv file in S3 with the data that you extracted using Autonomous REST Connector.
Status


This is just one example of how easy and painless it can be with Progress DataDirect Autonomous REST Connector to pull data into AWS Glue from any REST API. Feel free to try the connector with any application you want. If you have any questions or issues, please contact us or comment below.


JDBC TUTORIAL

Accessing data from any REST API in AWS Glue using JDBC

DataDirect
connectors

Connect any data source to any application

DataDirect
Tutorials

Discussions, tips and tricks for
DataDirect Connect drivers