JDBC TUTORIAL

Accessing Data using JDBC on AWS Glue

Updated: 05 Oct 2021

Introduction

Note: You can now connect to various data sources easily from AWS Glue using the Progress DataDirect Cloud Connectors available in the marketplace. Learn how to connect to Salesforce using AWS Glue Connectors in this new tutorial.

AWS Glue is an Extract, Transform, Load (ETL) service available as part of Amazon’s hosted web services. Glue is intended to make it easy for users to connect their data in a variety of data stores, edit and clean the data as needed, and load the data into an AWS-provisioned store for a unified view. 

Glue supports accessing data via JDBC, and currently the databases supported through JDBC are Postgres, MySQL, Redshift, and Aurora. Of course, JDBC drivers exist for many other databases besides these four. Using the DataDirect JDBC connectors, you can access many other data sources for use in AWS Glue.

This tutorial demonstrates accessing Salesforce data with AWS Glue, but the same steps apply with any of the DataDirect JDBC drivers.

Download DataDirect Salesforce JDBC driver

  1. Download DataDirect Salesforce JDBC driver from here.
  2. To install the driver, run the .jar package, either by executing the following command in a terminal or by double-clicking the JAR file:
    java -jar PROGRESS_DATADIRECT_JDBC_SF_ALL.jar
  3. This launches an interactive Java installer, which you can use to install the Salesforce JDBC driver to your desired location as either a licensed or evaluation installation.
  4. Note that this installs the Salesforce JDBC driver, along with a number of other drivers for your trial purposes, in the same folder.

Upload DataDirect Salesforce Driver to Amazon S3

  1. Navigate to the install location of the DataDirect JDBC drivers and locate the DataDirect Salesforce JDBC driver file, named sforce.jar.
  2. Upload the Salesforce JDBC JAR file to an Amazon S3 bucket, either through the S3 console or programmatically as shown in the sketch below.
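
If you prefer to script the upload instead of using the S3 console, here is a minimal boto3 sketch. The local driver path, bucket name (glueuserdata) and object key are assumptions based on this tutorial's setup, so substitute your own values.

import boto3

# Upload the DataDirect Salesforce JDBC driver JAR to S3.
# The local path, bucket and key below are placeholder values -- adjust them to your setup.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="/path/to/Progress/DataDirect/lib/sforce.jar",  # assumed local install path
    Bucket="glueuserdata",                                    # assumed S3 bucket
    Key="jars/sforce.jar"                                     # assumed object key
)
print("Uploaded sforce.jar to s3://glueuserdata/jars/sforce.jar")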

Create Amazon Glue Job

  1. Go to the AWS Glue Console in your browser. Under ETL -> Jobs, click on the Add Job button to create a new job. You should see an interface as shown below.

    [Screenshot: AWS Glue Add Job configuration screen]

  2. Fill in the name of the job, and choose or create an IAM role that grants permissions to your Amazon S3 sources, targets, temporary directory, scripts, and any libraries used by the job. For this tutorial, we just need access to Amazon S3, since the JDBC driver is stored there and the destination is also S3.
  3. Choose “A new script to be authored by you” under the “This job runs” options.
  4. Give a name for your script and choose a temporary directory for the Glue job in S3.
  5. Under Script libraries and job parameters (optional), for Dependent jars path, choose the sforce.jar file in your S3 bucket. Your configuration should look as shown below.

    [Screenshot: Glue job configuration with the dependent JARs path set to sforce.jar]

  6. Click on the “Next” button; Glue asks if you want to add any connections that might be required by the job. In this tutorial, we don’t need any connections, but if you plan to use another destination such as Redshift, SQL Server, or Oracle, you can create connections to these data sources in Glue and they will show up here.
  7. Click on Next, review your configuration and click on Finish to create the job.
  8. You should now see an editor to write a Python script for the job. Here you write your custom Python code to extract data from Salesforce using the DataDirect JDBC driver and write it to S3 or any other destination.
  9. You can use the code sample below to get an idea of how to extract data from Salesforce using the DataDirect JDBC driver and write it to S3 in CSV format. Feel free to make any changes to suit your needs. Save the job.

Code Sample

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
 
 
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
 
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
 
## Read data from Salesforce into a Spark DataFrame using the DataDirect JDBC driver.
## Replace the SecurityToken, user and password values with your own Salesforce credentials.
source_df = spark.read.format("jdbc") \
    .option("url", "jdbc:datadirect:sforce://login.salesforce.com;SecurityToken=<token>") \
    .option("dbtable", "SFORCE.OPPORTUNITY") \
    .option("driver", "com.ddtek.jdbc.sforce.SForceDriver") \
    .option("user", "user@mail.com") \
    .option("password", "pass123") \
    .load()

## Initialize the Glue job
job.init(args['JOB_NAME'], args)

## Convert the Spark DataFrame to an AWS Glue DynamicFrame
dynamic_dframe = DynamicFrame.fromDF(source_df, glueContext, "dynamic_df")

## Write the DynamicFrame to S3 in CSV format. You can also write it to any RDS/Redshift
## destination by using a connection that you have defined previously in Glue.
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = dynamic_dframe,
    connection_type = "s3",
    connection_options = {"path": "s3://glueuserdata"},
    format = "csv",
    transformation_ctx = "datasink4"
)
 
job.commit()
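
If you would rather load the extracted data into a destination such as Redshift through a Glue connection (see step 6 above), the write step could instead use a connection defined in Glue. Below is a minimal sketch, assuming a Glue connection named "my-redshift-connection", a Redshift database "dev", a target table "public.opportunity", and a temporary S3 path; replace these with your own names.

## Sketch only: write the same DynamicFrame to Redshift through a pre-defined Glue connection.
## "my-redshift-connection", "dev", "public.opportunity" and the temp path are assumed values.
redshift_sink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = dynamic_dframe,
    catalog_connection = "my-redshift-connection",
    connection_options = {"dbtable": "public.opportunity", "database": "dev"},
    redshift_tmp_dir = "s3://glueuserdata/temp/",
    transformation_ctx = "redshift_sink"
)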

Run Glue Job

  1. Click on the Run Job button to start the job. You can check the status by going back and selecting the job that you created, or programmatically as shown in the sketch after this list.
  2. After the job has run successfully, you should have a CSV file in S3 containing the data that you extracted using the DataDirect Salesforce JDBC driver.
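
Besides checking the console, you can also poll the run status programmatically. Here is a minimal boto3 sketch; the job name "SalesforceToS3" is an assumed example, so use whatever name you gave your job.

import boto3

# Check the status of the most recent run of the Glue job.
# "SalesforceToS3" is an assumed job name -- replace it with the name of your job.
glue = boto3.client("glue")
runs = glue.get_job_runs(JobName="SalesforceToS3", MaxResults=1)
if runs["JobRuns"]:
    latest = runs["JobRuns"][0]
    print(latest["Id"], latest["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED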


You can use similar steps with any of the DataDirect JDBC suite of drivers available for relational, big data, SaaS, and NoSQL data sources. Feel free to try any of our drivers with AWS Glue for your ETL jobs during the 15-day trial period.
