How ETL Operates With AWS Glue

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize, clean, and enrich your data, and move it between various data stores. It consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there is no infrastructure to manage.

This means you can concentrate on building your jobs and scripting your business logic, rather than on provisioning servers, installing tools, and keeping track of when jobs need to run.

Event-Driven Or Scheduled

You can either schedule your jobs to run at predefined intervals, or have them run in response to events on S3 buckets. For example, you can set up a Lambda function that starts your job whenever a new file is dropped into a specific bucket.
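
As a rough illustration of the Lambda approach, the sketch below starts a Glue job run for each object named in an S3 event notification. The job name and the two argument names are hypothetical and would need to match your own job; `start_job_run` is the boto3 Glue client call, and the optional `glue_client` parameter exists only so the handler can be exercised without AWS access.

```python
import urllib.parse

# Hypothetical job name; replace with the name of your own Glue job.
GLUE_JOB_NAME = "my-etl-job"


def lambda_handler(event, context, glue_client=None):
    """Start one Glue job run for every object named in an S3 event."""
    if glue_client is None:
        # boto3 ships with the AWS Lambda Python runtime.
        import boto3
        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Argument names are assumptions; the job script must read them.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return {"job_run_ids": run_ids}
```

The Lambda function would also need an S3 event notification on the bucket and an IAM role that is allowed to start the Glue job.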

What Does It Do?
1. It collects information about your data sources. This includes where the data is stored, and the underlying schema of that data.

2. It builds transformations between data sources. AWS Glue uses crawlers to inspect your data and infer its schema, and auto-generates the code needed to transform it from source to destination.

3. It manages jobs to move the data, with flexible scheduling and automatic retries.

4. It seamlessly integrates with other AWS Services, including S3 and Amazon Redshift Spectrum.
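
The scheduling mentioned in point 3 could be set up from code roughly as follows. This is a minimal sketch using the boto3 Glue client's `create_trigger` call; the trigger name, job name, and cron expression (02:00 UTC every day) are made up for illustration.

```python
def scheduled_trigger_request(trigger_name, job_name, cron_expression):
    """Build the keyword arguments for the Glue CreateTrigger API call."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"JobName": job_name}],
        # Activate the trigger immediately instead of leaving it disabled.
        "StartOnCreation": True,
    }


def create_nightly_trigger(glue_client=None):
    """Create a trigger that runs a (hypothetical) job once a night."""
    if glue_client is None:
        import boto3
        glue_client = boto3.client("glue")
    request = scheduled_trigger_request(
        "nightly-etl-trigger",  # hypothetical trigger name
        "my-etl-job",           # hypothetical job name
        "cron(0 2 * * ? *)",    # AWS cron syntax: 02:00 UTC daily
    )
    return glue_client.create_trigger(**request)
```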

How Does It Work?

Set Up The Crawler

With the data in hand, the next step is to point an AWS Glue crawler at it. The crawler inspects the data and generates a schema describing what it finds. AWS Glue also supports custom classifiers for more complicated data sets.
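
If you prefer to script this step rather than use the console, a crawler can be created and started through boto3. The sketch below assumes an S3 data source; the role ARN, database name, and bucket path are placeholders you would supply, and the optional `classifiers` argument takes the names of custom classifiers you have already defined.

```python
def crawler_request(name, role_arn, database, s3_path, classifiers=None):
    """Build the keyword arguments for the Glue CreateCrawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,
        # Data Catalog database where the crawler writes its tables.
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if classifiers:
        # Names of custom classifiers to try before the built-in ones.
        request["Classifiers"] = list(classifiers)
    return request


def create_and_start_crawler(request, glue_client=None):
    """Create the crawler described by `request`, then start it."""
    if glue_client is None:
        import boto3
        glue_client = boto3.client("glue")
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
    return request["Name"]
```

Once the crawler finishes, the inferred tables appear in the chosen Data Catalog database.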

Create A Job

With the schema in place, we can create a job. We don't need any fancy scheduling here; we just need it to execute.

AWS Glue offers a GUI to define your input/output mappings, or you can just edit the script directly. For this simple example, I removed some of the output fields (so we're effectively reducing the number of columns in our output data set).
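
For readers who would rather define and run the job from code than from the console, here is a hedged sketch using the boto3 Glue client. The job name, role, and script location are placeholders, the Glue version is an assumption, and the polling loop simply waits until the run reaches a terminal state.

```python
import time

# Job run states after which the run will not change any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def job_request(name, role_arn, script_location):
    """Build the keyword arguments for the Glue CreateJob API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # a Spark ETL job
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",  # assumption; choose the version you target
        "MaxRetries": 1,       # retry a failed run once
    }


def run_job_and_wait(job_name, glue_client, poll_seconds=30):
    """Start a job run and block until it reaches a terminal state."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
```

With real credentials, usage would be along the lines of `glue = boto3.client("glue")`, `glue.create_job(**job_request(...))`, then `run_job_and_wait("my-etl-job", glue)`.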

Upon successful completion of our job, we now have a (transformed) data set in our S3 storage!

You can write your own ETL scripts using Python or Scala.
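
A Python job script that drops a few output fields, as in the example above, might look roughly like this. It is a sketch rather than the script Glue generated for this post: the database, table, output path, and field names are all hypothetical, and the `awsglue` modules are only available inside the Glue job runtime, which is why they are imported inside the function.

```python
import sys

# Hypothetical names, used for illustration only.
DATABASE = "my_catalog_db"
TABLE = "raw_events"
OUTPUT_PATH = "s3://my-output-bucket/transformed/"
DROPPED_FIELDS = ["internal_id", "debug_info"]


def run_job(argv):
    """Read a catalog table, drop some fields, and write Parquet to S3."""
    # These modules are provided by the Glue job runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import DropFields
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Load the table that the crawler registered in the Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database=DATABASE, table_name=TABLE
    )
    # Remove the unwanted columns from every record.
    trimmed = DropFields.apply(frame=source, paths=DROPPED_FIELDS)
    glue_context.write_dynamic_frame.from_options(
        frame=trimmed,
        connection_type="s3",
        connection_options={"path": OUTPUT_PATH},
        format="parquet",
    )
    job.commit()


# Assumption: Glue passes --JOB_NAME on the command line, so the job only
# runs when the script is launched by Glue.
if "--JOB_NAME" in sys.argv:
    run_job(sys.argv)
```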

S3 Data Into Amazon Redshift

Amazon Redshift is a powerful data warehouse solution, and perfect for our needs. Using the COPY command, we can easily load the data that Glue wrote to S3 into Amazon Redshift.

This shows how easy, fast and scalable it is to crawl, merge and write data for ETL operations using AWS Glue.
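
To make the COPY step concrete, the sketch below builds a COPY statement and submits it through the Redshift Data API (`execute_statement` on the boto3 `redshift-data` client). It assumes the job wrote Parquet files and that the target table already exists; the table name, S3 path, role ARN, and cluster details are placeholders.

```python
def build_copy_statement(table, s3_path, iam_role_arn):
    """Return a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS PARQUET;"
    )


def run_copy(sql, cluster_id, database, db_user, client=None):
    """Submit the statement via the Redshift Data API; return its id."""
    if client is None:
        import boto3
        client = boto3.client("redshift-data")
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return response["Id"]
```

The same statement can also be run from any SQL client connected to the cluster; the IAM role must be attached to the cluster and allowed to read the bucket.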

Thursday, 28 March 2019