Monday, June 27, 2022
HomeBig DataConstruct an Apache Iceberg knowledge lake utilizing Amazon Athena, Amazon EMR, and...

Construct an Apache Iceberg knowledge lake utilizing Amazon Athena, Amazon EMR, and AWS Glue

Most companies retailer their essential knowledge in a knowledge lake, the place you’ll be able to deliver knowledge from numerous sources to a centralized storage. The information is processed by specialised massive knowledge compute engines, corresponding to Amazon Athena for interactive queries, Amazon EMR for Apache Spark purposes, Amazon SageMaker for machine studying, and Amazon QuickSight for knowledge visualization.

Apache Iceberg is an open-source desk format for knowledge saved in knowledge lakes. It’s optimized for knowledge entry patterns in Amazon Easy Storage Service (Amazon S3) cloud object storage. Iceberg helps knowledge engineers sort out complicated challenges in knowledge lakes corresponding to managing constantly evolving datasets whereas sustaining question efficiency. Iceberg means that you can do the next:

  • Keep transactional consistency the place recordsdata will be added, eliminated, or modified atomically with full learn isolation and a number of concurrent writes
  • Implement full schema evolution to course of protected desk schema updates because the desk knowledge evolves
  • Arrange tables into versatile partition layouts with partition evolution, enabling updates to partition schemes as queries and knowledge quantity modifications with out counting on bodily directories
  • Carry out row-level replace and delete operations to fulfill new regulatory necessities such because the Basic Information Safety Regulation (GDPR)
  • Present versioned tables and assist time journey queries to question historic knowledge and confirm modifications between updates
  • Roll again tables to prior variations to return tables to a identified good state in case of any points

In 2021, AWS groups contributed the Apache Iceberg integration with the AWS Glue Information Catalog to open supply, which lets you use open-source compute engines like Apache Spark with Iceberg on AWS Glue. In 2022, Amazon Athena introduced assist of Iceberg and Amazon EMR added assist of Iceberg beginning with model 6.5.0.

On this publish, we present you use Amazon EMR Spark to create an Iceberg desk, load pattern books overview knowledge, and use Athena to question, carry out schema evolution, row-level replace and delete, and time journey, all coordinated via the AWS Glue Information Catalog.

Answer overview

We use the Amazon Buyer Evaluations public dataset as our supply knowledge. The dataset comprises knowledge recordsdata in Apache Parquet format on Amazon S3. We load all of the book-related Amazon overview knowledge as an Iceberg desk to reveal some great benefits of utilizing the Iceberg desk format on prime of uncooked Parquet recordsdata. The next diagram illustrates our resolution structure.

Architecture that shows the flow from Amazon EMR loading data into Amazon S3, and queried by Amazon Athena through AWS Glue Data Catalog.

To arrange and take a look at this resolution, we full the next high-level steps:

  1. Create an S3 bucket.
  2. Create an EMR cluster.
  3. Create an EMR pocket book.
  4. Configure a Spark session.
  5. Load knowledge into the Iceberg desk.
  6. Question the information in Athena.
  7. Carry out a row-level replace in Athena.
  8. Carry out a schema evolution in Athena.
  9. Carry out time journey in Athena.
  10. Eat Iceberg knowledge throughout Amazon EMR and Athena.


To observe together with this walkthrough, it’s essential to have the next:

  • An AWS Account with a job that has adequate entry to provision the required assets.

Create an S3 bucket

To create an S3 bucket that holds your Iceberg knowledge, full the next steps:

  1. On the Amazon S3 console, select Buckets within the navigation pane.
  2. Select Create bucket.
  3. For Bucket identify, enter a reputation (for this publish, we enter aws-lake-house-iceberg-blog-demo).

As a result of S3 bucket names are globally distinctive, select a special identify if you create your bucket.

  1. For AWS Area, select your most popular Area (for this publish, we use us-east-1).

Create a new Amazon S3 bucket. Choose us-east-1 as region

  1. Full the remaining steps to create your bucket.
  2. If that is the primary time that you just’re utilizing Athena to run queries, create one other globally distinctive S3 bucket to carry your Athena question output.

Create an EMR cluster

Now we’re prepared to start out an EMR cluster to run Iceberg jobs utilizing Spark.

  1. On the Amazon EMR console, select Create cluster.
  2. Select Superior choices.
  3. For Software program Configuration, select your Amazon EMR launch model.

Iceberg requires launch 6.5.0 and above.

  1. Choose JupyterEnterpriseGateway and Spark because the software program to put in.
  2. For Edit software program settings, choose Enter configuration and enter [{"classification":"iceberg-defaults","properties":{"iceberg.enabled":true}}].
  3. Depart different settings at their default and select Subsequent.

Choose Amazon EMR release 6.6.0 and JupyterEnterpriseGateway and Spark. Enter configuration information.

  1. You possibly can change the {hardware} utilized by the Amazon EMR cluster on this step. On this demo, we use the default setting.
  2. Select Subsequent.
  3. For Cluster identify, enter Iceberg Spark Cluster.
  4. Depart the remaining settings unchanged and select Subsequent.

Provide Iceberg Spark Cluster as the Cluster name

  1. You possibly can configure safety settings corresponding to including an EC2 key pair to entry your EMR cluster regionally. On this demo, we use the default setting.
  2. Select Create cluster.

You’re redirected to the cluster element web page, the place you await the EMR cluster to transition from Beginning to Ready.

Create an EMR pocket book

When the cluster is lively and within the Ready state, we’re able to run Spark packages within the cluster. For this demo, we use an EMR pocket book to run Spark instructions.

  1. On the Amazon EMR console, select Notebooks within the navigation pane.
  2. Select Create pocket book.
  3. For Pocket book identify, enter a reputation (for this publish, we enter iceberg-spark-notebook).
  4. For Cluster, choose Select an present cluster and select Iceberg Spark Cluster.
  5. For AWS service function, select Create a brand new function to create EMR_Notebook_DefaultRole or select a special function to entry assets within the pocket book.
  6. Select Create pocket book.

Create an Amazon EMR notebook. Use EMR_Notebooks_DefaultRole

You’re redirected to the pocket book element web page.

  1. Select Open in JupyterLab subsequent to your pocket book.
  2. Select to create a brand new pocket book.
  3. Below Pocket book, select Spark.

Choose Spark from the options provided in the Launcher

Configure a Spark session

In your pocket book, run the next code:

%%configure -f
  "conf": {
    "spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.demo.catalog-impl": "",
    "spark.sql.catalog.demo.warehouse": "s3://<your-iceberg-blog-demo-bucket>",

This units the next Spark session configurations:

  • spark.sql.catalog.demo – Registers a Spark catalog named demo, which makes use of the Iceberg Spark catalog plugin
  • spark.sql.catalog.demo.catalog-impl – The demo Spark catalog makes use of AWS Glue because the bodily catalog to retailer Iceberg database and desk info
  • spark.sql.catalog.demo.warehouse – The demo Spark catalog shops all Iceberg metadata and knowledge recordsdata below the basis path s3://<your-iceberg-blog-demo-bucket>
  • spark.sql.extensions – Provides assist to Iceberg Spark SQL extensions, which lets you run Iceberg Spark procedures and a few Iceberg-only SQL instructions (you employ this in a later step)

Load knowledge into the Iceberg desk

In our Spark session, run the next instructions to load knowledge:

// create a database in AWS Glue named evaluations if not exist
spark.sql("CREATE DATABASE IF NOT EXISTS demo.evaluations")

// load evaluations associated to books
val book_reviews_location = "s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet"
val book_reviews = spark.learn.parquet(book_reviews_location)

// write e-book evaluations knowledge to an Iceberg v2 desk
book_reviews.writeTo("demo.evaluations.book_reviews").tableProperty("format-version", "2").createOrReplace()

Iceberg format v2 is required to assist row-level updates and deletes. See Format Versioning for extra particulars.

It might take as much as quarter-hour for the instructions to finish. When it’s full, it’s best to be capable of see the desk on the AWS Glue console, below the evaluations database, with the table_type property proven as ICEBERG.

Shows the table properties for book_reviews table

The desk schema is inferred from the supply Parquet knowledge recordsdata. You may also create the desk with a particular schema earlier than loading knowledge utilizing Spark SQL, Athena SQL, or Iceberg Java and Python SDKs.

Question in Athena

Navigate to the Athena console and select Question editor. If that is your first time utilizing the Athena question editor, it’s essential configure to make use of the S3 bucket you created earlier to retailer the question outcomes.

The desk book_reviews is offered for querying. Run the next question:

SELECT * FROM evaluations.book_reviews LIMIT 5;

The next screenshot exhibits the primary 5 data from the desk being displayed.

Amazon Athena query the first 5 rows and show the results

Carry out a row-level replace in Athena

Within the subsequent few steps, let’s give attention to a document within the desk with overview ID RZDVOUQG1GBG7. Presently, it has no complete votes after we run the next question:

SELECT total_votes FROM evaluations.book_reviews 
WHERE review_id = 'RZDVOUQG1GBG7'

Query total_votes for a particular review which shows a value of 0

Let’s replace the total_votes worth to 2 utilizing the next question:

UPDATE evaluations.book_reviews
SET total_votes = 2
WHERE review_id = 'RZDVOUQG1GBG7'

Update query to set the total_votes for the previous review_id to 2

After your replace command runs efficiently, run the beneath question and word the up to date outcome exhibiting a complete of two votes:

SELECT total_votes FROM evaluations.book_reviews
WHERE review_id = 'RZDVOUQG1GBG7'

Athena enforces ACID transaction assure for all of the write operations in opposition to an Iceberg desk. That is carried out via the Iceberg format’s optimistic locking specification. When concurrent makes an attempt are made to replace the identical document, a commit battle happens. On this situation, Athena shows a transaction battle error, as proven within the following screenshot.

Concurrent updates causes a failure. This shows the TRANSACTION_CONFLICT error during this scenario.

Delete queries work in an identical means; see DELETE for extra particulars.

Carry out a schema evolution in Athena

Suppose the overview all of the sudden goes viral and will get 10 billion votes:

UPDATE evaluations.book_reviews
SET total_votes = 10000000000
WHERE review_id = 'RZDVOUQG1GBG7'

Based mostly on the AWS Glue desk info, the total_votes is an integer column. Should you attempt to replace a price of 10 billion, which is bigger than the utmost allowed integer worth, you get an error reporting a kind mismatch.

Updating to a very large value greater than maximum allowed integer value results in an error

Iceberg helps most schema evolution options as metadata-only operations, which don’t require a desk rewrite. This consists of add, drop, rename, reorder column, and promote column sorts. To unravel this situation, you’ll be able to change the integer column total_votes to a BIGINT sort by operating the next DDL:

ALTER TABLE evaluations.book_reviews
CHANGE COLUMN total_votes total_votes BIGINT;

Now you can replace the worth efficiently:

UPDATE evaluations.book_reviews
SET total_votes = 10000000000
WHERE review_id = 'RZDVOUQG1GBG7'

Querying the document now provides us the anticipated end in BIGINT:

SELECT total_votes FROM evaluations.book_reviews
WHERE review_id = 'RZDVOUQG1GBG7'

Carry out time journey in Athena

In Iceberg, the transaction historical past is retained, and every transaction commit creates a brand new model. You possibly can carry out time journey to have a look at a historic model of a desk. In Athena, you need to use the next syntax to journey to a time that’s after when the primary model was dedicated:

SELECT total_votes FROM evaluations.book_reviews
FOR SYSTEM_TIME AS OF localtimestamp + interval '-20' minute
WHERE review_id = 'RZDVOUQG1GBG7'

Query an earlier snapshot using time travel feature

Eat Iceberg knowledge throughout Amazon EMR and Athena

Probably the most essential options of a knowledge lake is for various methods to seamlessly work collectively via the Iceberg open-source protocol. After all of the operations are carried out in Athena, let’s return to Amazon EMR and ensure that Amazon EMR Spark can devour the up to date knowledge.

First, run the identical Spark SQL and see in case you get the identical outcome for the overview used within the instance:

val select_votes = """SELECT total_votes FROM demo.evaluations.book_reviews
WHERE review_id = 'RZDVOUQG1GBG7'"""


Spark exhibits 10 billion complete votes for the overview.

Shows the latest value of total_votes when querying using the Amazon EMR notebook

Examine the transaction historical past of the operation in Athena via Spark Iceberg’s historical past system desk:

val select_history = "SELECT * FROM demo.evaluations.book_reviews.historical past"


This exhibits three transactions similar to the 2 updates you ran in Athena.

Shows snapshots corresponding to the two updates you ran in Athena

Iceberg presents quite a lot of Spark procedures to optimize the desk. For instance, you’ll be able to run an expire_snapshots process to take away previous snapshots, and liberate space for storing in Amazon S3:

import java.util.Calendar
import java.textual content.SimpleDateFormat

val now = Calendar.getInstance().getTime()
val kind = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val now_formatted = kind.format(now.getTime())
val process = s"""CALL demo.system.expire_snapshots(
  desk => 'evaluations.book_reviews',
  older_than => TIMESTAMP '$now_formatted',
  retain_last => 1)"""


Observe that, after operating this process, time journey can now not be carried out in opposition to expired snapshots.

Look at the historical past system desk once more and see that it exhibits you solely the latest snapshot.

Working the next question in Athena ends in an error “No desk snapshot discovered earlier than timestamp…” as older snapshots had been deleted, and you might be now not in a position to time journey to the older snapshot:

SELECT total_votes FROM evaluations.book_reviews
FOR SYSTEM_TIME AS OF localtimestamp + interval '-20' minute
WHERE review_id = 'RZDVOUQG1GBG7'

Clear up

To keep away from incurring ongoing prices, full the next steps to scrub up your assets:

  1. Run the next code in your pocket book to drop the AWS Glue desk and database:
// DROP the desk 
spark.sql("DROP TABLE demo.evaluations.book_reviews") 
// DROP the database 
spark.sql("DROP DATABASE demo.evaluations")

  1. On the Amazon EMR console, select Notebooks within the navigation pane.
  2. Choose the pocket book iceberg-spark-notebook and select Delete.
  3. Select Clusters within the navigation pane.
  4. Choose the cluster Iceberg Spark Cluster and select Terminate.
  5. Delete the S3 buckets and every other assets that you just created as a part of the conditions for this publish.


On this publish, we confirmed you an instance of utilizing Amazon S3, AWS Glue, Amazon EMR, and Athena to construct an Iceberg knowledge lake on AWS. An Iceberg desk can seamlessly work throughout two widespread compute engines, and you may benefit from each to design your personalized knowledge manufacturing and consumption use circumstances.

With AWS Glue, Amazon EMR, and Athena, you’ll be able to already use many options via AWS integrations, corresponding to SageMaker Athena integration for machine studying, or QuickSight Athena integration for dashboard and reporting. AWS Glue additionally presents the Iceberg connector, which you need to use to creator and run Iceberg knowledge pipelines.

As well as, Iceberg helps quite a lot of different open-source compute engines that you could select from. For instance, you need to use Apache Flink on Amazon EMR for streaming and alter knowledge seize (CDC) use circumstances. The robust transaction assure and environment friendly row-level replace, delete, time journey, and schema evolution expertise supplied by Iceberg presents a sound basis and infinite potentialities for customers to unlock the facility of massive knowledge.

In regards to the Authors

Kishore Dhamodaran is a Senior Options Architect at AWS. Kishore helps strategic clients with their cloud enterprise technique and migration journey, leveraging his years of trade and cloud expertise.

Jack Ye is a software program engineer of the Athena Information Lake and Storage workforce. He’s an Apache Iceberg Committer and PMC member.

Mohit Mehta is a Principal Architect at AWS with experience in AI/ML and knowledge analytics. He holds 12 AWS certifications and is obsessed with serving to clients implement cloud enterprise methods for digital transformation. In his free time, he trains for marathons and plans hikes throughout main peaks world wide.

Giovanni Matteo Fumarola is the Engineering Supervisor of the Athena Information Lake and Storage workforce. He’s an Apache Hadoop Committer and PMC member. He has been focusing within the massive knowledge analytics house since 2013.

Jared Keating is a Senior Cloud Advisor with AWS Skilled Companies. Jared assists clients with their cloud infrastructure, compliance, and automation necessities, drawing from his 20+ years of IT expertise.



Please enter your comment!
Please enter your name here

Most Popular

Recent Comments