How Can We Transform JSON/CSV Files to Parquet with AWS Glue?
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage.
We can convert data between the following formats:
- CSV
- JSON
- Parquet
- Avro
- ORC
Why Parquet?
- Apache Parquet is implemented using the record shredding and assembly algorithm, which accommodates the complex, nested data structures often used to store data.
- Parquet is optimized to work with complex data in bulk and offers a variety of efficient data compression and encoding schemes.
This approach is best for queries that need to read only certain columns from a large table, which is the typical pattern for the large files we query in Athena. Because Parquet is columnar, only the needed columns are read from storage, which greatly reduces IO and gives us better control over both performance and cost.
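To make the column-pruning point concrete, here is a minimal sketch, assuming pyarrow is installed and a local file named users.parquet with name and email columns exists (the file and column names are illustrative):

```python
import pyarrow.parquet as pq

# Read only the columns we need; the other columns are never loaded from disk,
# which is what keeps the IO low for wide tables.
table = pq.read_table("users.parquet", columns=["name", "email"])
print(table.num_rows, table.column_names)
```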
Let’s start creating!
Create S3 Bucket
First of all, we will create two buckets: one for the input (JSON, CSV) and the other for the output (Parquet).
- mytestbucketjson for our JSON input.
- mybucketparquet for our Parquet output.
- We will upload this dummy JSON file into the mytestbucketjson S3 bucket (a boto3 sketch of this step follows the list).
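Here is a minimal boto3 sketch of this step. The sample records are purely illustrative (the actual dummy JSON may differ), bucket names must be globally unique, and in regions other than us-east-1 create_bucket also needs a CreateBucketConfiguration:

```python
import json
import boto3

s3 = boto3.client("s3")

# Create the input and output buckets.
s3.create_bucket(Bucket="mytestbucketjson")
s3.create_bucket(Bucket="mybucketparquet")

# Upload a small dummy JSON file (one JSON object per line is easy for Glue to crawl).
users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]
body = "\n".join(json.dumps(u) for u in users)
s3.put_object(Bucket="mytestbucketjson", Key="users.json", Body=body)
```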
Configure the Crawler
- A crawler is used to catalog your data: it infers metadata from your logs or tables.
- The crawler adds that metadata to a Glue table, which makes the data searchable and discoverable.
- In short, a crawler scans your data to learn its format and schema.
- Go to AWS Glue page.
- Click Crawlers, then Add crawler.
- Enter the Crawler name.
- You can adjust crawler-specific settings if you want to.
- Add the S3 path or a specific file. (After creating the Glue job, we will transform the JSON to Parquet.)
- If you don’t want to add another data store, click No.
- We need to create a role; we will name it AWSGlueServiceRole.
- We can also schedule our crawler. I will choose Run on demand mode.
- Then, we will add the name of the database that the crawler will create.
- After reviewing, click the Finish button.
- That’s all, we have just created the crawler :) (A boto3 sketch of the same setup follows this list.)
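The same crawler can be created from code. This is only a rough sketch; the crawler, role, and database names are illustrative and should match whatever you chose in the console:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that catalogs everything under the input bucket.
glue.create_crawler(
    Name="json-users-crawler",   # illustrative crawler name
    Role="AWSGlueServiceRole",   # the role created above
    DatabaseName="mytestdb",     # illustrative database name
    Targets={"S3Targets": [{"Path": "s3://mytestbucketjson/"}]},
)

# "Run on demand": start the crawler explicitly instead of scheduling it.
glue.start_crawler(Name="json-users-crawler")
```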
Now, as you remember, we created two S3 buckets. We need to grant S3 access to the AWSGlueServiceRole role so that the Parquet file can be written into mybucketparquet.
- Go to IAM -> Roles, then choose AWSGlueServiceRole.
- Add permissions -> Create inline policy.
- Add S3; write permission on the output bucket is enough (a boto3 sketch of an equivalent inline policy follows this list).
- Review and create.
- We have finished the crawler setup; only the run is left.
- After the Glue job finishes, we will run the crawler.
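An equivalent inline policy can also be attached with boto3. This is a minimal sketch; the role name, policy name, and actions are illustrative, and your job may additionally need read access (s3:GetObject) on the input bucket:

```python
import json
import boto3

iam = boto3.client("iam")

# Allow writes to the output bucket only.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::mybucketparquet/*",
        }
    ],
}

iam.put_role_policy(
    RoleName="AWSGlueServiceRole",
    PolicyName="glue-write-parquet-output",  # illustrative policy name
    PolicyDocument=json.dumps(policy),
)
```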
Create Glue Job
- Typically, a job runs extract, transform, and load (ETL) scripts.
- AWS Glue -> Jobs.
- We could write and run our own script, but we will choose Visual with a source and target, since it is the easier way.
- Click Data source - S3 bucket.
- Click Browse S3 and choose our dummy JSON file.
- Then, configure the Data target.
- We will choose Format -> Parquet and S3 Target Location -> mybucketparquet.
- Click Job details.
- Add a Name.
- Choose the IAM role that we created (AWSGlueServiceRoleTest).
- Then save and run the job (a sketch of the script that the Visual editor generates is shown after this list).
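Behind the Visual editor, Glue generates a PySpark script along these lines. This is only a rough sketch with illustrative paths, not the exact generated code:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Data source: the dummy JSON file in the input bucket.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://mytestbucketjson/users.json"]},
)

# Data target: write the same records out as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://mybucketparquet/"},
)

job.commit()
```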
When we look at the output S3 bucket, we will see the Parquet file.
Run Crawler
- Go to Glue -> Crawlers and run the crawler.
- Then click the Database section on Glue.
- Click the database name.
- We will see a column named users.
- To see its elements, click the users column name. (A boto3 sketch of inspecting the catalog from code follows this list.)
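The same schema information can be read from the Data Catalog with boto3. A minimal sketch, reusing the illustrative database name from earlier:

```python
import boto3

glue = boto3.client("glue")

# List every table the crawler created in the database, with its columns.
for tbl in glue.get_tables(DatabaseName="mytestdb")["TableList"]:
    print("table:", tbl["Name"])
    for col in tbl["StorageDescriptor"]["Columns"]:
        print("  column:", col["Name"], col["Type"])
```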
Amazon Athena
- Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
- Athena is serverless, so there is no infrastructure to manage, and you only pay for the queries you run.
- You can set up and start using Amazon Athena easily; a minimal boto3 sketch of querying our new table follows.
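For completeness, here is a minimal boto3 sketch of running a query against the crawled table. The database, table, and results location are illustrative; Athena needs an S3 location to store query results:

```python
import boto3

athena = boto3.client("athena")

# Start the query; Athena writes the results to the given S3 location.
response = athena.start_query_execution(
    QueryString="SELECT * FROM users LIMIT 10",
    QueryExecutionContext={"Database": "mytestdb"},
    ResultConfiguration={"OutputLocation": "s3://mybucketparquet/athena-results/"},
)
print("QueryExecutionId:", response["QueryExecutionId"])
```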