Getting started with Apache Parquet

This article explains how to reverse-engineer Apache Parquet files in Semarchy xDI and use them in supported mappings. Apache Parquet is an open-source file format that stores data in a columnar layout.

Create a metadata

To create an Apache Parquet metadata:

  1. Right-click a folder in your project and then select New > Metadata.

  2. In the New Metadata wizard, select Parquet and then click Next.

    [Image: New Metadata wizard with Parquet selected]

  3. Name the metadata and click Next.

  4. Select the installed module and click Finish.

The metadata is created with a root Schemas node. This node holds the Apache Parquet files that you reverse-engineer.

[Image: metadata with its root Schemas node]

Reverse-engineer Apache Parquet files

To reverse-engineer an Apache Parquet file:

  1. Right-click the Schemas node and select New > Schema.

    [Image: New > Schema on the Schemas node]

  2. Select the newly created Schema node. Set the File Path property to the path of the Apache Parquet file.

  3. Right-click the Schema node and select Action > Reverse.

The Apache Parquet file is reverse-engineered as a schema node and is ready to use in mappings.
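Before setting the File Path property, it can help to confirm that the file really is a Parquet file. The following is a minimal sketch in plain Python, not part of Semarchy xDI. It relies on the Parquet file layout, in which a file starts and ends with the 4-byte magic `PAR1` and stores a 4-byte little-endian footer length just before the trailing magic. The function name `looks_like_parquet` is hypothetical, and the check assumes a locally accessible, unencrypted file (it does not read from HDFS).

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the local file has the basic Parquet framing.

    Hypothetical helper: it only checks the magic bytes and the footer
    length; it does not parse or validate the schema.
    """
    size = os.path.getsize(path)
    # Smallest framing: leading magic (4) + footer length (4) + trailing magic (4).
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        footer_len_bytes = f.read(4)
        tail = f.read(4)
    if head != PARQUET_MAGIC or tail != PARQUET_MAGIC:
        return False
    (footer_len,) = struct.unpack("<I", footer_len_bytes)
    # The footer must fit between the leading magic and the last 8 bytes.
    return footer_len <= size - 12
```

A file that fails this check is unlikely to reverse-engineer correctly. Passing it does not guarantee that the schema is readable.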

Create mappings

Drag and drop the Apache Parquet files (the schema nodes) defined in your metadata into mappings.

The following mapping scenarios are supported:

  • Read Parquet on HDFS to Spark.

  • Write a Spark Dataset to Parquet on HDFS.

  • Export a Vertica table to Parquet files on HDFS or on another supported file system.

Examples

The following example shows how to load an Apache Parquet file located on HDFS into Spark:

[Image: mapping that loads a Parquet file from HDFS into Spark]
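For readers who think in Spark terms, this load roughly corresponds to a Parquet read through the Spark DataFrame API. The sketch below uses PySpark and is not the code that xDI generates. The function name `read_parquet_from_hdfs` and the HDFS path in the comments are assumptions made for illustration.

```python
def read_parquet_from_hdfs(spark, hdfs_path):
    """Load a Parquet file or directory into a Spark DataFrame.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    # DataFrameReader.parquet takes the column names and types from the
    # Parquet metadata, so no schema has to be declared here.
    return spark.read.parquet(hdfs_path)


# Usage, assuming PySpark is installed and HDFS is reachable
# (the path is a hypothetical placeholder):
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("parquet-read").getOrCreate()
#   df = read_parquet_from_hdfs(spark, "hdfs:///data/example.parquet")
#   df.printSchema()
```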

The following example shows how to export a Spark Dataset to an Apache Parquet file on HDFS:

[Image: mapping that exports a Spark Dataset to a Parquet file on HDFS]
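In Spark terms, this export roughly corresponds to a Parquet write through the DataFrame API. The following PySpark sketch illustrates the equivalent operation and is not the code that xDI generates. The function name `write_dataset_to_parquet` and the path in the comments are assumptions made for illustration.

```python
def write_dataset_to_parquet(df, hdfs_path, mode="errorifexists"):
    """Write a Spark DataFrame to Parquet at the given path.

    `df` is expected to be a pyspark.sql.DataFrame. `mode` is passed to
    DataFrameWriter.mode: "append", "overwrite", "ignore", or
    "errorifexists" (fail if the target already exists).
    """
    # Spark writes a directory of part files rather than a single file.
    df.write.mode(mode).parquet(hdfs_path)


# Usage, assuming `df` is an existing DataFrame
# (the path is a hypothetical placeholder):
#
#   write_dataset_to_parquet(df, "hdfs:///out/example.parquet", mode="overwrite")
```

The default of `"errorifexists"` is chosen here so that the sketch never overwrites existing data silently. Pass `"overwrite"` explicitly when replacing a previous export is intended.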