HBase importtsv tool

Overview

HBase ships with a command-line tool named 'importtsv' that loads data from HDFS into HBase tables efficiently.

When large volumes of data must be loaded into an HBase table, this tool can help optimize performance and resource usage.
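
For reference, a typical importtsv invocation looks like the following sketch. The table name, column mapping, and input path are illustrative placeholders, not values from your project:

    # Illustrative: load tab-separated files from HDFS into an HBase table.
    # 'my_table', the column list, and the HDFS path are placeholders.
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
      my_table /user/xdi/staging/my_table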

Semarchy xDI Templates make it possible to load data from any database into HBase using this tool, with little configuration required in the Metadata and Mapping.

Prerequisites

Using the importtsv tool with Semarchy xDI requires the resources listed below:

  • An HBase Metadata

  • An HDFS temporary folder to store intermediate data

  • An SSH Metadata that will be used to run the importtsv command on the remote Hadoop server

  • (Optional) The Kerberos keytab path on the remote Hadoop server, if the cluster is secured by Kerberos

The following sections assume that you already have these resources at your disposal.

Refer to Getting started with the Hadoop component for further information about HBase Metadata.

Configure the HBase Metadata for importtsv

HDFS temporary folder

As the purpose of the importtsv tool is to load data from HDFS into HBase, a temporary HDFS folder is needed to store the source data before it is loaded into the target table.

Simply drag and drop the HDFS folder Metadata Link you want to use as the temporary folder into the HBase Metadata.

Then, rename it to 'HDFS':

[Figure: importtsv tool HDFS temporary folder]

Refer to Getting started with the Hadoop component for further information about HDFS Metadata configuration.
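
At run time, the Templates stage the source data as delimited files in this temporary folder before invoking importtsv. Conceptually, this corresponds to commands such as the following, where the paths and file name are illustrative:

    # Illustrative: create the temporary folder and upload a delimited extract.
    hdfs dfs -mkdir -p /user/xdi/tmp/importtsv
    hdfs dfs -put customers.tsv /user/xdi/tmp/importtsv/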

Sqoop utility (optional)

By default, temporary data is sent to HDFS using the HDFS APIs.

If you prefer, you can optionally configure the data to be sent to HDFS through the Sqoop Hadoop utility instead.

To do so, drag and drop a Sqoop Metadata Link into the Metadata of the HDFS temporary folder you defined in the previous section.

Then, rename it to 'SQOOP':

[Figure: importtsv tool using Sqoop to load HDFS]

Refer to Getting started with the Hadoop component for further information about Sqoop Metadata configuration.
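
When the Sqoop Metadata Link is configured, the transfer to HDFS is conceptually similar to a Sqoop import such as the one sketched below. The connection string, credentials, table name, and target directory are illustrative:

    # Illustrative: Sqoop import from a relational source into the HDFS temporary folder.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username xdi \
      --table CUSTOMERS \
      --target-dir /user/xdi/tmp/importtsv \
      --fields-terminated-by '\t'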

Specify the remote server information

Specify an SSH connection

The command will be executed through SSH on the remote Hadoop server.

The HBase Metadata therefore requires connection information for this server.

Simply drag and drop an SSH Metadata Link containing the SSH connection information into the HBase Metadata.

Then, rename it to 'SSH':

[Figure: importtsv tool SSH link]
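
At run time, the Templates use this connection to launch importtsv on the remote server, along the lines of the following sketch. The host, user, and command arguments are illustrative:

    # Illustrative: remote execution of importtsv over SSH.
    ssh xdi_user@hadoop-edge-node \
      "hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
         -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1 \
         my_table /user/xdi/tmp/importtsv"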

At the moment, the Templates only support executing the command through SSH.

We are working on updating them with an alternative that can execute the command locally on the Runtime when required, without needing an SSH connection.

Specify the Kerberos keytab path (optional)

If the Hadoop cluster is secured with Kerberos, authentication must be performed on the server before executing the command.

As the command is started through SSH, you need to indicate where the keytab used to authenticate on the remote server is located.

To do this, simply specify the 'Kerberos Remote Keytab File Path' in the Kerberos Principal used by HBase.

[Figure: importtsv tool Kerberos additional setting]
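
With the keytab path configured, authentication can be performed on the remote server before the command runs, which is conceptually equivalent to the following. The keytab path and principal are illustrative:

    # Illustrative: authenticate with Kerberos before running importtsv.
    kinit -kt /etc/security/keytabs/hbase.user.keytab hbase_user@EXAMPLE.COM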

Refer to Getting started with Kerberos for more information.