Getting Started with HBase

Overview

This guide explains how to start working with HBase in Semarchy xDM Data Integration.

Connect to your Data

The first step to work with HBase in Semarchy xDM Data Integration is to create and configure the HBase Metadata.

Here is a summary of the main steps to follow:

  1. Create the HBase Metadata

  2. Configure the Metadata

  3. Configure Kerberos security (optional)

  4. Reverse the namespaces and tables

The following image gives a quick overview of these steps:

[Image: getting started hbase overview]

Create the Metadata

First, create the HBase Metadata as usual by selecting the HBase technology in the Metadata Creation Wizard.

Choose a name for this Metadata and go to the next step.

Configure the Metadata

Once the HBase Metadata is created, you can configure the server properties, which define how to connect to HBase.

[Image: getting started hbase server props]

The following properties are available:

Property: HBase Zookeeper Quorum
Mandatory: Yes
Description: Comma-separated list of the HBase servers in the Zookeeper quorum.
Example: quickstart.cloudera

Property: HBase Zookeeper Client Port
Mandatory: Yes
Description: Network port on which the client connects.
Example: 2181

Property: Hadoop Configuration Files
Mandatory: Recommended
Description: Hadoop stores information about service properties in configuration files such as core-site.xml and hbase-site.xml. These are XML files containing a list of properties and information about the Hadoop server. Depending on the environment, network, and distribution, these files might be required to contact and operate on HBase. You can therefore specify them in the Metadata, for instance to avoid network and connection issues. To do so, provide a comma-separated list of paths pointing to their location. The files must be reachable by the Runtime.
Example: D:\hadoop\core-site.xml,D:\hadoop\hbase-site.xml

Property: Number of rows to scan to find columns
Mandatory: No
Description: HBase doesn’t store Metadata about the columns that exist in a table. The existing columns must therefore be found by reading rows from the table. This property specifies the number of rows to scan to find the columns available to reverse.
Example: 100 (default value if not set)

Property: Namespace Filter
Mandatory: No
Description: Optional regular expression used to filter the namespaces to reverse. The '%' character can be used to define 'any character'.
Examples:
*
stb_hadoop_%
default

Property: Default DataType
Mandatory: No
Description: Default type to use when no type is defined on a column in the Metadata. Note that HBase stores everything as bytes and doesn’t have a notion of datatypes. The datatypes specified in Semarchy xDM Data Integration help define how to manipulate data when reading from HBase. See Datatypes Management for more information.
Example: string

Property: Default String Precision
Mandatory: No
Description: Default precision (size) to use for string columns. As with datatypes, this helps Semarchy xDM Data Integration manipulate data when reading from HBase.
Example: 255
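As a rough illustration of what the HBase Zookeeper Quorum and HBase Zookeeper Client Port values correspond to on the HBase side, the sketch below writes them into a minimal hbase-site.xml document. It assumes these two Metadata properties map to the standard HBase client settings hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort. That mapping and the helper name build_hbase_site are assumptions made for the example, not a description of what the Runtime does internally.

```python
import xml.etree.ElementTree as ET


def build_hbase_site(quorum: str, client_port: int) -> str:
    """Return a minimal hbase-site.xml document as a string.

    Hypothetical helper, for illustration only.
    """
    configuration = ET.Element("configuration")
    settings = {
        # Standard HBase client settings (assumed mapping, see lead-in).
        "hbase.zookeeper.quorum": quorum,
        "hbase.zookeeper.property.clientPort": str(client_port),
    }
    for name, value in settings.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


# Values taken from the examples in the property list above.
print(build_hbase_site("quickstart.cloudera", 2181))
```

A file of this shape could then be listed in the Hadoop Configuration Files property, provided the path is reachable by the Runtime.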

Configure the Kerberos Security (optional)

When working with Kerberos-secured Hadoop clusters, connections are protected. You therefore need to specify in Semarchy xDM Data Integration the credentials and other information required to establish the Kerberos connection.

A Kerberos Metadata is available to specify everything required.

  1. Create a new Kerberos Metadata (or use an existing one)

  2. Define in it the Kerberos Principal to use for HBase

  3. Drag and drop it in the HBase Metadata

  4. Rename the Metadata Link to 'KERBEROS'

The following image shows an example of these steps:

[Image: getting started hbase kerberos link]

Refer to Getting Started With Kerberos for more information.

Reverse the namespaces and tables

Your Metadata is now ready to begin reversing the existing namespaces and tables.

When performing the reverse, Semarchy xDM Data Integration connects to HBase, retrieves the namespaces and tables, with their structures, and creates the associated Metadata structure.

To reverse your items, right-click the desired node and choose Actions > Reverse [All]:

  • On the server node: all the namespaces and corresponding tables will be reversed

  • On a namespace node: all the tables of the namespace will be reversed

  • On a table node: only the structure of the selected table is reversed

The following image shows an example on the server node, to reverse everything:

[Image: getting started hbase reverse all]

You can use the "Namespace Filter" property of the server node if you want to filter the namespaces to reverse.
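To make the filtering concrete, here is a minimal sketch of how a pattern such as stb_hadoop_% might select namespaces. It assumes that '%' matches any run of characters and that the rest of the pattern is compared literally against the whole name. The exact matching rules are those of the product, and the helper matches_filter and the namespace names are hypothetical.

```python
import re


def matches_filter(name: str, pattern: str) -> bool:
    """Hypothetical matcher: '%' stands for any run of characters."""
    # Escape every literal part, then join the parts with the regex '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, name) is not None


# Made-up namespace names, for illustration only.
namespaces = ["default", "stb_hadoop_sales", "stb_hadoop_hr", "hbase"]

selected = [ns for ns in namespaces if matches_filter(ns, "stb_hadoop_%")]
print(selected)
```

Under these assumptions, only the two namespaces starting with stb_hadoop_ would be reversed, while a pattern without '%' such as default would select exactly one namespace.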

Metadata Additional Details

Namespace node

The namespace node is the container for tables, and physically exists on the HBase server.

You can use the right-click > Actions > Reverse menu to reverse the tables of the selected namespace.

[Image: getting started hbase namespace node example]

The following properties are available:

Property: Name
Description: Logical label used in Semarchy xDM Data Integration to identify the namespace node.
Example: namespace01

Property: Physical name
Description: Real name on the HBase server.
Example: namespace01

Property: Table Filter
Description: Optional regular expression used to filter the tables to reverse when performing a reverse on this namespace. The '%' character can be used to define 'any character'.
Examples:
*
table%
myTable_%

Table node

The table node represents an HBase table, which contains families, which in turn contain columns.

You can use the right-click > Actions > Reverse menu to automatically reverse the structure of the selected table.

[Image: getting started hbase table node example]

The following properties are available on a table node:

Property: Name
Description: Logical label used in Semarchy xDM Data Integration to identify the table node.
Example: table01

Property: Physical name
Description: Real name on the HBase server.
Example: table01

Family node

The following properties are available on a family:

Property: Name
Description: Logical label used in Semarchy xDM Data Integration to identify the family node.
Example: family01

Property: Physical name
Description: Real name on the HBase server.
Example: family01

Column node

Row Key

In HBase, every row is uniquely identified by a 'Row Key', which is a special field associated with each inserted row.

Semarchy xDM Data Integration represents it as a specific column, so that its value can be set when writing and retrieved when reading.

This special column is created automatically at reverse by Semarchy xDM Data Integration, and is represented with a little key icon.

[Image: getting started hbase row key node]

If needed, you can also create it manually: it is simply a column that must be named 'row_key' and on which the Advanced/Is Row Key property is checked.

Note that only one column should be set as Row Key. We advise letting Semarchy xDM Data Integration create it automatically at reverse to avoid any issue.

Column

In HBase, the columns contained in a row are not predefined.

This means that both the number of columns and the columns themselves can differ completely from one row to another.

When reversing the columns, Semarchy xDM Data Integration scans the number of rows specified on the server node to find all the columns present in these rows.

As a consequence, the reversed Metadata might not contain all the existing columns, depending on whether they were present in the scanned rows.

If needed, you can add new columns manually in the Metadata with the right-click > New Column menu on a family.

The columns are dynamically managed by HBase when inserting data.
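The effect of the scan limit can be sketched with plain Python dictionaries standing in for HBase rows. This is only a simplified model of the behavior described above, not the actual reverse implementation: the function discover_columns, the sample rows, and the "family:column" naming are assumptions made for the example.

```python
from typing import Dict, List, Set

# Each row maps "family:column" to a raw value; the names are made up.
Row = Dict[str, bytes]


def discover_columns(rows: List[Row], rows_to_scan: int) -> Set[str]:
    """Collect the column names seen in the first `rows_to_scan` rows."""
    found: Set[str] = set()
    for row in rows[:rows_to_scan]:
        found.update(row.keys())
    return found


table = [
    {"family01:column01": b"a"},
    {"family01:column01": b"b", "family01:column02": b"c"},
    {"family01:column03": b"d"},  # this column only appears in the third row
]

# Scanning two rows misses column03; scanning three rows finds it.
print(sorted(discover_columns(table, 2)))
print(sorted(discover_columns(table, 3)))
```

In this model, a column such as family01:column03 that is missed by the scan is exactly the kind of column you would add manually on the family.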

Property: Name
Description: Logical label used in Semarchy xDM Data Integration to identify the column node.
Example: column01

Property: Physical name
Description: Real name on the HBase server.
Example: column01

Property: Type
Description: Datatype representing the data contained in this column. Note that HBase stores everything as bytes and doesn’t have a notion of datatypes. The datatypes specified in Semarchy xDM Data Integration help define how to manipulate data when reading from HBase. Refer to Datatypes Management for more information. If not set, the 'Default DataType' specified on the server node is used.
Example: string

Property: Precision
Description: Precision (size) of the data contained in this column for the specified type.
Example: 255

Property: Scale
Description: Scale to use for the specified datatype. It represents the number of digits for numeric datatypes.
Example: 10

Datatypes Management

HBase stores everything as bytes in its storage system and it doesn’t have a notion of datatypes for the columns.

Semarchy xDM Data Integration nevertheless lets you specify datatypes on the columns, to ease data manipulation between different technologies.

This tells Semarchy xDM Data Integration what a column contains, so that it can treat the data correctly when reading from HBase.

It also helps define the correct SQL types and sizes in the temporary objects created and manipulated when loading data from HBase into other database systems such as Teradata, Oracle, or Microsoft SQL Server.
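The following sketch shows why a declared datatype matters when the storage only holds bytes: the same four bytes yield very different values depending on how they are read. The function decode_cell and the 'integer' type name are hypothetical, and the 4-byte big-endian integer layout is an assumption chosen for the example, not a statement about how Semarchy xDM Data Integration decodes values.

```python
import struct


def decode_cell(raw: bytes, datatype: str):
    """Interpret raw cell bytes according to a declared datatype.

    Hypothetical helper; the supported type names are illustrative.
    """
    if datatype == "string":
        return raw.decode("utf-8")
    if datatype == "integer":
        # Assumes a 4-byte big-endian signed integer.
        return struct.unpack(">i", raw)[0]
    raise ValueError(f"unsupported datatype: {datatype}")


stored = b"1234"  # four raw bytes, with no type attached

print(decode_cell(stored, "string"))   # 1234
print(decode_cell(stored, "integer"))  # 825373492
```

Nothing in the bytes themselves says which reading is the right one, which is the gap the Type property in the Metadata fills.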

HBase importtsv tool

HBase is usually shipped with a native command line tool called 'importtsv'.

This tool loads data from HDFS into HBase and is optimized for performance.

The Semarchy xDM Data Integration Templates offer the possibility to use this method to load data into HBase.

For more information, please refer to the HBase importtsv Tool page, which explains how to configure the Metadata and Mappings to use it.
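For orientation, the sketch below assembles the kind of command line the native tool expects when run by hand. It is not what the Templates generate: the helper importtsv_command, the table and column names, and the HDFS path are hypothetical. The ImportTsv class name and the importtsv.columns option with its HBASE_ROW_KEY marker come from the standard HBase distribution.

```python
import shlex
from typing import List


def importtsv_command(table: str, columns: List[str], hdfs_input: str) -> List[str]:
    """Build the argument list for HBase's ImportTsv tool (illustrative helper)."""
    # HBASE_ROW_KEY marks which input field supplies the row key.
    column_spec = ",".join(["HBASE_ROW_KEY"] + columns)
    return [
        "hbase",
        "org.apache.hadoop.hbase.mapreduce.ImportTsv",
        f"-Dimporttsv.columns={column_spec}",
        table,
        hdfs_input,
    ]


command = importtsv_command(
    "table01",
    ["family01:column01", "family01:column02"],
    "/tmp/table01.tsv",  # hypothetical HDFS path
)

# Print the command as a single shell-ready line.
print(shlex.join(command))
```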

Create your first Mappings

Now that your Metadata is ready and your tables are reversed, you can start creating your first Mappings.

For basic usage, HBase is no different from any other database you would use in Semarchy xDM Data Integration.

Drag and drop your source and target and map the columns as usual.

Example of a Mapping loading data from HSQL to HBase:

[Image: getting started hbase mapping a]

Example of a Mapping loading data from HBase to HSQL, with a filter on HBase:

[Image: getting started hbase mapping b]

Filters must use the HBase filter syntax, such as SingleColumnValueFilter ('family01', 'column02', = , 'binary:gibbs'). Refer to the HBase documentation and examples in the Demonstration Project for further information.
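A small helper can reduce quoting mistakes when writing such filter expressions by hand. The sketch below only builds the text of the expression shown above. The function name is hypothetical, and the list of compare operators is the one commonly documented for the HBase filter language, so check it against the HBase documentation for your version.

```python
# Compare operators of the HBase filter language (verify for your version).
COMPARE_OPERATORS = {"<", "<=", "=", "!=", ">", ">="}


def single_column_value_filter(family: str, column: str,
                               operator: str, comparator: str) -> str:
    """Return the text of a SingleColumnValueFilter expression.

    Hypothetical helper: it formats a string and does not contact HBase.
    """
    if operator not in COMPARE_OPERATORS:
        raise ValueError(f"unknown compare operator: {operator}")
    return (
        f"SingleColumnValueFilter ('{family}', '{column}', "
        f"{operator}, '{comparator}')"
    )


# Same filter as in the text: keep rows where family01:column02 equals 'gibbs'.
expression = single_column_value_filter("family01", "column02", "=", "binary:gibbs")
print(expression)
```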

Example of a join between two HBase tables to load a target HSQL table:

[Image: getting started hbase mapping c]

The join order defined with the Left Part / Right Part of the properties is important.

Notice the little triangle on the joined table link (table02) in the Mapping.

In this example, we retrieve all the table01 rows whose Row Key exists in table02.

You cannot use the columns of the joined table (table02) in the target table: the joined table is only used as a 'lookup' table.
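The join behavior described above can be modeled in a few lines, with dictionaries keyed by Row Key standing in for the two tables. This is a simplified illustration, not the code produced by the Mapping: the function lookup_join and the sample data are made up.

```python
from typing import Dict

# A table is modeled as {row_key: {"family:column": value}}.
Table = Dict[str, Dict[str, str]]


def lookup_join(main: Table, lookup: Table) -> Table:
    """Keep the rows of `main` whose Row Key also exists in `lookup`.

    Only the columns of `main` are returned; `lookup` is used for the
    existence test alone.
    """
    return {row_key: columns for row_key, columns in main.items()
            if row_key in lookup}


table01 = {
    "row1": {"family01:column01": "alice"},
    "row2": {"family01:column01": "bob"},
    "row3": {"family01:column01": "carol"},
}
table02 = {
    "row2": {"family01:column09": "ignored"},
    "row4": {"family01:column09": "ignored"},
}

result = lookup_join(table01, table02)
print(result)
```

Only row2 is kept, and none of the table02 values reach the result, which mirrors the 'lookup' role of the joined table.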

Sample Project

The Hadoop Component ships sample project(s) that contain various examples and use cases.

You can have a look at these projects to find examples showing how to use the Component.

Refer to Install Components to learn how to import sample projects.