
IBM Metastore aaS: No Lake Without Metadata


This article explores IBM Cloud's enhanced capabilities for building and managing cloud data lakes on IBM Cloud Object Storage.

Specifically, it explains the role of table metadata and how the IBM Cloud Data Engine service provides this crucial building block for your data lake.

It is no revelation that metadata is an important component to manage in data and analytics solutions. The topic is often linked with data governance, and rightly so: governance metadata ensures that data is discoverable, protected, and traceable through its lineage.

Metadata, however, goes beyond data governance. It also includes what is known as technical metadata: information about a dataset's schema, its data types, and statistics about the values in each column. Technical metadata is especially important for data lakes because, unlike an integrated repository such as an RDBMS, where technical metadata is built in, a data lake treats it as a separate component that must be explicitly set up and maintained.
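To make this concrete, here is a minimal, purely illustrative sketch of what technical metadata for one table might contain: a schema with column types plus simple per-column statistics. The table name, columns, and dictionary layout are hypothetical and do not reflect any real metastore format.

```python
# Illustrative sketch only: technical metadata for one small dataset,
# derived from the data itself. All names and the structure are
# hypothetical, not a real metastore's storage format.
rows = [
    {"country": "DE", "revenue": 120},
    {"country": "US", "revenue": 340},
    {"country": "DE", "revenue": None},
]

def column_stats(rows, column):
    """Compute the kind of per-column statistics a metastore tracks:
    inferred type, non-null count, and distinct-value count."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "type": type(values[0]).__name__ if values else "unknown",
        "non_null": len(values),
        "distinct": len(set(values)),
    }

technical_metadata = {
    "table": "my_customers",
    "columns": {c: column_stats(rows, c) for c in rows[0]},
}
print(technical_metadata)
```

Query optimizers use exactly this kind of information (types, cardinalities, null counts) to plan efficient scans and joins, which is why it must be maintained somewhere even when the data itself lives in plain files on object storage.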

This component is usually called the metastore or table catalog. It holds the technical details about your data that are needed to build and run analytical queries, particularly SQL statements.

With the growing adoption of data lakehouse technology, technical metadata is increasingly collocated and stored alongside the data itself in open table formats such as Iceberg and Delta Lake. This does not remove the need for a central, dedicated metastore component, however, because table formats can only manage metadata at the level of a single table. Data is usually spread across multiple tables in a more or less complex table schema, which may also include referential relationships between tables or logical data models known as views.
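The distinction can be sketched in a few lines: a table format can describe each table on its own, but only a central catalog can record objects that span tables, such as a view. The catalog structure, table names, and SQL below are invented for illustration.

```python
# Hypothetical sketch of a central table catalog. A table format like
# Iceberg or Delta Lake covers the per-table entries under "tables";
# only a catalog can also hold cross-table objects like the view below.
catalog = {
    "tables": {
        "orders": {"columns": {"order_id": "bigint",
                               "customer_id": "bigint",
                               "amount": "double"}},
        "customers": {"columns": {"customer_id": "bigint",
                                  "country": "string"}},
    },
    # A view is pure metadata: a named SQL definition over other tables.
    "views": {
        "revenue_by_country": (
            "SELECT c.country, SUM(o.amount) AS revenue "
            "FROM orders o JOIN customers c ON o.customer_id = c.customer_id "
            "GROUP BY c.country"
        )
    },
}

def referenced_tables(view_sql, catalog):
    """Naive scan for which catalog tables a view definition mentions."""
    return sorted(t for t in catalog["tables"] if t in view_sql)

print(referenced_tables(catalog["views"]["revenue_by_country"], catalog))
```

The view `revenue_by_country` references two tables at once; per-table metadata alone has no place to record such a definition, which is why the central catalog remains necessary.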

A metastore component or service is therefore essential in every data lake. The most widely used metastore interface is the Hive Metastore, which is supported by a broad range of big data processing engines and libraries. Despite its origins in the Hadoop ecosystem, it is no longer limited to or reliant on Hadoop and is often used in Hadoop-free environments, such as cloud-based data lake solutions.
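As one concrete illustration of this engine-agnostic interface: engines such as Spark typically attach to an external Hive Metastore through a small set of configuration keys pointing at its Thrift endpoint. The sketch below only assembles such a configuration; the endpoint URI is a placeholder, not a real service.

```python
# Sketch: the standard Spark configuration keys for binding to an
# external Hive Metastore. The Thrift URI is a placeholder.
hive_metastore_conf = {
    # Hive Metastore Thrift endpoint (passed through to Hadoop/Hive conf)
    "spark.hadoop.hive.metastore.uris": "thrift://metastore.example.com:9083",
    # Use the Hive catalog instead of Spark's default in-memory catalog
    "spark.sql.catalogImplementation": "hive",
}

def as_spark_submit_args(conf):
    """Render the settings as --conf flags for spark-submit."""
    return [f"--conf {k}={v}" for k, v in sorted(conf.items())]

for arg in as_spark_submit_args(hive_metastore_conf):
    print(arg)
```

Because many engines and libraries understand this same Thrift interface, a single metastore endpoint can serve as the shared catalog for heterogeneous compute runtimes.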

The metadata stored in a Hive Metastore is just as vital as the actual data in the data lake and should be treated with the same level of importance. It is therefore crucial to ensure the persistence and high availability of the metastore's metadata and to include it in any disaster recovery plan.

IBM launches IBM Cloud Data Engine

As part of our continuous efforts to enhance IBM Cloud’s native data lake capabilities, we introduced the IBM Cloud Data Engine in May 2022. This addition builds upon our existing serverless SQL processing service, previously referred to as IBM Cloud SQL Query, by incorporating a fully managed Hive Metastore functionality.

Each instance of IBM Cloud Data Engine is now a dedicated namespace and instance of a Hive Metastore, letting you manage, configure, and store table and data model metadata for all of your data lake data on IBM Cloud Object Storage. The Hive Metastore data is always available because it is built into the Data Engine service itself. The serverless model also applies to the Hive Metastore as a service: you are charged only for the requests you actually make, with no fixed cost for keeping a Data Engine instance and its metadata in the Hive Metastore.

This integration seamlessly incorporates the serverless SQL-based functions for data ingestion, data transformation, and analytical querying that IBM Cloud Data Engine inherits from the IBM Cloud SQL Query service.

In addition, Data Engine can now function as a Hive Metastore for other big data runtimes that are deployed and provisioned elsewhere. For example, you can connect the Spark runtime services in IBM Cloud Pak for Data with IBM Watson Studio, or in IBM Analytics Engine, to your Data Engine instance as the Hive Metastore that serves as the relational table catalog for your Spark SQL jobs. The diagram below provides a visual representation of this architecture.

Using Data Engine with Spark as a service in IBM Cloud

Using Data Engine as your table catalog is straightforward with the pre-existing Spark runtime services in IBM Cloud and IBM Cloud Pak for Data, because the necessary connectors to Data Engine's Hive Metastore are already integrated out of the box. The following PySpark code configures a SparkSession object to work with your specific instance of Data Engine:

from dataengine import SparkSessionWithDataengine

instancecrn = <your Data Engine instance ID>
apikey = <your API key to access your Data Engine instance>

session_builder = SparkSessionWithDataengine.enableDataengine(instancecrn, apikey)
spark = session_builder.appName("My Spark App").getOrCreate()

With the SparkSession object configured, you can proceed to use it as normal, such as retrieving a list of the currently defined tables and executing SQL statements that query these tables.

spark.sql('show tables').show()
spark.sql('select count(*), country from my_customers group by country').show()

IBM Metastore aaS: Using Data Engine with your custom Spark deployments

If you manage your own Spark runtimes, you can still use the same mechanisms outlined above. First, however, you must install the Data Engine connector libraries in your Spark environment.

Install the Data Engine SparkSession builder

  1. Download the jar file for the SparkSession builder and place it in a folder in the classpath of your Spark installation (normally you should use the folder “user-libs/spark2”).
  2. Download the Python library to a local directory on the machine of your Spark installation and install it with pip:
pip install --force-reinstall <download dir>/dataengine_spark-1.0.10-py3-none-any.whl

Install and activate the Data Engine Hive client library

  1. Download the Hive client library and store it in a directory on the machine where you run Spark.
  2. Specify that directory name as an additional parameter when building the SparkSession with Data Engine as the catalog:
session_builder = SparkSessionWithDataengine.enableDataengine(instancecrn, apikey, pathToHiveMetastoreJars=<directory name with hive client>)

For more details, consult the Hive Metastore documentation for Data Engine. The Data Engine demo notebook is also available for download and use in your own Jupyter notebook environment or within the Watson Studio notebook service in Cloud Pak for Data.

Chapter 10 of the notebook contains a comprehensive setup and usage demonstration for utilizing Spark with Hive Metastore in Data Engine. Additionally, a brief demo of this notebook can be found at the 14:35 minute mark in the previously mentioned demo video for the “Modernize your Big Data Analytics with Data Lakehouse in IBM Cloud” webinar.

Conclusion

This article describes the new Hive Metastore as a Service capability in IBM Cloud, which provides a central component for building modern data lakes in IBM Cloud without the need for Day 1 setup or Day 2 operational overhead. To get started, simply provision an IBM Cloud Object Storage instance for your data and a Data Engine instance for your metadata to create a serverless, cloud-native data lake. From there, you can begin ingesting, preparing, curating, and using your data with the Data Engine service itself, or with your custom Spark applications, Analytics Engine service, Spark runtimes in Watson Studio, or any other custom Spark runtime that is connected to the same data on Object Storage and the same metadata in Data Engine.


Here at CourseMonster, we know how hard it can be to find the right time and funds for training. We provide effective training programs that let you select the training option that best meets your company's needs.

For more information, please get in touch with one of our course advisers today or contact us at training@coursemonster.com