Posted by Marbenz Antonio on March 1, 2023
This article explains the role of table metadata and how the IBM Cloud Data Engine service provides this crucial component for your data lake.
It is no revelation that metadata is an important component to manage in data and analytics solutions. The topic is often linked with data governance, and rightly so: governance metadata ensures that data is easily discoverable, properly safeguarded, and traceable through its lineage.
Metadata, however, is about more than data governance. It also includes what is known as technical metadata: information about a dataset’s schema, its data types, and statistics about the values in each column. Technical metadata is especially important for data lakes because, unlike an integrated repository such as an RDBMS that has technical metadata built in, in a data lake it is a separate component that must be explicitly set up and maintained.
This component is usually called the metastore or table catalog. It holds the technical details about your data that are necessary to build and run analytical queries, particularly SQL statements.
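To make this concrete, the technical metadata a table catalog keeps per table can be pictured as a small record of schema, column types, and data location. The following self-contained Python sketch is purely conceptual: the field names and the toy in-memory "metastore" are illustrative, not the Hive Metastore's actual schema or API.

```python
from dataclasses import dataclass

@dataclass
class TableMetadata:
    # Simplified view of the technical metadata a metastore keeps per table.
    name: str
    columns: dict       # column name -> data type
    location: str       # object-storage prefix holding the data files
    data_format: str = "parquet"

# A toy in-memory "metastore". A real one (e.g. the Hive Metastore) persists
# this information in a database and exposes it over a Thrift API so that
# query engines can resolve table names into schemas and storage locations.
metastore: dict = {}

def register_table(t: TableMetadata) -> None:
    metastore[t.name] = t

register_table(TableMetadata(
    name="sales",
    columns={"order_id": "bigint", "amount": "double"},
    location="cos://us-geo/mybucket/sales/",
))

# A query engine planning `SELECT amount FROM sales` would first look the
# table up here to learn where the files live and how to interpret them.
print(metastore["sales"].location)
```

The key point of the sketch is the separation: the data files sit on object storage, while the catalog entry that makes them queryable lives in the metastore.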
The growing adoption of data lakehouse technology is driving parts of the technical metadata to be collocated with the data itself in open table formats such as Iceberg and Delta Lake. However, this does not remove the need for a central, dedicated metastore component, because table formats manage metadata only at the level of individual tables. Data is usually spread across multiple tables in a more or less complex schema, which may also include referential relationships between tables and logical data models in the form of views.
Every data lake therefore needs a metastore component or service. The most widely used metastore interface is the Hive Metastore, which is supported by a broad range of big data processing engines and libraries. Despite its origins in the Hadoop ecosystem, it is neither limited to nor dependent on Hadoop, and it is frequently used in Hadoop-free environments such as cloud-based data lake solutions.
The metadata stored in a Hive Metastore is just as vital as the actual data in the data lake and should be treated with the same level of importance. It is therefore crucial to ensure the persistence and high availability of the metastore’s metadata and to include it in any disaster recovery plan.
As part of our continuous efforts to enhance IBM Cloud’s native data lake capabilities, we introduced the IBM Cloud Data Engine in May 2022. This addition builds upon our existing serverless SQL processing service, previously referred to as IBM Cloud SQL Query, by incorporating a fully managed Hive Metastore functionality.
Each instance of IBM Cloud Data Engine is now a dedicated namespace and instance of a Hive Metastore, providing the ability to manage, configure, and store table and data model metadata for all of your data lake data on IBM Cloud Object Storage. You can be assured that the Hive Metastore data is always available, as it is integrated into the Data Engine service itself. The serverless model also applies to the Hive Metastore: you are only charged for the actual requests made, with no fixed costs for having a Data Engine instance with its own metadata in the Hive Metastore.
This integrates seamlessly with the serverless SQL-based functions for data ingestion, data transformation, and analytical querying that IBM Cloud Data Engine inherits from the IBM Cloud SQL Query service.
In addition, Data Engine can now function as a Hive Metastore for other big data runtimes that are deployed and provisioned elsewhere. For example, you can connect the Spark runtime services in IBM Cloud Pak for Data with IBM Watson Studio, or in IBM Analytics Engine, to your Data Engine instance, which then serves as the relational table catalog for your Spark SQL jobs. The diagram below provides a visual representation of this architecture.
Utilizing Data Engine as your table catalog is a straightforward process when leveraging the pre-existing Spark runtime services in IBM Cloud and IBM Cloud Pak for Data, as the necessary connectors to Data Engine’s Hive Metastore are already integrated out-of-the-box. The following PySpark code can be used to configure a SparkSession object to work with your specific instance of IBM Data Engine:
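A sketch of what such a configuration could look like is shown below. The metastore endpoint URI, the Hadoop configuration keys, and the CRN placeholder are assumptions for illustration; take the exact property names and values from the Data Engine Hive Metastore documentation for your region and instance.

```python
from pyspark.sql import SparkSession

# Illustrative placeholders -- replace with your own instance details.
DATA_ENGINE_CRN = "crn:v1:bluemix:public:sql-query:us-south:..."  # your instance CRN
IBM_CLOUD_API_KEY = "<your IBM Cloud API key>"

spark = (
    SparkSession.builder
    .appName("data-engine-hms-example")
    # Point the Hive client at the Data Engine metastore endpoint
    # (hostname and port here are assumed, not authoritative).
    .config("spark.hadoop.hive.metastore.uris",
            "thrift://catalog.us.dataengine.cloud.ibm.com:9083")
    # Authenticate with the instance CRN and an API key
    # (configuration key names are assumed; see the official docs).
    .config("spark.hadoop.hive.metastore.client.plain.username", DATA_ENGINE_CRN)
    .config("spark.hadoop.hive.metastore.client.plain.password", IBM_CLOUD_API_KEY)
    # Enable Hive catalog support so Spark SQL resolves tables via the metastore.
    .enableHiveSupport()
    .getOrCreate()
)
```

Because the connectors are pre-integrated in the IBM-managed Spark runtimes, no additional libraries need to be installed there; only the session configuration differs per Data Engine instance.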
With the SparkSession object configured, you can proceed to use it as normal, such as retrieving a list of the currently defined tables and executing SQL statements that query these tables.
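For example, the catalog operations are then plain Spark SQL against the `spark` session. This fragment assumes a SparkSession already configured against a Data Engine instance, and the table name is illustrative:

```python
# List the tables currently defined in the Data Engine table catalog.
spark.sql("SHOW TABLES").show()

# Query one of the catalog's tables with ordinary Spark SQL
# (the table name `customers` is a hypothetical example).
spark.sql("SELECT COUNT(*) AS row_count FROM customers").show()
```

From Spark's point of view, the Data Engine catalog behaves like any other Hive Metastore-backed catalog.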
If you are managing your own Spark runtimes, you can still utilize the same mechanisms outlined above. However, before proceeding, you must first establish the connector libraries for Data Engine within your Spark environment.
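In a self-managed runtime, one common way to make such libraries available is through the session's jar settings before the SparkSession is created. The jar path below is a placeholder, not a real artifact name; the actual connector libraries and their installation steps are described in the Data Engine documentation.

```python
from pyspark.sql import SparkSession

# In a self-managed Spark environment the Data Engine connector is not
# pre-installed, so its jars must be put on the classpath explicitly.
spark = (
    SparkSession.builder
    .config("spark.jars", "/opt/libs/dataengine-hive-connector.jar")  # assumed path
    .enableHiveSupport()
    .getOrCreate()
)
```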
For additional information, consult the Hive Metastore documentation for Data Engine. The Data Engine demo notebook is also available for download and use in your own Jupyter notebook environment or within the Watson Studio notebook service in Cloud Pak for Data.
Chapter 10 of the notebook contains a comprehensive setup and usage demonstration of Spark with the Hive Metastore in Data Engine. A brief walkthrough of this notebook can also be found at the 14:35 mark of the demo video for the “Modernize your Big Data Analytics with Data Lakehouse in IBM Cloud” webinar.
This article describes the new Hive Metastore as a Service capability in IBM Cloud, which provides a central component for building modern data lakes in IBM Cloud without the need for Day 1 setup or Day 2 operational overhead. To get started, simply provision an IBM Cloud Object Storage instance for your data and a Data Engine instance for your metadata to create a serverless, cloud-native data lake. From there, you can begin ingesting, preparing, curating, and using your data with the Data Engine service itself, or with your custom Spark applications, Analytics Engine service, Spark runtimes in Watson Studio, or any other custom Spark runtime that is connected to the same data on Object Storage and the same metadata in Data Engine.