IBM watsonx.data FAQs

The following IBM watsonx.data frequently asked questions and answers cover general and commonly needed installation, configuration, and replication information.

watsonx.data enables enterprises to seamlessly expand their analytics and AI capabilities by leveraging a purpose-built data store. This data store is built on an open lakehouse architecture, incorporating robust querying, governance, and open data formats to facilitate efficient data access and sharing.

Why your business needs watsonx.data

IBM watsonx.data is the next-generation data lakehouse, standing out as the industry’s only open, hybrid, and governed data repository. It enables you to leverage various query engines for analytics and AI tasks across multiple locations.

When integrated with IBM watsonx.data, the semantic layer will provide data enrichments that allow clients to interpret and navigate complex, structured data using natural language through semantic search. This innovation will accelerate data discovery and unlock insights more quickly—no SQL knowledge required.

Published on July 30, 2024

Read the following Pragma Edge IBM watsonx.data frequently asked questions and answers.

What is a data lakehouse?

A data lakehouse combines the best features of a data lake and a data warehouse. At its core, it enables a customer to do the following:

Store: Leverage object storage for disruptively affordable data storage
Format: Use open data and table formats to allow interoperability
Query: Use open source query engines to query and make sense of the data on demand
Governance: Share data with those who need it and have permission, not with everyone

Why did we choose the technologies we did as the foundation of watsonx.data?

Storage: S3 API-based object storage is the de facto standard for affordable storage.
- Public cloud: depending on the deployment, AWS S3 or IBM Cloud Object Storage (COS) can provide cloud-scale storage for watsonx.data at an affordable cost, with control over the data
- On-premises: IBM Storage Ceph can provide open, scalable, software-defined object storage. It is engineered for resiliency so there is no single point of failure, a critical differentiator, and it is optimized for enterprise data storage at petabyte scale.
Open Data and Table Formats:
- Table: Iceberg is the open table format with market-leading adoption. It brings ACID transaction support and massive scale and speed. Critically, after our architects' careful review of the market and comparisons (e.g. with Delta Lake and Hudi), it became clear that this is the direction the market and community are embracing first, so we will too!
- Data formats: Parquet and ORC are the most common and widely adopted open formats
Query: Multi-engine. There is NO one engine to rule them all, so IBM will be first to market with multi-engine lakehouse support built in!
- Presto (Ahana-enhanced): for BI purposes, with SQL and dozens of connectors, allowing for speed and efficiency. It was also chosen after our architects' careful review of the market and comparisons (e.g. with Trino) as a more open open-source engine with a promising future and an exciting roadmap we can collaborate on
- Spark: for AI/ML workloads to make unstructured data from the data lake useful
- IBM Data Warehouse engines: run the highly tuned IBM DW engines on your IBM data warehouse for the best tuning and high-speed responses when needed
Governance:
- Built in: the basics of Role-Based Access Control (RBAC)
- Watson Knowledge Catalog (WKC): highly customizable and powerful governance through our integration with WKC (sold separately)
Hybrid cloud, private cloud, and on-premises software deployments:
- SaaS on AWS and IBM Cloud
- On-premises or customer-managed environment on OpenShift Container Platform (OCP), as a cartridge on Cloud Pak for Data (CP4D)
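
To make the multi-engine idea above concrete, here is a minimal sketch of routing a workload to the engine role described in the Query list. It is only an illustration: the Workload categories, the engine labels, and the choose_engine function are hypothetical names invented for this example, not part of any watsonx.data API.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical workload categories, loosely following the list above."""
    BI_SQL = "bi_sql"              # dashboards, ad hoc SQL, connector-based queries
    AI_ML = "ai_ml"                # AI/ML processing over data lake data
    TUNED_WAREHOUSE = "warehouse"  # queries that need highly tuned, fast responses


# Hypothetical mapping that mirrors the engine roles in the FAQ answer.
# The string labels are illustrative only, not product identifiers.
ENGINE_FOR_WORKLOAD = {
    Workload.BI_SQL: "presto",
    Workload.AI_ML: "spark",
    Workload.TUNED_WAREHOUSE: "ibm_dw",
}


def choose_engine(workload: Workload) -> str:
    """Return the illustrative engine label suggested for a workload type."""
    return ENGINE_FOR_WORKLOAD[workload]


if __name__ == "__main__":
    for workload in Workload:
        print(f"{workload.name}: {choose_engine(workload)}")
```

In practice the choice would also depend on data size, latency needs, and cost, so a static lookup like this is only a starting point.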

What is the core value proposition messaging we should all be using, or is it just "cost optimization"?

Cost optimization: save 50% of costs by choosing the right engine for the right workload and dynamically pausing/resuming engines
Return on data: by enabling customers to get easy access to all their data, existing and new, across hybrid cloud environments, and by leveraging cost-efficient object storage.
Faster time to value: by reducing data movement and ETL, shortening data onboarding and pipelining lead times, and providing built-in security and governance through integration with data fabric.

What are key functionalities we should be showing customers?

Demonstrate Db2 Warehouse & Netezza workload offloading for customers modernizing their on-prem warehouse appliances to SaaS
Demonstrate data sharing between different engines for customers looking to enhance data lake analytics and achieve broader workload coverage

What is expected in GA?

The ability to quickly get started and start analyzing data
The ability to share/bring an S3 object storage bucket
The ability to query these with the Presto engine
The ability to create/start/stop multiple Presto engines
The ability to query these with Spark by connecting to an external Spark engine (e.g. IBM Analytics Engine or Apache Spark on Amazon EMR), with full inclusion of Spark internal to watsonx.data in a future release
The integration points (connectors) between Netezza and Db2 Warehouse in SaaS
Metadata sync between watsonx.data and Db2 Warehouse and Netezza
Caching powered by RaptorX
Deploy SaaS on AWS or IBM Cloud
Deploy software on-prem on OCP standalone or with CP4D

Does watsonx.data include licensing to use CP4D? OCP? Does it require them?

Yes. As a cartridge on CP4D, watsonx.data requires both CP4D and OCP underneath the install, and it includes limited licensing for each, specifically to deploy and use watsonx.data only. Separate entitlements to CP4D can be purchased to deploy and use other value-added CP4D services that work well alongside watsonx.data, such as Watson Knowledge Catalog (WKC) or IBM Analytics Engine (based on Apache Spark).

How do we distinguish the visionary demonstrations (like those at THINK) that indicate product direction from what is in the product at GA or at other given times?

Our demos at THINK are intended to show what will be available in the product by the end of the year. There are generally three levels of feature access:
1. Private Preview: special arrangements for select customers to access features
2. Open Preview: available for any customer to try, but not supported at production level
3. Generally Available (GA): fully supported and available to all to buy
We will use our roadmap on Seismic to identify major items demonstrated at THINK that are not yet available in GA and when they are targeted in our Continuous Delivery (CD) release cadence.
All of these features and timings should be considered directional, as per IBM disclosures, disclaimers and NDAs.

Can you list scenarios and use cases where clients would be interested in an on-premises lakehouse, given that factors like cost savings might not be significant on-premises?

For on-prem scenarios, it’s not only about costs; it’s also about regulations and the need to keep data on-prem. IBM is one of the few vendors (possibly the only one) that can provide a true hybrid cloud story in the lakehouse space today.
Modernization is important, irrespective of on-prem or cloud. The reality is that every client wants off Hadoop. Today, the leading option has been Databricks, which exists only on cloud. Many clients have PBs of data in their on-premises data lake and are unwilling to migrate it to cloud. Our lakehouse is designed to augment or modernize existing data lakes. We can simply point the lakehouse at an existing Hadoop cluster and start querying right away.

Does the watsonx.data lakehouse offer a connector for Cloudera?

Hive access will be available at GA.
HDFS support is in progress but will not land at GA.
Impala access is not currently on the roadmap.
HBase access is not currently on the roadmap.

What migration effort is required to move on-premises Db2 Warehouse, NPS, and Cloudera (Hive) deployments to watsonx.data?

The idea is not to migrate data. Since the GA of watsonx.data, Db2 Warehouse and Netezza also support the Hive Metastore, the Iceberg table format, and storage on S3 object storage.

How do I perform SQL DDL or DML on an Iceberg Table?

Iceberg is a table format, not a query engine, and thus does not have any SQL functionality by itself. To perform DML or DDL operations on Iceberg data, you need to use a supporting query engine such as Presto or Spark, both of which support them.
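
As a rough illustration of the answer above, the sketch below sends DDL, DML, and a query for an Iceberg table to an engine through a Python DB-API style cursor. The catalog, schema, and table names (iceberg_data, sales, orders) are hypothetical, as is the run_statements helper. The table properties follow the Presto Iceberg connector's CREATE TABLE syntax, so check them against the engine version you actually use. Connection setup is deliberately left out because it depends on your deployment.

```python
# Hypothetical names: catalog "iceberg_data", schema "sales", table "orders".
DDL = """
CREATE TABLE iceberg_data.sales.orders (
    order_id BIGINT,
    customer VARCHAR,
    amount DOUBLE,
    order_date DATE
)
WITH (format = 'PARQUET', partitioning = ARRAY['order_date'])
"""

DML = """
INSERT INTO iceberg_data.sales.orders
VALUES (1, 'acme', DOUBLE '125.5', DATE '2024-07-30')
"""

QUERY = """
SELECT customer, sum(amount) AS total
FROM iceberg_data.sales.orders
GROUP BY customer
"""


def run_statements(cursor, statements):
    """Execute each SQL statement in order on a DB-API style cursor.

    Results are fetched after every statement (an assumption that suits
    Presto-style clients, where results are consumed to complete a
    statement). The rows of the last statement are returned.
    """
    rows = []
    for sql in statements:
        cursor.execute(sql.strip())
        rows = cursor.fetchall()
    return rows
```

Usage would look like run_statements(cursor, [DDL, DML, QUERY]), where cursor comes from whichever DB-API client you use to connect to the engine. The engine does the SQL work; Iceberg only defines how the table's data and metadata are laid out in storage.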

Databricks uses the Delta Lake format. Will we support it in either watsonx.data or Db2 Warehouse/Netezza?

Delta Lake is being considered for the longer-term roadmap. Even though it is claimed to be open source, Delta Lake is decidedly “closed governance”: it is still a project dominated by Databricks, and there have been issues getting changes into projects driven by Databricks, Spark being one example.

Is there a plan to support other table formats or catalogs? Why did we choose Iceberg?

There are two other popular open table formats today: Delta Lake and Apache Hudi. Although Delta Lake is claimed to be open source, it is very “closed governance”, and Databricks controls much of it. Apache Hudi was also considered by development and might become supported down the line, depending on its usage by the community.
Others being considered include Informatica catalogs and Azure ADLS Gen 2, in accordance with market demand.

How will watsonx.data integrate with Identity and Access Management (IAM) solutions?

When deployed as SaaS, IBM Cloud IAM will be integrated.
When deployed as software, there are plans for integrations with underlying identity providers, with the option of more advanced data governance from Watson Knowledge Catalog (bought separately in CP4D) for customers that need it.
