The following IBM watsonx.data frequently asked questions and answers cover general topics along with commonly needed installation, configuration, and replication information.

watsonx.data enables enterprises to seamlessly expand their analytics and AI capabilities by leveraging a purpose-built data store. This data store is built on an open lakehouse architecture, incorporating robust querying, governance, and open data formats to facilitate efficient data access and sharing.

Why your business needs watsonx.data

IBM watsonx.data is the next-generation data lakehouse, standing out as the industry’s only open, hybrid, and governed data repository. It enables you to leverage various query engines for analytics and AI tasks across multiple locations.

When integrated with IBM watsonx.data, the semantic layer will provide data enrichments that allow clients to interpret and navigate complex, structured data using natural language through semantic search. This innovation will accelerate data discovery and unlock insights more quickly—no SQL knowledge required.

Read the following PragmaEdge IBM watsonx.data frequently asked questions and answers.

A data lakehouse combines the best features of a data lake and a data warehouse. At its core, it enables a customer to:

  1. Store: leverage object storage for low-cost data storage
  2. Format: use open data and table formats to allow interoperability
  3. Query: use open source query engines to query and make sense of the data on demand
  4. Governance: share data with those who need it and have permission, not with everyone

  1. Storage: S3 API-based object storage is the de facto standard for affordable storage.
    • Public cloud: depending on the deployment, AWS S3 or IBM Cloud Object Storage (COS) can provide cloud-scale storage for watsonx.data at an affordable cost, with control over the data.
    • On-premises: IBM Storage Ceph can provide open, scalable, software-defined object storage. It is engineered for resiliency with no single point of failure, a critical differentiator, and is optimized for enterprise data storage at petabyte scale.
  2. Open Data and Table Formats:
    • Table: Iceberg is the open table format with market-leading adoption. It brings ACID transaction support along with massive scale and speed. After our architects' careful review of the market and comparisons (e.g. to Delta Lake and Hudi), it became clear that this is the direction the market and community are embracing first, so we will too.
    • Data formats: Parquet and ORC are the most widely adopted open formats.
  3. Query: Multi-engine. There is no one engine to rule them all, so IBM will be first to market with multi-engine lakehouse support built in.
    • Presto (Ahana-enhanced): for BI purposes, with SQL and dozens of connectors, offering speed and efficiency. It was also chosen after our architects' careful review of the market and comparisons (e.g. to Trino) as a more open open-source engine with a promising future and a roadmap we can collaborate on.
    • Spark: for AI/ML workloads, to make unstructured data from the data lake useful.
    • IBM Data Warehouse engines: run the highly tuned IBM DW engines on your IBM data warehouse for the best tuning and high-speed responses when needed.
  4. Governance:
    • Built in: the basics of Role-Based Access Control (RBAC)
    • Watson Knowledge Catalog (WKC): highly customizable and powerful governance through our integration with WKC (sold separately)
  5. Hybrid cloud, private cloud, and on-premises software deployments:
    • SaaS on AWS and IBM Cloud
    • On-prem or customer-managed environment on OpenShift Container Platform (OCP), as a cartridge on Cloud Pak for Data (CP4D)
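
The multi-engine idea described above, where each workload goes to the engine best suited to it, can be illustrated with a minimal Python sketch. The workload categories, the engine labels and the `pick_engine` helper are hypothetical names chosen for illustration; they are not part of any watsonx.data API.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical workload categories, based on the engine roles listed above."""
    BI_SQL = "bi_sql"              # interactive BI and SQL queries
    AI_ML = "ai_ml"                # AI/ML processing of data lake data
    TUNED_WAREHOUSE = "tuned_dw"   # highly tuned, high-speed warehouse queries


# Illustrative mapping only: it mirrors the roles described in the text
# and is not an actual product configuration.
ENGINE_FOR_WORKLOAD = {
    Workload.BI_SQL: "presto",
    Workload.AI_ML: "spark",
    Workload.TUNED_WAREHOUSE: "ibm_dw",
}


def pick_engine(workload):
    """Return the engine label suggested for a given workload category."""
    return ENGINE_FOR_WORKLOAD[workload]


if __name__ == "__main__":
    for workload in Workload:
        print(workload.name, "->", pick_engine(workload))
```

In practice the choice would also depend on data size, latency needs and cost, so a static lookup like this is only a starting point.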
  1. Cost optimization: save 50% of costs by choosing the right engine for the right workload and dynamically pausing and resuming engines.
  2. Return on data: give customers easy access to all their data, existing and new, across hybrid cloud environments, while leveraging cost-efficient object storage.
  3. Faster time to value: reduce data movement and ETL as well as data onboarding and pipelining lead times, with built-in security and governance through integration with the data fabric.

  1. Demonstrate Db2 Warehouse and Netezza workload offloading for customers modernizing their on-prem warehouse appliances to SaaS
  2. Demonstrate data sharing between different engines for customers looking to enhance data lake analytics and achieve broader workload coverage

  1. The ability to quickly get started and start analyzing data
  2. The ability to share/bring an S3 object storage bucket
  3. The ability to query these with the Presto engine
  4. The ability to create/start/stop multiple Presto engines
  5. The ability to query these with the IBM Spark engine by connecting to an external Spark engine (e.g. IBM Analytics Engine or Apache Spark on Amazon EMR), with full inclusion of Spark internal to watsonx.data in a future release
  6. The integration points (connectors) between Netezza and Db2 Warehouse in SaaS
  7. Metadata sync between watsonx.data and Db2 Warehouse and Netezza
  8. Caching powered by RaptorX
  9. Deploy SaaS on AWS or IBM Cloud
  10. Deploy software on-prem on OCP standalone or with CP4D

Yes. As a cartridge on CP4D, watsonx.data requires both CP4D and OCP underneath the install and includes limited licensing for each, specifically to deploy and use watsonx.data only. Separate entitlements to CP4D can be purchased to deploy and use other value-added CP4D services that work well alongside watsonx.data, such as Watson Knowledge Catalog (WKC) or IBM Analytics Engine (based on Apache Spark).

  1. Our demos at Think are intended to show what will be available in the product by end of year. There are generally 3 levels of feature access:
    1. Private preview: special arrangements for select customers to access features
    2. Open preview: available for any customer to try, but not supported at production level
    3. Generally Available (GA): fully supported and available for all to buy
  2. We will use our roadmap on Seismic to identify major items demonstrated at Think that are not yet available in GA and when they are targeted in our Continuous Delivery (CD) release cadence.
  3. All of these features and timings should be considered directional, as per IBM disclosures, disclaimers and NDAs.

  1. For on-prem scenarios, it's not only about costs; it's also about regulations and the need to keep data on-prem. IBM is one of the few vendors, possibly the only one, that can provide a true hybrid cloud story in the lakehouse space today.
  2. Modernization is important, irrespective of on-prem or cloud. The reality is that every client wants off Hadoop. To date, the leading option has been Databricks, which exists only in the cloud. Many clients have petabytes of data in their on-premises data lake and are unwilling to migrate it to the cloud. Our lakehouse is designed to augment or modernize existing data lakes. We can simply point the lakehouse at an existing Hadoop cluster and start querying right away.

  1. Hive access will be available at GA.
  2. HDFS support is in progress but will not land at GA.
  3. Impala access is not currently on the roadmap.
  4. HBase access is not currently on the roadmap.

The idea is not to migrate data. Since the GA of watsonx.data, Db2 Warehouse and Netezza also support the Hive Metastore, the Iceberg table format, and storage on S3 object storage.

Iceberg is not a query engine and thus does not have any SQL functionality by itself. To perform DML or DDL operations on Iceberg data, you need to use a query engine that supports them, such as Presto or Spark.
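
Because the SQL comes from the engine rather than from Iceberg, DDL and DML against an Iceberg table are ordinary statements submitted to Presto or Spark. The Python sketch below only builds such statements as strings and does not connect to anything. The catalog, schema, table and column names are hypothetical, and the `format` table property is assumed to follow the Presto Iceberg connector's convention; check the exact syntax against the engine version you run.

```python
def sql_literal(value):
    """Render a Python int, float or str as a SQL literal (sketch only)."""
    if isinstance(value, str):
        # Escape embedded single quotes by doubling them.
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def create_table_sql(table_name, columns, file_format="PARQUET"):
    """Build a CREATE TABLE statement (DDL) for an Iceberg table.

    `columns` maps column names to SQL types. The WITH (format = ...)
    clause is an assumption about the engine's Iceberg table properties.
    """
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return f"CREATE TABLE {table_name} ({cols}) WITH (format = '{file_format}')"


def insert_sql(table_name, rows):
    """Build an INSERT statement (DML) from a list of row tuples."""
    values = ", ".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table_name} VALUES {values}"


if __name__ == "__main__":
    # Hypothetical three-part name: catalog.schema.table
    table = "iceberg_data.sales.orders"
    print(create_table_sql(table, {"id": "bigint", "item": "varchar"}))
    print(insert_sql(table, [(1, "pump"), (2, "valve")]))
```

String building like this is suitable only for illustration; for real workloads, use the parameter binding offered by the engine's client library rather than formatting values into SQL.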

Delta Lake is being considered for the longer-term roadmap. Although it is claimed to be open source, Delta Lake is decidedly “closed governance”: it is still a project dominated by Databricks, and there have been issues getting changes into projects owned by Databricks, Spark being one example.

  • There are 2 other popular open table formats today: Delta Lake and Apache Hudi. Although Delta Lake is claimed to be open source, it is very “closed governance”, and Databricks controls much of it. Apache Hudi was also considered by development and might become supported down the line, depending on its usage by the community.
  • Others being considered include Informatica catalogs, Azure ADLS Gen 2 and more, in accordance with market demand.

  • When deployed as SaaS, IBM Cloud IAM will be integrated.
  • When deployed as software, there are plans for integrations with underlying identity providers, with the option of more advanced data governance from Watson Knowledge Catalog (bought separately in CP4D) for customers that need it.
