Frequently Asked Questions

The following Hadoop frequently asked questions and answers provide general information about commonly needed installation, configuration, and replication topics.

Apache Hadoop

The emergence of Hadoop has changed the data landscape. With Hadoop, you can gain new or improved business insights from structured, unstructured, and semi-structured data sources. Large volumes of data that were stored historically, or that sit in siloed departments, can be gathered and analyzed in one place at an affordable price. Hadoop provides highly reliable, scalable, distributed processing of large data sets using simple programming models.

Big data and Hadoop projects depend on collecting, moving, transforming, cleansing, integrating, governing, exploring, and analyzing massive volumes of different types of data from many different sources. Accomplishing all this requires a resilient, end-to-end information integration solution that is massively scalable and provides the infrastructure, capabilities, processes, and discipline required to support Hadoop projects. 

Hadoop supports advanced analytics on stored data, such as predictive analysis, data mining, and machine learning (ML). It enables big data analytics processing to be split into smaller tasks. A programming model such as MapReduce distributes the small tasks across a Hadoop cluster (that is, a set of nodes that perform parallel computations on big data sets), where they run in parallel.
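
The split, parallel-run, and combine flow described above can be sketched in plain Python. This is only an illustration of the MapReduce pattern, not Hadoop's API: the function names (map_chunk, shuffle, reduce_counts, word_count) are hypothetical, and a local thread pool stands in for the nodes of a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by key across every chunk."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: collapse each key's list of values into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines, chunk_size=2, workers=4):
    """Split the input into small tasks, map them in parallel, then reduce."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # The thread pool is a stand-in for cluster nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = [
        "big data on Hadoop",
        "Hadoop splits big jobs",
        "small tasks run in parallel",
    ]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run on different machines and the shuffle moves data over the network. The sketch keeps everything in one process so that the three phases are easy to follow.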

The Hadoop ecosystem consists of four primary modules:

  • Hadoop Distributed File System (HDFS): Primary data storage system that manages large data sets running on commodity hardware. It also provides high-throughput data access and high fault tolerance.
  • Yet Another Resource Negotiator (YARN): Cluster resource manager that schedules tasks and allocates resources (e.g., CPU and memory) to applications.
  • Hadoop MapReduce: Splits big data processing tasks into smaller ones, distributes the small tasks across different nodes, then runs each task.
  • Hadoop Common (Hadoop Core): Set of common libraries and utilities that the other three modules depend on.

Although Hadoop can be difficult to manage at the higher levels, many graphical user interfaces (GUIs) are available that simplify MapReduce programming.

Hadoop is most effective for scenarios that involve the following:

  • Processing big data sets in environments where data size exceeds available memory
  • Batch processing with tasks that exploit disk read and write operations
  • Building data analysis infrastructure with a limited budget
  • Completing jobs that are not time-sensitive
  • Historical and archive data analysis

The Execution Engine for Apache Hadoop includes:

  • Services that establish secure connections between Watson Studio and Hadoop
  • Integration with Hadoop for Refinery and Notebook
  • A high availability configuration for the remote Hadoop system
  • Utilities that connect Watson Studio and Hadoop

The service requires a service user who has the necessary privileges to submit requests on behalf of Watson Studio users to WebHDFS, WebHCat, Spark, and YARN. The service generates a secure URL for each Watson Studio cluster that is integrated with the Hadoop cluster.
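
As an illustration of what "submitting requests on behalf of" a user can look like, the sketch below builds a WebHDFS REST URL that uses Hadoop's general proxy-user mechanism (the user.name and doas query parameters). It does not show how the Execution Engine itself issues requests. The host name, port, user names, and the helper webhdfs_url are hypothetical, and the sketch assumes simple (non-Kerberos) authentication.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, service_user, end_user):
    """Build a WebHDFS URL in which service_user acts on behalf of end_user."""
    query = urlencode({
        "op": op,
        "user.name": service_user,  # identity that makes the request
        "doas": end_user,           # user on whose behalf the request runs
    })
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # All values below are placeholders chosen for this example.
    print(webhdfs_url("namenode.example.com", 9870, "/user/alice/data",
                      "LISTSTATUS", "svc_user", "alice"))
```

For a request like this to be accepted, the Hadoop administrator must allow the service user to impersonate other users, typically through the hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups properties in core-site.xml.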

Execution Engine for Apache Hadoop environments are not available by default. An administrator must install the Execution Engine for Apache Hadoop service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

Hadoop platforms comprise two primary components: a distributed, fault-tolerant file system called the Hadoop Distributed File System (HDFS), and a parallel processing framework called MapReduce.

HDFS is well suited to large sequential operations, where a “slice” of data read is often 64 MB or 128 MB. Generally, HDFS files are not partitioned or ordered unless the application that loads the data manages this. Even if the application can partition and order the resulting data slices, there is no way to guarantee where a given slice will be placed in HDFS. This means there is no good way to manage data collocation in this environment. Data collocation is critical because it ensures that data with the same join keys ends up on the same node, which makes join processing both high-performing and accurate.
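
To make the idea of collocation concrete, the following sketch shows hash partitioning on a join key, which is one common way for a system that controls data placement to keep matching rows together. It is a minimal in-memory illustration, not something HDFS does: the function names (node_for_key, partition, local_join) and the sample data are hypothetical, and the "nodes" are just lists.

```python
import zlib
from collections import defaultdict


def node_for_key(join_key, num_nodes):
    """Map a join key to a node index with a deterministic hash."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(str(join_key).encode("utf-8")) % num_nodes


def partition(rows, key_field, num_nodes):
    """Group rows by target node so that equal join keys are collocated."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for_key(row[key_field], num_nodes)].append(row)
    return nodes


def local_join(left_rows, right_rows, key_field, num_nodes):
    """Join two data sets node by node; no row has to leave its node."""
    left = partition(left_rows, key_field, num_nodes)
    right = partition(right_rows, key_field, num_nodes)
    joined = []
    for node in range(num_nodes):
        # Build a per-node lookup of right-side rows by join key.
        lookup = defaultdict(list)
        for row in right.get(node, []):
            lookup[row[key_field]].append(row)
        for row in left.get(node, []):
            for match in lookup.get(row[key_field], []):
                joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    orders = [{"cust": 1, "order": "A"}, {"cust": 2, "order": "B"}]
    customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]
    print(local_join(orders, customers, "cust", num_nodes=3))
```

Because both data sets use the same hash function, every pair of rows with equal keys lands on the same node, so each node can complete its part of the join without exchanging data with the others. Without that guarantee, rows must be moved between nodes before they can be joined.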

Big data integration on Hadoop requires the following components:

  • A Hadoop distribution
  • A shared-nothing, massively scalable ETL platform (such as the one offered by IBM InfoSphere Information Server)
  • ETL pushdown capability into MapReduce

All of these components are required because a large percentage of data integration logic cannot be pushed into MapReduce without hand coding, and because MapReduce has known performance limitations.