Enabling Hadoop and Spark

Introduction

The DataScience.com Platform provides seamless integration with Apache Hadoop, Hive, and Spark. The Platform connects to your data where it lives, so there is no need to move data or add or replace costly infrastructure. To enable Hadoop, Hive, and Spark on your instance, follow a two-step process: (i) configure your Hadoop cluster and (ii) build your Hadoop-enabled environments.

Hadoop Cluster Configuration

To configure your cluster, navigate to Administration in the top menu bar and click on the Hadoop Cluster tab. Select your Hadoop provider from the dropdown list to begin.

../_images/Configure-hadoop-cluster.png

Fill in the form with the cluster name, provider version, and additional security information. Then, choose to enable Hadoop, Hive, and/or Spark by checking the respective boxes. When you choose to enable a certain framework, you will be prompted to add additional information in the form of configuration files that you will upload into the form. In this next section, you will learn how to locate and acquire these files.

Warning

Please gather configuration materials from edge nodes or client machines that have successfully connected to the cluster. Cluster server nodes may not have the proper configurations.

Enabling Hadoop (Optional)

Optional Files

core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, hadoop-env.sh

These files can be found in HADOOP_CONF_DIR, which is usually a symbolic link to /etc/hadoop/conf, although different Hadoop distributions place the files in different locations. Another common location for the actual files is under the installation directory at HADOOP_HOME/etc/hadoop. If the files are empty, or only copies with a .template suffix are present, you do not need to upload them.
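As a rough illustration of the guidance above, the Python sketch below checks an edge node for the optional Hadoop files before you upload them. It is not part of the Platform; the default path and the .template rule simply mirror the description above, so adjust it to your own layout:

    # Illustrative sketch: report which optional Hadoop files are present
    # (and non-empty) in the configuration directory described above.
    import os

    conf_dir = os.environ.get("HADOOP_CONF_DIR", "/etc/hadoop/conf")
    candidates = ["core-site.xml", "hdfs-site.xml", "yarn-site.xml",
                  "mapred-site.xml", "hadoop-env.sh"]

    for name in candidates:
        path = os.path.join(conf_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            print(f"upload: {path}")
        elif os.path.isfile(path + ".template"):
            print(f"skip (template only): {path}.template")
        else:
            print(f"skip (missing or empty): {path}")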

For more information, see Apache Hadoop’s documentation on Cluster Setup.

Note

Please note that some of the configuration options in portions of this documentation refer to services such as the HDFS NameNode and YARN NodeManager; these options will not affect the operation of the cluster, since those services do not run inside a DataScience.com Hadoop-enabled environment. Instead, they remain on the external Hadoop cluster that the environment connects to.

We do not yet override any Hadoop settings in the files that you upload, but we may override settings in the files you upload for other services to ensure your clients can connect from inside the DataScience.com Hadoop-enabled environment. These settings are outlined below.

Enabling Hive (Optional)

Required Files

hive-site.xml

Optional Files

hive-env.sh

These files can be found in HIVE_CONF_DIR, which is usually a symbolic link to /etc/hive/conf, although different Hadoop distributions place the files in different locations. Another common location for the actual files is under the installation directory at HIVE_HOME/conf.

For more information, see Apache Hive’s documentation on Configuring Hive.

Note

Please note that hive-site.xml may contain the Hive Metastore password if you are connecting to Hive without any additional security authentication mechanism in place, or if you take the file from the server hosting HiveServer(2).
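If you want to check for this before uploading, a minimal Python sketch such as the following can scan hive-site.xml for the password. The property name used here, javax.jdo.option.ConnectionPassword, is the usual JDBC password key and is an assumption about your setup; confirm it against your own cluster:

    # Illustrative sketch: warn if hive-site.xml appears to carry the
    # Hive Metastore password before it is uploaded.
    import os
    import xml.etree.ElementTree as ET

    conf_dir = os.environ.get("HIVE_CONF_DIR", "/etc/hive/conf")
    site = os.path.join(conf_dir, "hive-site.xml")

    root = ET.parse(site).getroot()
    for prop in root.findall("property"):
        # Assumed property name; adjust if your metastore uses another key.
        if prop.findtext("name", default="") == "javax.jdo.option.ConnectionPassword":
            print("hive-site.xml contains the metastore password; handle the upload accordingly.")
            break
    else:
        print("No metastore password found in", site)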

We may override the following properties in hive-site.xml:

  • hive.execution.engine
  • hive.metastore.schema.verification
  • hive.metastore.sasl.enabled
  • hive.exec.scratchdir

Tez

If Tez is enabled, we don’t currently support uploading Tez-specific configuration files. Instead, at runtime we inject the appropriate configuration and properties to make sure the Tez jars are available on the HADOOP_CLASSPATH set in hadoop-env.sh, and that the hive.execution.engine is set properly in hive-site.xml.

Enabling Spark

Required Files

spark-defaults.conf, spark-env.sh

These files can be found in $SPARK_HOME/conf, which is usually a symbolic link to /etc/spark/conf, although different Hadoop distributions, or an end-user installation of Spark, may not have set this link up. The files above may not actually be present in the configuration directory, and only copies with the .template suffix may exist; in that case, it is not necessary to upload them to the Environments UI.

For more information, see Apache Spark’s documentation on Spark Properties and Environment Variables.

We override the following values in spark-defaults.conf:

  • spark.driver.extraJavaOptions
  • spark.executor.extraJavaOptions
  • spark.blockManager.port
  • spark.driver.port
  • spark.driver.blockManager.port
  • spark.driver.bindAddress
  • spark.driver.host
  • spark.sql.warehouse.dir

Warning

Since Hive and Spark are on different development cycles, when Spark integrates with Hive on a Hadoop cluster, it commonly uses a different configuration and is even packaged with a different version of Hive than is installed on the cluster. If that is the case, it is necessary to upload a hive-site.xml specifically for Spark. This should be found in $SPARK_HOME/conf as well.
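A quick way to decide is to compare the two files. The Python sketch below assumes the default locations described above (it is not part of the Platform) and reports whether Spark carries its own hive-site.xml that differs from the cluster copy, in which case the Spark-specific file is the one to upload:

    # Illustrative sketch: check whether Spark ships a hive-site.xml that
    # differs from the cluster's copy.
    import filecmp
    import os

    spark_conf = os.path.join(os.environ.get("SPARK_HOME", "/etc/spark"), "conf")
    hive_conf = os.environ.get("HIVE_CONF_DIR", "/etc/hive/conf")

    spark_hive_site = os.path.join(spark_conf, "hive-site.xml")
    cluster_hive_site = os.path.join(hive_conf, "hive-site.xml")

    if not os.path.isfile(spark_hive_site):
        print("Spark has no hive-site.xml of its own.")
    elif os.path.isfile(cluster_hive_site) and filecmp.cmp(spark_hive_site, cluster_hive_site, shallow=False):
        print("Spark uses the same hive-site.xml as the cluster.")
    else:
        print("Upload the Spark-specific hive-site.xml:", spark_hive_site)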

Building a Hadoop-Enabled Environment

Build a Hadoop-Enabled Environment

Building a Hadoop-enabled environment is similar to building any Base environment, except all of the Dockerfile commands for installation are form-field driven. Simply navigate to the Environments screen, select Add Environment > Base Environment, and choose the Install Hadoop Dependencies option. Once selected, choose your Hadoop distribution as your provider, select your version, and build.

../_images/Build-a-hadoop-enabled-environment.png

Currently, the DataScience.com Platform supports MapR versions 5.2.1 and 5.2.2 as well as Cloudera version 5.10.

For more information about building environments, see our Environment Management documentation.

After you have created an available Hadoop-enabled Base environment, create a User environment from this new Base environment to enable your users to connect to the cluster.
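As a hedged example of a first connectivity check, a user could run something like the following from an interactive session in the Hadoop-enabled User environment. It assumes Spark and Hive were enabled for the environment and that PySpark is available; the properties read back at the end are among those the Platform manages in spark-defaults.conf:

    # Illustrative smoke test from inside a Hadoop-enabled session.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cluster-connectivity-check")
             .enableHiveSupport()      # uses the uploaded hive-site.xml
             .getOrCreate())

    # List Hive databases through the configured metastore.
    spark.sql("SHOW DATABASES").show()

    # Inspect a couple of the properties managed in spark-defaults.conf.
    print(spark.conf.get("spark.sql.warehouse.dir"))
    print(spark.sparkContext.getConf().get("spark.driver.host", "unset"))

    spark.stop()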

MapR Ticketing

The DataScience.com Platform supports the use of user-level MapR Tickets to authenticate against your MapR cluster. There is full support for MapR Ticket authentication across the Platform, including interactive sessions and ad hoc and scheduled runs. To enable MapR Ticket authentication, select it as the Security option in the Cluster Configuration setup. Any Users who want to authenticate to the cluster will need to upload their personal MapR Ticket to the Platform. See the Account Setup documentation for more information.

Kerberos Authentication

The DataScience.com Platform supports Kerberos authentication to your Hadoop cluster for User workloads across the Platform including interactive sessions and ad hoc and scheduled runs. To enable Kerberos authentication, select it as the Security option in the Cluster Configuration setup. Any Users who want to authenticate to the cluster will need to upload their personal Kerberos credentials to the Platform. See the Account Setup documentation for more information.

Other Providers

Coming soon! First-class support for EMR and Hortonworks is in development. Don’t see your provider? Contact success@datascience.com.