Enabling Hadoop and Spark¶
The DataScience.com Platform provides seamless integration for Apache Hadoop, Hive, and Spark. The Platform will connect to your data where it lives, so there is no need to move data or add/replace costly infrastructure. To enable Hadoop, Hive, and Spark on your instance, you will need to follow a two-step process: (i) configure your Hadoop cluster and (ii) build your Hadoop-enabled environments.
Hadoop Cluster Configuration¶
To configure your cluster, navigate to Administration in the top menu bar and click on the Hadoop Cluster tab. Select your Hadoop provider from the dropdown list to begin.
Fill in the form with the cluster name, provider version, and additional security information. Then, choose to enable Hadoop, Hive, and/or Spark by checking the respective boxes. When you choose to enable a certain framework, you will be prompted to add additional information in the form of configuration files that you will upload into the form. In this next section, you will learn how to locate and acquire these files.
Please gather configuration materials from edge nodes or client machines that have successfully connected to the cluster. Cluster server nodes may not have the proper client configurations.
Enabling Hadoop (Optional)¶
core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, hadoop-env.sh
These files can be found in HADOOP_CONF_DIR, which is usually symbolically linked to /etc/hadoop/conf, although different distributions of Hadoop lay them down in different places. Another common location for the actual files is under the installation directory. If the files are empty, or only files with a .template suffix are present, it is not necessary to upload them.
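The collection step above can be sketched as a small shell script run on an edge node. The default configuration path, the output directory name, and the skip-if-empty rule are taken from this section; adjust the path for your distribution:

```shell
# Gather Hadoop client configs for upload. /etc/hadoop/conf is only the
# typical default for HADOOP_CONF_DIR; your distribution may differ.
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"
mkdir -p ./hadoop-configs
for f in core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml hadoop-env.sh; do
  # -s: file exists and is non-empty; empty files need not be uploaded.
  if [ -s "$HADOOP_CONF_DIR/$f" ]; then
    cp "$HADOOP_CONF_DIR/$f" ./hadoop-configs/
  else
    echo "skipping $f (missing or empty)"
  fi
done
```

The resulting ./hadoop-configs directory holds only the files worth uploading into the form.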
For more information, see Apache Hadoop’s documentation on Cluster Setup.
Please note that some configuration options in that documentation refer to services like the HDFS NameNode and YARN NodeManager; those options will not affect the operation of the cluster, since these services do not run inside a DataScience.com Hadoop-enabled environment. Instead, they remain on the external Hadoop cluster that the environment connects to.
We do not currently override any Hadoop settings in the files you upload. For other services, however, we may override settings in the uploaded files to ensure your clients can connect from inside the DataScience.com Hadoop-enabled environment. Those overrides are outlined below.
Enabling Hive (Optional)¶
These files can be found in HIVE_CONF_DIR. It is usually a symbolic link, but different distributions of Hadoop lay the files down in different places. Another common location for the actual files is under the installation directory.
For more information, see Apache Hive’s documentation on Configuring Hive.
Please note that hive-site.xml may have the Hive Metastore password present if you are connecting to Hive without any additional security authentication mechanism in place, or if you take the file from the server hosting HiveServer(2).
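A quick pre-upload check for an embedded metastore password might look like the sketch below. The property name javax.jdo.option.ConnectionPassword is the standard Hive one; the file location is an assumption, so point it at your copy:

```shell
# Warn if hive-site.xml embeds the metastore password in plain text.
# HIVE_SITE defaults to a file in the current directory (an assumption).
HIVE_SITE="${HIVE_SITE:-hive-site.xml}"
if [ -f "$HIVE_SITE" ] && grep -q "javax.jdo.option.ConnectionPassword" "$HIVE_SITE"; then
  echo "warning: $HIVE_SITE contains the metastore password"
fi
```

If the warning fires, decide whether your security setup requires the password to remain in the uploaded file.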
We may override certain properties in hive-site.xml to ensure your clients can connect from inside the environment.
If Tez is enabled, we don't currently support uploading Tez-specific configuration files. Instead, at runtime we inject the appropriate configuration and properties to make sure the Tez jars are available on the HADOOP_CLASSPATH set in hadoop-env.sh, and that hive.execution.engine is set properly in hive-site.xml.
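Conceptually, the injected classpath change amounts to something like the following sketch. The /opt/tez location is purely an assumption for illustration; the platform determines the real paths at runtime:

```shell
# Sketch of making Tez jars visible to Hadoop clients, as would be
# appended to hadoop-env.sh. TEZ_HOME=/opt/tez is an assumed location.
export TEZ_HOME="${TEZ_HOME:-/opt/tez}"
# Prepend any existing classpath, then add the Tez jar directories.
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$TEZ_HOME/*:$TEZ_HOME/lib/*"
echo "$HADOOP_CLASSPATH"
```

Because the platform handles this injection for you, the sketch is only meant to show what ends up on the classpath.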
Enabling Spark (Optional)¶
These files can be found in $SPARK_HOME/conf, which is usually symbolically linked to /etc/spark/conf, but different distributions of Hadoop, or an end-user installation of Spark, may or may not have set this up. The files below may not actually be present in the configuration directory, and only files with the .template suffix may exist. In that case, it is not necessary to upload them to the Environments UI.
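Gathering the Spark client files while skipping .template-only copies could look like this sketch. The $SPARK_HOME fallback and the specific file names are assumptions, so substitute the files listed for your installation:

```shell
# Collect Spark client configs for upload, ignoring files that exist
# only as .template copies. /usr/lib/spark is an assumed fallback.
SPARK_CONF="${SPARK_HOME:-/usr/lib/spark}/conf"
mkdir -p ./spark-configs
for f in spark-defaults.conf spark-env.sh; do
  # Only real (non-.template) files need to be uploaded.
  if [ -f "$SPARK_CONF/$f" ]; then
    cp "$SPARK_CONF/$f" ./spark-configs/
  fi
done
ls ./spark-configs
```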
We may override some of the values in the files you upload to ensure your clients can connect from inside the environment.
Since Hive and Spark are on different development cycles, when Spark integrates with Hive on a Hadoop cluster, it commonly uses a different configuration and is even packaged with a different version of Hive than is installed on the cluster. If that is the case, it is necessary to upload a hive-site.xml configured specifically for Spark. This should be found in $SPARK_HOME/conf as well.
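One way to spot a Spark-specific Hive configuration is to compare the two copies, as in this hedged sketch. Both default paths (/usr/lib/spark and /etc/hive/conf) are assumptions about typical layouts:

```shell
# Compare Spark's bundled hive-site.xml against the cluster's copy.
# Both fallback paths below are assumed typical locations; adjust them.
SPARK_HIVE="${SPARK_HOME:-/usr/lib/spark}/conf/hive-site.xml"
CLUSTER_HIVE="${HIVE_CONF_DIR:-/etc/hive/conf}/hive-site.xml"
if [ -f "$SPARK_HIVE" ] && [ -f "$CLUSTER_HIVE" ]; then
  if ! diff -q "$SPARK_HIVE" "$CLUSTER_HIVE"; then
    echo "configs differ: upload the Spark copy separately"
  fi
fi
```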
Building a Hadoop-Enabled Environment¶
Building a Hadoop-enabled environment is similar to building any Base environment, except all of the Dockerfile commands for installation are form-field driven. Simply navigate to the Environments screen, select Add Environment > Base Environment, and choose the Install Hadoop Dependencies option. Once selected, choose your Hadoop distribution as your provider, select your version, and build.
Currently, the DataScience.com Platform supports MapR versions 5.2.1 and 5.2.2 as well as Cloudera version 5.10.
For more information about building environments, see our Environment Management documentation.
After you have created an available Hadoop-enabled Base environment, create a User environment from this new Base environment to enable your users to connect to the cluster.
The DataScience.com Platform supports the use of user-level MapR Tickets to authenticate against your MapR cluster. There is full support for MapR Ticket authentication across the Platform, including interactive sessions and ad hoc and scheduled runs. To enable MapR Ticket authentication, select it as the Security option in the Cluster Configuration setup. Any Users who want to authenticate to the cluster will need to upload their personal MapR Ticket to the Platform. See the Account Setup documentation for more information.
The DataScience.com Platform supports Kerberos authentication to your Hadoop cluster for User workloads across the Platform including interactive sessions and ad hoc and scheduled runs. To enable Kerberos authentication, select it as the Security option in the Cluster Configuration setup. Any Users who want to authenticate to the cluster will need to upload their personal Kerberos credentials to the Platform. See the Account Setup documentation for more information.