Using Spark on the Platform

The DataScience.com Platform includes an Apache Spark integration. Users can connect to a pre-configured remote Spark cluster from Jupyter, RStudio, and Zeppelin.

Warning

Cloudera Spark 1.6 does not support Python 3.6.

Jupyter

Spark integration for Jupyter is provided via the following Sparkmagic kernels:

  • Spark (Scala)
  • PySpark (Python 2)
  • PySpark3 (Python 3)
  • SparkR (R)

DataScience.com Platform Jupyter notebooks connect to Spark using Livy, an open source REST interface for interacting with Apache Spark. In our integration, Livy runs in embedded mode and submits applications to Spark in cluster mode, which means that each Spark session's driver runs on the remote Spark cluster.
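
As a result, code in a notebook cell is shipped to the cluster through Livy and executed there rather than in the notebook itself. A minimal sketch for one of the PySpark kernels (Sparkmagic sessions typically pre-create a SparkContext named sc):

    # This cell executes on the remote Spark cluster, not locally;
    # `sc` is the SparkContext created by the Sparkmagic session.
    rdd = sc.parallelize(range(1000))
    print(rdd.sum())  # the computation runs on the cluster's executors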

RStudio

Spark integration for RStudio is provided via the sparklyr or SparkR library.

Zeppelin

Spark support is built into Zeppelin via the Spark interpreter. See the Zeppelin documentation on this interpreter for more information.
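
As an illustrative sketch, a Zeppelin paragraph selects its interpreter with a directive on the first line; %pyspark binds the paragraph to the Python flavor of the Spark interpreter, which provides a SparkContext as sc:

    %pyspark
    # Runs in Zeppelin's Spark interpreter; `sc` is provided by Zeppelin.
    rdd = sc.parallelize(range(100))
    print(rdd.count())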

Note

For workflows that expect low latency (e.g. deployed APIs or Shiny applications), we do not recommend querying directly against your data source with Spark, as the query response times can be significant. Instead, you could use Spark within an Interactive Session to write pre-queried data to another location for use in a deployed API or Shiny application.
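
A minimal sketch of this pattern in a PySpark session follows (it assumes the sqlContext exposed by a Sparkmagic session; the table name and output path are hypothetical):

    # Inside an Interactive Session: run the expensive Spark query once...
    df = sqlContext.sql("SELECT user_id, score FROM events")  # hypothetical table
    # ...and persist the result to a cheaper-to-read location, e.g. Parquet.
    df.write.parquet("/data/precomputed/scores")  # hypothetical path
    # A deployed API or Shiny application can then load this pre-queried
    # dataset directly instead of issuing Spark queries at request time.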

Spark Usage

Sparkmagic

The Sparkmagic project documentation provides a good introduction to its usage.

A full list of configuration properties for Spark on YARN is available in the official Spark documentation. These properties are configurable via the conf key of the %%configure Sparkmagic JSON object.
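
For example, a cell like the following sets executor resources for the session; the -f flag tells Sparkmagic to drop and recreate the current session with the new configuration, and the property values shown are illustrative:

    %%configure -f
    {"conf": {"spark.executor.memory": "4g", "spark.executor.cores": "2"}}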

Spark in RStudio

Documentation on how to access Spark from RStudio via the sparklyr library is available on the sparklyr project site.

Note

We recommend launching a compute resource with at least 2 GB of memory and 1 CPU for Spark sessions.