Using Spark on the Platform

The Platform includes an Apache Spark integration. Users can connect to a pre-configured remote Spark cluster via Jupyter and RStudio.


Note: Cloudera Spark 1.6 does not support Python 3.6.


Spark integration for Jupyter is supported via the following Sparkmagic kernels:

  • Spark (Scala)
  • PySpark (Python 2)
  • PySpark3 (Python 3)
  • SparkR (R)

Platform Jupyter notebooks connect to Spark using Livy, an open source REST interface for interacting with Apache Spark. In our integration, Livy runs in embedded mode and connects to Spark in cluster mode, which means that each Spark session's driver runs on the remote Spark cluster.
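To make the Livy interaction concrete, the following is a minimal sketch of how a client creates a Spark session through Livy's REST API. The endpoint URL is a placeholder (the Platform configures the real host and port), and the payload values are illustrative; the `kind` values mirror the Sparkmagic kernels listed above.

```python
import json
from urllib import request

# Placeholder endpoint; the Platform's integration supplies the real Livy host/port.
LIVY_URL = "http://livy-host:8998"


def build_session_payload(kind, driver_memory="2g"):
    """Build the JSON body Livy expects when creating a new Spark session.

    `kind` selects the session language: "spark" (Scala), "pyspark",
    "pyspark3", or "sparkr".
    """
    return {"kind": kind, "driverMemory": driver_memory}


def create_session(kind):
    """POST /sessions to ask Livy for a new Spark session.

    Because Livy submits in cluster mode, the session's driver is started
    on the remote Spark cluster, not on the notebook host.
    """
    body = json.dumps(build_session_payload(kind)).encode("utf-8")
    req = request.Request(
        f"{LIVY_URL}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)  # response contains the session id and state
```

Sparkmagic performs these calls for you behind the kernel; the sketch only shows what happens on the wire.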


RStudio support uses the sparklyr or the SparkR library.


Spark support is built into Zeppelin via the Spark interpreter. See the Zeppelin documentation on this interpreter for more information.


For workflows that expect low latency (e.g. deployed APIs or Shiny applications), we do not recommend querying directly against your data source with Spark, as the query response times can be significant. Instead, you could use Spark within an Interactive Session to write pre-queried data to another location for use in a deployed API or Shiny application.
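The pre-querying pattern above can be sketched as follows. All names and paths are illustrative, and the Spark query itself is stood in for by whatever rows you pass to the precompute step; the point is that the deployed API reads a cheap local artifact instead of issuing a Spark query per request.

```python
import csv
import tempfile
from pathlib import Path

# Illustrative cache location; in practice this would be shared storage
# reachable by both the Interactive Session and the deployed API.
CACHE = Path(tempfile.gettempdir()) / "precomputed_results.csv"


def precompute(rows):
    """Run inside the Interactive Session: query with Spark (not shown),
    then persist the result for low-latency reads."""
    with CACHE.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "total"])
        writer.writerows(rows)


def serve_request(region):
    """Run inside the deployed API or Shiny app: read the cached result
    instead of querying the data source through Spark."""
    with CACHE.open(newline="") as f:
        for row in csv.DictReader(f):
            if row["region"] == region:
                return float(row["total"])
    return None
```

For example, after `precompute([("east", "10.5")])`, a call to `serve_request("east")` returns `10.5` without touching Spark.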

Spark Usage


Sparkmagic provides a good introduction to usage here.

A full list of Spark configuration properties for Spark on YARN can be found here. These are configurable via the conf key/value property of the %%configure Sparkmagic JSON object.
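For example, a notebook cell like the following sets executor resources through the conf property (the values shown are illustrative; the -f flag forces an existing session to be recreated with the new settings):

```
%%configure -f
{"conf": {"spark.executor.memory": "4g", "spark.executor.cores": 2}}
```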

Spark in RStudio

You can find documentation on how to access Spark via the sparklyr library from RStudio here.
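As a minimal sketch, a sparklyr connection through Livy looks like the following; the master URL is a placeholder for the Platform's configured Livy endpoint, and the version should match your cluster's Spark version:

```r
library(sparklyr)

# Placeholder endpoint; the Platform supplies the real Livy host/port.
sc <- spark_connect(
  master  = "http://livy-host:8998",
  method  = "livy",
  version = "1.6.2"
)

# Use sc with dplyr/sparklyr verbs, then disconnect when finished.
spark_disconnect(sc)
```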


We recommend users launch at least a 2GB/1 CPU compute resource for Spark sessions.