Using Spark on the Platform
The DataScience.com Platform includes an Apache Spark integration. Users can connect to a pre-configured remote Spark cluster via Jupyter and RStudio.
Note that Cloudera Spark 1.6 does not support Python 3.6.
Spark integration for Jupyter is supported via the following Sparkmagic kernels:
- Spark (Scala)
- PySpark (Python 2)
- PySpark3 (Python 3)
- SparkR (R)
DataScience.com Platform Jupyter notebooks connect to Spark using Livy, an open source REST interface for interacting with Apache Spark. In our integration, Livy runs in embedded mode and connects to Spark in cluster mode, which means that each Spark session's driver runs on the remote Spark cluster.
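Because Livy is a REST service, Spark sessions can also be created and driven with plain HTTP calls. The sketch below builds the JSON payloads for Livy's /sessions and statements endpoints; the host name is an assumption (8998 is Livy's default port), not a Platform-specific value.

```python
import json

# Hypothetical Livy endpoint; replace with your cluster's Livy host.
LIVY_URL = "http://livy-host:8998"

def session_payload(kind="pyspark", conf=None):
    """Build the JSON body for POST /sessions (creates a Spark session)."""
    body = {"kind": kind}
    if conf:
        body["conf"] = conf  # e.g. {"spark.executor.memory": "2g"}
    return body

def statement_payload(code):
    """Build the JSON body for POST /sessions/{id}/statements."""
    return {"code": code}

if __name__ == "__main__":
    # With the `requests` library installed, creating a session would look like:
    #   requests.post(LIVY_URL + "/sessions",
    #                 data=json.dumps(session_payload()),
    #                 headers={"Content-Type": "application/json"})
    print(json.dumps(session_payload(conf={"spark.executor.memory": "2g"})))
```

In notebook use, the Sparkmagic kernels handle these calls for you; the REST interface is mainly useful for debugging sessions or scripting against the cluster outside Jupyter.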
RStudio support is provided through the sparklyr and SparkR libraries.
Spark support is built into Zeppelin via the Spark interpreter. See the Zeppelin documentation on this interpreter for more information.
For workflows that expect low latency (e.g. deployed APIs or Shiny applications), we do not recommend querying directly against your data source with Spark, as the query response times can be significant. Instead, you could use Spark within an Interactive Session to write pre-queried data to another location for use in a deployed API or Shiny application.
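The pre-query pattern above can be sketched in plain Python: a batch step (in practice, a Spark job run in an Interactive Session) writes its results to a file, and the low-latency API answers requests from that file alone. The function names and sample data are illustrative, not part of the Platform API; the Spark query itself is stubbed out in a comment.

```python
import json
import os
import tempfile

def precompute(out_path):
    """Batch step: in practice a Spark query; stubbed here with sample data."""
    # e.g. results from: spark.sql("SELECT region, SUM(sales) ... GROUP BY region")
    results = {"us-east": 1200, "eu-west": 800}
    with open(out_path, "w") as f:
        json.dump(results, f)

def serve(key, cache_path):
    """API step: answer requests from the pre-queried file, never from Spark."""
    with open(cache_path) as f:
        cache = json.load(f)
    return cache.get(key)

if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "sales_cache.json")
    precompute(path)       # run periodically in an Interactive Session
    print(serve("us-east", path))  # fast lookup, no Spark round trip
```

The same handoff works with any storage the deployed API can read quickly, such as Parquet files on shared storage or a relational database table.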
The Sparkmagic project documentation includes a helpful introduction to usage.
A full list of Spark configurations for Spark on YARN can be found in the official Spark documentation. These are configurable via the conf key/value property of the %%configure Sparkmagic JSON object.
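For example, a notebook cell using the %%configure magic might set executor resources before the session starts; the specific values below are illustrative (the -f flag forces the current session to restart with the new settings):

```
%%configure -f
{
    "conf": {
        "spark.executor.memory": "4g",
        "spark.executor.cores": "2"
    }
}
```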