Working in a Session

Sessions combine interactive data science tools with packages and compute resources. Sessions are perfect for iterative analytical work, such as exploratory data analysis or feature engineering. The Platform currently supports Jupyter, RStudio, and Zeppelin.

When you launch a session, you can select from the default set of environments created by your administrator. Once inside a session, you can install additional libraries just as you would on a regular laptop (for example, using pip in Python). To learn more, see the Environments section.

The DataScience.com Platform currently supports three interactive session tools:

  • Jupyter: Jupyter is a staple in the Python open source data community, and it also has kernels for R and many other languages. For more resources, see the Project Jupyter community page.
  • RStudio: RStudio is a full-featured development environment primarily for R programmers. The DataScience.com Platform supports the open source version of RStudio. For more information, see the RStudio documentation.
  • Zeppelin: Zeppelin is an interactive notebook tool from Apache with support for multiple languages and backends, including Spark and SQL. For more information, see the Apache Zeppelin website.

Launch a Session

To start a session, select Launch a Session from the project actions drop-down in the upper right, then configure the following options:

  • Branch: Choose the branch of your repo that you’ll work on. The files from the most recent commit on that branch will be available in the session.
  • Name: Optionally name the session to help you keep track of multiple concurrent sessions.
  • Tool: Choose an interactive tool to use in your session: Jupyter, RStudio, or Zeppelin.
  • Compute Resource: Select from a list of machine sizes specified by your Administrator.
  • Environment: Choose a set of pre-installed libraries. For more on environments, see the Environments and Dependencies page.
  • Additional Requirements: Install additional dependencies at runtime from a text file. For more on additional requirements, see the Environments and Dependencies page.
_images/Launch-a-session.png
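As an illustration of the Additional Requirements option, the sketch below writes a small requirements file from a terminal. It assumes a Python session and the standard pip requirements format; the file name and the listed packages are hypothetical, and the format your installation expects may differ (see the Environments and Dependencies page).

```shell
# Write a hypothetical requirements file.
# Assumption: the Platform accepts the standard pip requirements format.
cat > requirements.txt <<'EOF'
# One package per line; pin versions for reproducible sessions
numpy==1.16.4
pandas>=0.24,<0.26
requests
EOF

# Sanity check: count the entries that are not comments
grep -vc '^#' requirements.txt
```

Pinning exact versions keeps sessions launched at different times consistent with one another, at the cost of updating the file by hand.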

You can navigate back to a running session from your project’s Activity tab, or from the Running Resources menu next to your avatar drop-down, shown here:

_images/Active-sessions.png

Notes on Using Zeppelin

For Zeppelin sessions, you will additionally be prompted to specify the Notes Directory. This is the directory where all of your notebooks will be synced in your project repository. Zeppelin will automatically save your notebook contents as a .json file in a directory with a randomly generated name. To keep your repository clean and organized, we recommend making a dedicated directory to house all of your Zeppelin notebooks.
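One way to prepare a dedicated notes directory is sketched below. The directory name is hypothetical, and the sketch assumes you run it from the root of your project repository before launching the Zeppelin session; whether the directory must exist beforehand is not stated here.

```shell
# Hypothetical directory name; choose whatever fits your project
NOTES_DIR="zeppelin-notes"

# Create the directory (no error if it already exists)
mkdir -p "$NOTES_DIR"

# Git does not track empty directories, so add a placeholder file
touch "$NOTES_DIR/.gitkeep"
```

You would then enter the same path as the Notes Directory when launching the session.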

In a Zeppelin session, you will have access to multiple language interpreters. The following interpreters are automatically installed in the DataScience.com Platform Zeppelin tool:

  • Spark (includes Spark, Scala, PySpark, SparkR, SQL)
  • Shell
  • JDBC (includes Hive)
  • Python
  • HDFS
  • Markdown

Warning

Running the Zeppelin Spark interpreter in yarn-client mode is not supported with Cloudera and Spark v1.6. In that configuration, only local mode is supported.

Warning

For Zeppelin sessions, you must use a compute resource with at least 1 GB of memory.

Sync Changes

Sessions follow a traditional Git workflow, just as on a personal computer: the session clones a branch, the Sync menu stages your changes automatically, and you push them back to the Git remote with a commit message.

After you’ve made some changes to your files in a session, you can save them by syncing back to the Git repo. From the top Platform chrome bar in your session, click the Sync Changes button.

_images/Sync-changes.png

On the Sync menu, you’ll see which files have been added, deleted, or modified. Using the checkboxes, you can select which files you would like to sync. You can enter an optional commit message and then sync your changes back to the Git repo.

_images/Sync-changes-menu.png

Warning

Be mindful of file sizes. Most Git providers limit the size of the files you can store. For example, GitHub limits files to 100 MB. The DataScience.com Platform web app also has an upload/download limit of 200 MB, which affects downloading files from the Jupyter file browser.
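Before syncing, you can check for oversized files from a terminal in your session. The following is a minimal sketch, not a Platform feature: find_large_files is a hypothetical helper, and the M size suffix assumes GNU or BSD find.

```shell
# find_large_files is a hypothetical helper, not a Platform command.
# It lists files larger than a limit (in MiB), skipping Git's own metadata.
find_large_files() {
    dir="$1"
    limit_mib="$2"
    # -size +NM matches files whose size, rounded up to whole MiB, exceeds N
    find "$dir" -name .git -prune -o -type f -size +"${limit_mib}"M -print
}

# Example: report anything over the 100 MB GitHub limit in the current directory
find_large_files . 100
```

Any file the command prints is a candidate for unchecking in the Sync menu or for storing outside the repository.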

If the file changes you’ve made don’t conflict with changes your team has made since you started your session, the Platform will push all your files as a new commit to the active branch.

If there are conflicts, you’ll have two choices:

  • Cancel: This option resets your Git state to what it was before you clicked Sync. You can keep working and resolve the conflicts manually using the Jupyter or RStudio file editors.
  • Create Branch: This option creates a new branch and pushes your changes to that branch. The parent of the branch will be the commit that was originally loaded into your session.
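If you choose Cancel, a quick way to locate the files that need manual attention is to search for Git's conflict markers. The sketch below assumes the cancelled sync leaves the standard markers in your working files, which is consistent with the plain git reset listed under Git Commands Behind the Scenes but is not confirmed Platform behavior. list_conflicted_files is a hypothetical helper.

```shell
# list_conflicted_files is a hypothetical helper, not a Platform command.
# It prints every file under a directory that still contains conflict markers.
list_conflicted_files() {
    dir="$1"
    # Conflict markers begin a line with seven '<' or '>' characters and a space
    grep -rlE --exclude-dir=.git '^(<<<<<<<|>>>>>>>) ' "$dir"
}

# grep exits non-zero when nothing matches, so report that case explicitly
list_conflicted_files . || echo "No conflict markers found"
```

Open each listed file, keep the lines you want, delete the marker lines, and then sync again.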

Git Commands Behind the Scenes

Below are the exact commands that run for each Sync feature.

Loading the Sync menu:

git status

Sync action:

git add .
git commit -m "<message you provide>"
git fetch
git merge <branch you chose when launching> --no-commit --no-ff

Cancelling a sync after a conflict:

git reset

Creating a new branch after a conflict:

git branch <name you provide>

Shut Down a Session

A session will run and consume compute resources until you stop it. To shut down a session, click the Shutdown button in the top Platform chrome bar in your session.

_images/Shutdown.png

Warning

You can’t recover unsaved changes from a session after shutting down. If you want to save the work you have done, make sure to sync your files before shutting down.
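If your session tool offers a terminal, you can confirm that nothing is left unsynced before shutting down. This is a minimal sketch that assumes the session's working directory is the cloned project repository; has_unsynced_changes is a hypothetical helper, not a Platform command.

```shell
# has_unsynced_changes is a hypothetical helper, not a Platform command.
# It succeeds (exit status 0) when the repository in the given directory has
# added, modified, or deleted files that have not been committed yet.
has_unsynced_changes() {
    [ -n "$(git -C "$1" status --porcelain)" ]
}

if has_unsynced_changes .; then
    echo "Unsynced changes found: sync before shutting down"
fi
```

The check only looks at uncommitted files, which matches what the Sync menu displays; it does not inspect the remote.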