Environments are customized, pre-installed collections of dependencies and packages that can be created by Admins and distributed to Users on the DataScience.com Platform. Users cannot do work on the Platform until environments have been built, so it is critical that an Admin create environments during the installation process. Additional environments can be built anytime thereafter.
In this guide, you will learn how to build environments on behalf of Users.
The DataScience.com Platform uses Docker to containerize workloads on your instance. Docker containers allow Users to spin up isolated work environments that have all of the software needed for their analysis pre-installed. A Docker container is a running instance created from a Docker Image. Docker Images are immutable files that define the runtime of containers. When a User runs a script or launches a Jupyter session on the Platform, they are running a container created from a Docker Image that is stored in an internal Docker Registry. The Environments feature allows Admin users to create their own Docker Images and submit them to the Docker Registry by writing a Dockerfile within the Platform interface.
If you are unfamiliar with Docker and Dockerfiles, check out Docker’s documentation for more details. Please also refer to our Dockerfile Basics and Best Practices documentation for several example Dockerfiles and best practices.
What is an Environment?¶
An environment is defined by a Dockerfile and is associated with metadata such as name, description, and a README. There are two categories of environments: Base and User.
On the DataScience.com Platform, all environments except for the Default Base environment must inherit from a pre-existing Base environment. Any Base environment can be extended by other Base and User environments that inherit from it. User environments, on the other hand, cannot be used to seed other environments.
It is convenient to envision these relationships as an inheritance tree. Default Base is the root node of this tree. Any Base or User environments that are subsequently built are nodes branching off of it, with User nodes always representing leaf nodes. In this analogy, the main difference between Base and User environments is that the former can become parent nodes while the latter can only be a child node of a Base environment (i.e., a leaf).
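The inheritance rules above can be sketched as a small tree model. This is only an illustration of the relationships described here, not Platform code: the `Environment` class, its `extend` and `lineage` methods, and the environment names are hypothetical.

```python
from typing import List, Optional


class Environment:
    """One node in the environment inheritance tree (illustrative model only)."""

    def __init__(self, name: str, kind: str, parent: Optional["Environment"] = None):
        if kind not in ("base", "user"):
            raise ValueError(f"unknown environment kind: {kind!r}")
        self.name = name
        self.kind = kind
        self.parent = parent
        self.children: List["Environment"] = []

    def extend(self, name: str, kind: str) -> "Environment":
        """Create a child environment that inherits from this one."""
        if self.kind != "base":
            # User environments are always leaves: nothing can inherit from them.
            raise ValueError(f"{self.name!r} is a User environment and cannot be extended")
        child = Environment(name, kind, parent=self)
        self.children.append(child)
        return child

    def lineage(self) -> List[str]:
        """Return the environment names from the root down to this node."""
        names = []
        node: Optional["Environment"] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


default_base = Environment("Default Base", "base")        # root of the tree
custom_base = default_base.extend("Custom Base", "base")  # may become a parent
user_env = custom_base.extend("Analytics", "user")        # always a leaf
```

Calling `user_env.lineage()` walks back up to the root, while calling `user_env.extend(...)` raises an error, which mirrors the rule that only Base environments can be parents.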
A Base environment contains many of the fundamental packages that are necessary for a DataScience.com Platform container to run and connect to data in your Instance. There are three types of Base environments: Default, Custom, and Hadoop-enabled.
The Default Base Environment¶
The Default Base environment initializes the tree and must be built first. This environment is provided by DataScience.com; it contains the software that ensures containers will spawn and function successfully on the Platform. You cannot modify this environment’s Dockerfile, but you can extend it when you create custom or Hadoop Base environments.
The Default Base environment cannot be edited or deleted.
Customized Base Environments¶
Base environments can be customized to include the common languages and dependencies that you want to be readily available across environments. A Custom Base environment can extend another Custom Base environment by adding Dockerfile commands.
A User environment is an image that can be launched in projects on the DataScience.com Platform because it contains the tooling needed for User actions (launching a Jupyter, RStudio, or Zeppelin session; running or scheduling a script; publishing an application; deploying an API). It is an additional place where the Dockerfile can be extended with customizations.
How to Create Environments¶
Before You Begin¶
Prior to creating environments, please heed the following best practices to ensure a successful build:
- Do not build multiple environments (Base or User) concurrently.
- Set up a Git repository to store the Dockerfiles and context files used to build environments. This practice not only makes iteration and troubleshooting much easier, but also enables tracking of an environment’s history. Add the commit hash and repo URL to the environment’s description.
- To ensure that Platform users have consistent, standardized workspaces, do not rebuild environments frequently. We recommend testing and updating versions of packages, languages, libraries, drivers, and tools once a year.
- To keep the frequency of updates low, we do not recommend that you rebuild environments to add small numbers of packages. The additional packages can be installed within an end user’s session at runtime instead.
- If you want to replace an existing environment, build the new environment first and test it fully to ensure it performs as expected. Do this before deleting the older environment to be replaced.
- Before adding new packages to your environment, test the installation in a Jupyter, RStudio, or Zeppelin session first. This will allow you to identify any additional dependencies that may be needed for that package.
- If you are replacing the version of Python or R that comes with the Default Base environment (e.g. you want Python 3.6 or R 3.4.2), set up a Custom Base that has a minimal installation of the new version (few dependencies present) and then expand upon this Custom Base with the desired dependencies (e.g. comprehensive package list or Hadoop dependencies).
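As one way to follow the Git practice above, the environment description can be generated from the repository itself so that it always records which commit was built. This is a minimal sketch: the helper names and the description layout are assumptions, and it relies only on the standard `git rev-parse HEAD` command.

```python
import subprocess


def current_commit(repo_dir: str = ".") -> str:
    """Return the full hash of the commit currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        check=True,            # raise if git fails (e.g. not a repository)
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def build_description(summary: str, repo_url: str, commit: str) -> str:
    """Combine a summary with the repo URL and commit for traceability."""
    # The layout of this string is an illustrative choice, not a Platform format.
    return f"{summary} (source: {repo_url} @ {commit})"
```

The resulting string can be pasted into the description field when the environment is created, for example `build_description("Python 3.6 base", repo_url, current_commit())` with your own repository URL.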
After installation is complete, navigate to the Environments page via the Environments link in the menu bar at the top of the page.
When you arrive on this page for the first time, you will see a button that allows you to create the first environment: the Default Base Environment. When you click Add Environment, the build process will be kicked off.
On the next page, you will be able to see the logs from the Docker build process. This should take a few minutes to complete. Once complete, click Confirm & Save. You can now use this as the basis for other environments and extend it as needed.
Once the Default Base environment is created, you can continue to customize your Base environment (recommended). At this stage, it is a good practice to create Base environments with the languages (and their versions), package managers, drivers, and common packages used by your organization.
To create a custom Base environment, click Add Environment > Base Environment on the Environments screen.
Give your new environment a name and description, and upload a README that tells your users a little more about how to use the environment and its intended purpose. Then, select Custom Dockerfile.
Choose another Base environment that you want to extend in the drop-down. This custom Base environment will have all the dependencies present in the environment that it is inheriting FROM. When you are finished with this custom Base environment, you will see it in this drop-down for future customization.
In the Dockerfile text area, enter Dockerfile commands for installation of your intended packages. Refer to our Dockerfile Basics and Best Practices documentation for more information about how to write Dockerfiles in the DataScience.com Platform. There are a few restrictions on the commands that you can enter in this area:
- No absolute paths
- No tool installation
In the next step, you can upload a .tar file that contains any context files that you reference in your Dockerfile. A common context file to include in the .tar is a text file of requirements that lists all packages that will be installed.
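A context archive like the one described above can be produced with Python's standard `tarfile` module. This is a sketch under assumptions: the function name, directory, and file names are hypothetical, and it assumes the Platform expects files at the top level of an uncompressed .tar, which you should confirm against your own Dockerfile references.

```python
import tarfile
from pathlib import Path
from typing import Iterable


def build_context_tar(context_dir: str, filenames: Iterable[str], tar_path: str) -> str:
    """Pack the named files from context_dir into an uncompressed .tar archive."""
    root = Path(context_dir)
    with tarfile.open(tar_path, "w") as tar:
        for name in filenames:
            # arcname keeps paths inside the archive relative to the context root,
            # so the Dockerfile can reference them without absolute paths.
            tar.add(str(root / name), arcname=name)
    return tar_path
```

For example, `build_context_tar("my-env", ["requirements.txt"], "context.tar")` would bundle a requirements file from a hypothetical `my-env` directory into `context.tar`, ready for upload.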
When your new Base environment is customized to your specifications, you are ready to move on to the build step. On the next page, you can see the logs from the installation process. It’s important to check these logs for any errors or skipped package installations. If the build does not complete to your satisfaction, click Cancel & Discard to start over. When you are satisfied with your build, click Confirm & Save to make it available.
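If you copy the build logs into a text file, a simple filter can help surface lines worth a closer look. The sketch below is not a Platform feature, and the marker words are assumptions rather than a definitive list. Adjust them to whatever your package managers actually print.

```python
import re
from typing import List

# Hypothetical markers of trouble; extend or replace these for your own tooling.
SUSPICIOUS = re.compile(
    r"\b(error|failed|skipping|skipped|could not|not found)\b",
    re.IGNORECASE,
)


def flag_log_lines(log_text: str) -> List[str]:
    """Return the log lines that contain at least one suspicious marker."""
    return [line for line in log_text.splitlines() if SUSPICIOUS.search(line)]
```

A filter like this only narrows down where to look. It does not replace reading the logs before you click Confirm & Save, because a build can skip a package without printing any of these words.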
Your new Base Environment is now available and you can see more detailed information about it by clicking on its listing or card in the Environments page.
In order to enable Hadoop and Spark connectivity for users on the Platform, there must be an available Hadoop-enabled environment. To create a Hadoop Base environment, click Add Environment > Base Environment on the Environments screen.
Give your new environment a name and description, and upload a README to inform your users about your environment and how to use it. Then, select Install Hadoop Dependencies.
Choose another Base environment that you want to extend in the drop-down. This new Hadoop Base environment will have all the dependencies present in the environment that it is inheriting FROM. When you are finished with this Hadoop Base environment, you will see it in this drop-down for future customization.
Select your Hadoop distribution provider and version number in the drop-down form fields. Once you’ve made your selection, you can enable the frameworks that your users require such as Spark, Hive, and Impala by selecting the version number from the drop-downs. If you wish to disable any of these frameworks, select Disabled.
Once you have made your selections, you can kick off the build. While the environment is building, you can see the logs to observe progress and any errors. You can cancel the build at any time by clicking Cancel & Discard. When the build is complete, you can click Confirm & Save to make it available. When a Base environment is available, you will be able to build other Base and User environments from it to add tools and further customizations.
Create a User Environment¶
To make tools available to the users of your instance, create a User environment by selecting Add Environment > User Environment on the Environments page.
As with the Base environment, enter a name, description, and README that will help your users understand what is included in this environment and its intended use.
As with custom Dockerfile Base environments, you can select a Base environment to extend and add additional commands in the Dockerfile text area. Use the Upload TAR button to submit context files referenced in your Dockerfile.
Next you must select the tool environments that you want to build to make available to users. In general, it’s good to build as many of these as makes sense for the packages you’ve installed. For example, if your environment contains only Python 3 and various Python packages, it makes sense to build Jupyter, Deployed API, and Script Run but not RStudio or Shiny for RStudio. If your environment has a mixture of both Python and R packages, it makes sense to build all available tools. Any tools not selected at this stage can be added once the environment has been built.
Once you are satisfied with your environment’s specifications, click Next Step: Build to move on to the build step where you can observe the installation logs and the tool building progress.
Click Cancel & Discard to start over; click Confirm & Save to make the environment available to Users. Once this is complete, Users will be able to launch these tools within their projects. See the user documentation on Environments and Dependencies for more information.
DataScience.com Standard Example Environments¶
During the installation process, your Admin will have built the DataScience.com Standard Example Base and User environments. These environments have Python 2.7 and R 3.3.3.
The DataScience.com Standard Example Base environment inherits from the Default Base Environment and expands upon it with a selection of popular Python and R data science packages that we have curated. In turn, the DataScience.com Standard Example User environment inherits from the Standard Example Base environment and has all the Platform tools enabled. Please refrain from deleting or modifying these environments. They have been designed to be fully compatible with all user onboarding, engagement, and education materials that we will provide. If these environments are modified in any way, we cannot guarantee that these materials will run successfully.