Best Practices: Choosing the Right Container Size

In this article, we discuss how to allocate resources for Docker containers running on the DataScience.com Platform. If you’re new to Docker containers, please see Docker’s documentation for a helpful overview. For the purposes of this article, all you need to understand is how container resources are allocated.

All DataScience.com Platform services (launching a session, running a script, scheduling a job, deploying a model, etc.) run in containers. Containers are sandboxed environments running on your on-premises infrastructure or in the cloud. When you launch a service on the Platform, a new container is created on one of your host machines. You can specify the maximum amount of memory and CPU that the service is allowed to use on the host.

Though services running on the same host are isolated from each other, they all share the host’s CPU and memory. If too many resource-intensive containers are competing on the same host, the following may occur:

  • Services may slow down or error out.
  • You may be unable to launch new services on the host.
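
As a rough illustration of why a crowded host can refuse new services, the sketch below adds up the maximums requested by the containers already running on a host and checks whether one more would still fit. It is a simplified model rather than a description of how the Platform actually schedules containers; the ContainerLimit class, the function names, and the host sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContainerLimit:
    """Hypothetical record of the maximums requested for one service."""
    name: str
    cpu_cores: float
    memory_gb: float


def remaining_capacity(host_cpu_cores, host_memory_gb, containers):
    """Return (cpu, memory) left on the host if every container hits its maximum."""
    cpu_left = host_cpu_cores - sum(c.cpu_cores for c in containers)
    mem_left = host_memory_gb - sum(c.memory_gb for c in containers)
    return cpu_left, mem_left


def can_launch(host_cpu_cores, host_memory_gb, running, new):
    """True if the new container's maximums still fit alongside the running ones."""
    cpu_left, mem_left = remaining_capacity(host_cpu_cores, host_memory_gb, running)
    return new.cpu_cores <= cpu_left and new.memory_gb <= mem_left


# Made-up example: a host with 8 cores and 32 GB of memory.
running = [
    ContainerLimit("session", cpu_cores=4, memory_gb=16),
    ContainerLimit("scheduled-job", cpu_cores=2, memory_gb=8),
]
small = ContainerLimit("model", cpu_cores=1, memory_gb=4)
large = ContainerLimit("big-session", cpu_cores=4, memory_gb=16)

print(can_launch(8, 32, running, small))  # True: 2 cores and 8 GB are still free
print(can_launch(8, 32, running, large))  # False: only 2 cores remain
```

In this model, shrinking the maximums of the two running services is what makes room for the larger one, which is the motivation for sizing containers carefully.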

Strategic Resource Allocation

To avoid these problems, allocate resources for your containers strategically: give each container enough resources to run your code, and write code that requires as few resources as possible.

To determine the CPU and memory requirements of your code, use the following profiling tools:

  • memory_profiler for memory usage in Python
  • psutil for CPU usage in Python
  • profr, profvis, and the RStudio profiler for memory usage in R
  • top or htop for memory and CPU usage of any script
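
To show what such a measurement looks like in practice, here is a minimal sketch that uses only the Python standard library (tracemalloc and time.process_time) as a stand-in for memory_profiler and psutil, so that it runs without extra packages. The profile helper and the build_squares workload are hypothetical names introduced for illustration. Note that tracemalloc only tracks memory allocated by Python itself, so the total footprint of the process, as reported by top or htop, will be larger.

```python
import time
import tracemalloc


def profile(func, *args, **kwargs):
    """Run func once; return its result, CPU seconds used, and peak traced bytes."""
    tracemalloc.start()
    cpu_start = time.process_time()
    try:
        result = func(*args, **kwargs)
        cpu_seconds = time.process_time() - cpu_start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, cpu_seconds, peak_bytes


def build_squares(n):
    """Example workload: materialize a list of n squares in memory."""
    return [i * i for i in range(n)]


result, cpu_seconds, peak_bytes = profile(build_squares, 200_000)
print(f"CPU time: {cpu_seconds:.3f} s, peak memory: {peak_bytes / 1e6:.1f} MB")
```

Running your real workload through a helper like this on a representative dataset gives a starting point for the memory and CPU maximums to request, with some headroom added on top.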

To determine the amount of resources available in your environment, Admin users can view the DataScience.com Platform Resources page, accessible through the avatar dropdown.

To avoid overutilizing resources:

  • Schedule jobs at off-peak times.
  • Terminate and close unused models and sessions.
  • (If enabled) use on-demand resources when few resources are available.

Code Best Practices

To reduce the footprint of your code, here are some tips:

  • Use out-of-core tools when working with large datasets, such as dask for Python or bigmemory for R.
  • Use sparse matrices when working with sparse data.
  • Use batch training when possible.
  • Delete used objects and object references to enable garbage collection.
  • Employ vectorization techniques whenever possible (available through libraries like pandas and NumPy).
  • Make use of libraries that implement efficient data structures and algorithms and avoid manually implementing machine learning algorithms unless necessary.
  • Omit uninformative variables from learning algorithms and models. Use feature selection algorithms to reduce your feature space.
  • Generally speaking, use Python 3 instead of Python 2. Python 3 contains several optimizations over Python 2; for example, range, map, and zip return lazy iterables instead of building full lists.
  • Use generators when possible to avoid holding iterables in memory.
  • If reusing a model, load serialized objects in lieu of retraining, or initialize models with pretrained weights.
  • Cache intermediary results.
  • Only import what you need. For example: in Python, if you’re just using a single function from a large library, just import the function.
  • Push numerical calculations to more efficient languages like C/C++ or FORTRAN (see Cython, Rcpp).
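
A few of the tips above (generators, deleting unused references, and caching intermediary results) can be sketched with the standard library alone. This is a minimal illustration under assumed names: N, slow_square, and call_count are hypothetical, and the sizes printed will vary by Python version and platform.

```python
import sys
from functools import lru_cache

N = 1_000_000

# A list holds all N results in memory at once.
squares_list = [i * i for i in range(N)]
# A generator expression produces one result at a time.
squares_gen = (i * i for i in range(N))

# getsizeof reports the container object only, not the integers it refers to,
# so the true footprint of the list is even larger than shown here.
list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)
print(f"list: {list_bytes} bytes, generator: {gen_bytes} bytes")

# Consuming the generator never materializes the full sequence.
total = sum(squares_gen)

# Drop the reference once the list is no longer needed so it can be reclaimed.
del squares_list

call_count = 0


@lru_cache(maxsize=128)  # bounded cache, so the cache itself cannot grow forever
def slow_square(x):
    """Stand-in for an expensive computation; counts how often it really runs."""
    global call_count
    call_count += 1
    return x * x


slow_square(12)
slow_square(12)  # served from the cache, so the function body does not run again
print(call_count)  # 1
```

Caching trades memory for CPU time, so a bounded cache size is usually the safer choice inside a container with a fixed memory maximum.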