SLURM integration

Calkit can run pipeline stages on a SLURM job scheduler using the slurm environment kind. The calkit slurm CLI can then be used to monitor these jobs by their name in the context of a project.

For example, let's create a calkit.yaml file with a slurm environment and two sbatch stages:

# In calkit.yaml
environments:
  my-cluster:
    kind: slurm
    host: my.cluster.somewhere.edu

pipeline:
  stages:
    sim:
      kind: sbatch
      environment: my-cluster
      script_path: scripts/run-sim.sh
      inputs:
        - config/my-sim-config.yaml
      outputs:
        - results/all.h5
      slurm:
        options:
          - --time=60
    post-process:
      kind: sbatch
      environment: my-cluster
      script_path: scripts/post.sh
      inputs:
        - results/all.h5
      outputs:
        - results/post.h5
        - figures/myfig.png
      slurm:
        options:
          - --gpus=1
          - --time=20

When we call calkit run from the project directory on the host my.cluster.somewhere.edu, the job for the sim stage will be submitted. By default, Calkit waits for the job to finish, but it is robust to disconnection: if you disconnect and reconnect (or simply exit with ctrl+c), calling calkit run again will check whether the job is still running and, if so, resume waiting for it.

If we wanted to submit both jobs at the same time, we could call calkit run sim, press ctrl+c to stop waiting, then call calkit run post-process.

To check the status of any of the project's jobs, we can call calkit slurm queue. To cancel a job, we can refer to it by name, e.g., calkit slurm cancel post-process.
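
The entries under slurm.options in the stages above look like ordinary sbatch command-line flags. As a minimal sketch, assuming Calkit forwards each entry to sbatch unchanged (an assumption, not something confirmed here), a stage that needs more memory and CPU cores might be declared as follows. --mem and --cpus-per-task are standard sbatch flags; the values shown are purely illustrative.

```yaml
# In calkit.yaml (fragment; only the sim stage is shown)
pipeline:
  stages:
    sim:
      kind: sbatch
      environment: my-cluster
      script_path: scripts/run-sim.sh
      inputs:
        - config/my-sim-config.yaml
      outputs:
        - results/all.h5
      slurm:
        options:
          # Assumption: each entry is passed to sbatch as-is
          - --time=60 # sbatch reads a bare number as minutes
          - --mem=8G # illustrative memory request
          - --cpus-per-task=4 # illustrative CPU core request
```

Keeping resource requests in calkit.yaml rather than in #SBATCH lines inside the script puts them next to the stage's inputs and outputs, where they are easy to review alongside the rest of the pipeline.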

Running non-shell (Python, Julia, R) scripts and notebooks

It's also possible to run Python, Julia, or R scripts in their respective environments within a SLURM "outer" environment using a nested environment syntax. Setting up stages this way ensures the inner environments match their spec before running, and removes the need to activate them before running the job or to write a shell script that creates and/or activates them. Like any other pipeline stage, these stages are robust to SSH disconnects from the cluster, and their outputs are cached, with the cache invalidated if any inputs change.

Here's a simple example that uses a Conda environment to run a Python script and Jupyter notebook:

# In calkit.yaml
environments:
  my-cluster:
    kind: slurm
    host: my.cluster.somewhere.edu
  my-conda-env:
    kind: conda
    path: environment.yml

pipeline:
  stages:
    sim:
      kind: python-script
      environment: my-cluster:my-conda-env # <-- Nested env syntax
      script_path: scripts/run-sim.py
      inputs:
        - config/my-sim-config.yaml
      outputs:
        - results/all.h5
      slurm:
        options:
          - --time=60
    post-process-notebook:
      kind: jupyter-notebook
      notebook_path: notebooks/post-process.ipynb
      environment: my-cluster:my-conda-env
      inputs:
        - from_stage_outputs: sim
      slurm:
        options:
          - --gpus=1
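
Since the nested syntax takes the form outer:inner, it seems natural that different stages could pair the same SLURM environment with different inner environments. The fragment below is a sketch under that assumption; the environment name my-post-env and the file post-environment.yml are hypothetical and do not appear in the example above.

```yaml
# In calkit.yaml (fragment)
environments:
  my-cluster:
    kind: slurm
    host: my.cluster.somewhere.edu
  my-conda-env:
    kind: conda
    path: environment.yml
  my-post-env: # hypothetical second Conda environment
    kind: conda
    path: post-environment.yml # hypothetical spec file

pipeline:
  stages:
    sim:
      kind: python-script
      environment: my-cluster:my-conda-env
      script_path: scripts/run-sim.py
      inputs:
        - config/my-sim-config.yaml
      outputs:
        - results/all.h5
    post-process-notebook:
      kind: jupyter-notebook
      notebook_path: notebooks/post-process.ipynb
      # Assumption: the same outer SLURM env can wrap a different inner env
      environment: my-cluster:my-post-env
      inputs:
        - from_stage_outputs: sim
```

Splitting the inner environments this way would keep heavy simulation dependencies out of the post-processing environment, at the cost of maintaining two spec files.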