SLURM integration

Calkit can run pipeline stages on a SLURM job scheduler using the slurm environment and sbatch stage types. The calkit slurm CLI can then be used to monitor these jobs by their name in the context of a project.

For example, let's create a calkit.yaml file with a slurm environment and two sbatch stages:

# In calkit.yaml
environments:
  my-cluster:
    kind: slurm
    host: my.cluster.somewhere.edu

pipeline:
  stages:
    sim:
      kind: sbatch
      environment: my-cluster
      script_path: scripts/run-sim.sh
      inputs:
        - config/my-sim-config.yaml
      outputs:
        - results/all.h5
      sbatch_options:
        - --time=60
    post-process:
      kind: sbatch
      environment: my-cluster
      script_path: scripts/post.sh
      inputs:
        - results/all.h5
      outputs:
        - results/post.h5
        - figures/myfig.png
      sbatch_options:
        - --gpus=1
        - --time=20
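
As a rough mental model, each sbatch stage corresponds to one sbatch invocation, with sbatch_options placed before the script path. The sketch below is not Calkit's implementation. The helper sbatch_command is hypothetical, and using the stage name as the job name is an assumption made only for illustration. Note that Slurm interprets a bare number passed to --time as minutes, so --time=20 requests a 20-minute limit.

```python
import shlex


def sbatch_command(job_name, script_path, sbatch_options=()):
    """Build an sbatch argument list: options first, then the script."""
    # Hypothetical helper for illustration; Calkit may build this differently.
    return ["sbatch", f"--job-name={job_name}", *sbatch_options, script_path]


# Values copied from the post-process stage in calkit.yaml.
cmd = sbatch_command(
    job_name="post-process",  # assumption: the real job name may differ
    script_path="scripts/post.sh",
    sbatch_options=["--gpus=1", "--time=20"],
)

# Render the argument list as a single shell command line.
print(shlex.join(cmd))
# prints: sbatch --job-name=post-process --gpus=1 --time=20 scripts/post.sh
```

Seen this way, anything sbatch accepts on its command line can go in sbatch_options, while the stage's inputs and outputs stay with the pipeline definition rather than the scheduler.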

When we call calkit run from the project directory on the host my.cluster.somewhere.edu, the job for the sim stage is submitted. By default, Calkit waits for the job to finish, but it is robust to disconnection: if you disconnect and reconnect (or simply exit with Ctrl+C), calling calkit run again checks whether the job is still running and, if so, resumes waiting for it.
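
One way to picture why waiting can survive a disconnect is that the job's state lives in the scheduler, not in the process that is waiting. The sketch below is an assumption about how such a wait loop could work, not Calkit's actual code. The functions job_is_queued and wait_for_job are hypothetical, and the lookup relies on squeue listing only pending, running, and completing jobs by default.

```python
import subprocess
import time


def job_is_queued(job_name):
    """Return True if squeue lists one of our jobs with this name."""
    result = subprocess.run(
        ["squeue", "--me", "--noheader", f"--name={job_name}"],
        capture_output=True,
        text=True,
        check=True,
    )
    # With --noheader, any remaining output means a matching job exists.
    return bool(result.stdout.strip())


def wait_for_job(job_name, is_queued=job_is_queued, poll_seconds=30,
                 sleep=time.sleep):
    """Block until the scheduler no longer lists the job.

    Returns the number of times the job was found still queued.
    """
    polls = 0
    while is_queued(job_name):
        polls += 1
        sleep(poll_seconds)
    return polls
```

Interrupting a loop like this with Ctrl+C only stops the polling. The job itself keeps running on the cluster, and a later call can pick up where it left off by asking the scheduler again.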

To submit both jobs at the same time, we can call calkit run sim, press Ctrl+C to stop waiting, and then call calkit run post-process.

To check the status of any of the project's jobs, we can call calkit slurm queue. To cancel a job, we can refer to it by name, e.g., calkit slurm cancel post-process.