SLURM integration
Calkit can run pipeline stages on a SLURM job scheduler
using the slurm environment and sbatch stage types.
The calkit slurm CLI can then be used to monitor these jobs
by their name in the context
of a project.
For example, let's create a calkit.yaml file with a slurm environment
and two sbatch stages:
# In calkit.yaml
environments:
  my-cluster:
    kind: slurm
    host: my.cluster.somewhere.edu
pipeline:
  stages:
    sim:
      kind: sbatch
      environment: my-cluster
      script_path: scripts/run-sim.sh
      inputs:
        - config/my-sim-config.yaml
      outputs:
        - results/all.h5
      sbatch_options:
        - --time=60
    post-process:
      kind: sbatch
      environment: my-cluster
      script_path: scripts/post.sh
      inputs:
        - results/all.h5
      outputs:
        - results/post.h5
        - figures/myfig.png
      sbatch_options:
        - --gpus=1
        - --time=20
When calling calkit run, as long as we're running from the project
directory on the host
my.cluster.somewhere.edu,
the run-sim job will be submitted.
By default, Calkit will wait for the job to finish, but will be robust
to disconnecting.
That is, if you disconnect and reconnect (or simply exit with ctrl+c),
calling calkit run will check if the job is still running and wait
for it if so.
If we wanted to submit both jobs at the same time, we could call
calkit run sim, press ctrl+c to stop waiting,
then call calkit run post-process.
If we want to check the status of any of the project's jobs, we can
call calkit slurm queue,
and if we wanted to cancel one,
we can cancel it by name, e.g.,
calkit slurm cancel post-process.