The pipeline

The pipeline defines and ties together the processes that produce the project's important assets or artifacts, such as datasets, figures, tables, and publications. It is saved in the pipeline section of the calkit.yaml file, and is compiled to a DVC pipeline (saved in dvc.yaml) when calkit run is called.

A pipeline is composed of stages, each of which has a specific type or "kind." Each stage must specify the environment in which it runs to ensure it's reproducible. Calkit automatically generates an "environment lock file" at the start of a run, so it can detect when an environment has changed and rerun the affected stages. Stages can also define inputs and outputs, and you can decide how outputs should be stored, i.e., with Git or DVC.

Any stages that have not changed since they were last run will be skipped, since their results will have been cached.

In the calkit.yaml file, you can define a pipeline (and environments) like:

# Define environments
environments:
  main:
    kind: uv-venv
    path: requirements.txt
    python: "3.13"
  texlive:
    kind: docker
    image: texlive/texlive:latest-full

# Define the pipeline
pipeline:
  stages:
    collect-data:
      kind: python-script
      script_path: scripts/collect-data.py
      environment: main
      outputs:
        - data/raw.csv
        - path: data/meta.json
          storage: git
          delete_before_run: false
    process-data:
      kind: jupyter-notebook
      notebook_path: notebooks/process.ipynb
      environment: main
      inputs:
        - data/raw.csv
      outputs:
        - data/processed.csv
        - figures/fig1.png
    build-paper:
      kind: latex
      target_path: paper/paper.tex
      environment: texlive
      inputs:
        - figures/fig1.png
        - references.bib
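For reference, each stage above is compiled into a DVC stage in dvc.yaml when calkit run is called. A rough, hypothetical sketch of what the collect-data stage might become (field values and the command wrapper are illustrative assumptions, not the literal compiled output):

```yaml
# Hypothetical sketch of a compiled stage in dvc.yaml
stages:
  collect-data:
    cmd: python scripts/collect-data.py  # in practice, wrapped to run in the 'main' environment
    deps:
      - scripts/collect-data.py
    outs:
      - data/raw.csv
      - data/meta.json:
          cache: false  # storage: git means DVC does not cache this output
```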

Stage types and unique attributes

All stage declarations require a kind and an environment, and can specify inputs and outputs. The different kinds of stages and their unique attributes are listed below. For more details, see calkit.models.pipeline.

python-script

  • script_path
  • args (list, optional)

shell-command

  • command
  • shell (optional, e.g., bash, sh, zsh; default: bash)
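For example, a shell-command stage might look like the following (the stage name, command, and use of the _system environment are illustrative):

```yaml
pipeline:
  stages:
    prepare-reports:
      kind: shell-command
      command: mkdir -p reports && cp templates/header.md reports/
      environment: _system
      shell: bash
```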

shell-script

  • script_path
  • shell (optional, e.g., bash, sh, zsh; default: bash)
  • args (list, optional)

matlab-script

  • script_path

latex

  • target_path

docker-command

  • command

r-script

  • script_path
  • args (list, optional)

julia-script

  • script_path
  • args

julia-command

  • command

sbatch

  • script_path
  • args
  • sbatch_options

This stage type runs a script with sbatch, a common way to submit jobs on a high-performance computing (HPC) cluster that uses the SLURM job scheduler.
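A hypothetical sbatch stage might look like this; the script path, the SLURM options, and the shape of sbatch_options (shown here as a list of CLI options) are illustrative assumptions:

```yaml
pipeline:
  stages:
    run-simulation:
      kind: sbatch
      script_path: scripts/simulate.sh
      args:
        - "--case=baseline"
      sbatch_options:
        - "--ntasks=32"
        - "--time=02:00:00"
      environment: _system
      outputs:
        - results/baseline
```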

Iteration

Over a list of values

pipeline:
  stages:
    my-iter-stage:
      kind: python-script
      script_path: scripts/my-script.py
      args:
        - "--model={var}"
      iterate_over:
        - arg_name: var
          values:
            - some-model
            - some-other-model
      inputs:
        - data/raw
      outputs:
        - models/{var}.h5

Over a table (or list of lists)

pipeline:
  stages:
    my-iter-stage:
      kind: python-script
      script_path: scripts/my-script.py
      args:
        - "--model={var1}"
        - "--n_estimators={var2}"
      iterate_over:
        - arg_name: [var1, var2]
          values:
            - [some-model, 5]
            - [some-other-model, 7]
      inputs:
        - data/raw
      outputs:
        - models/{var1}-{var2}.h5

Over ranges of numbers

pipeline:
  stages:
    my-iter-stage:
      kind: python-script
      script_path: scripts/my-script.py
      args:
        - "--thresh={thresh}"
      iterate_over:
        - arg_name: thresh
          values:
            - range:
                start: 0
                stop: 20
                step: 0.5
            - range:
                start: 30
                stop: 35
                step: 1
            - 41
      inputs:
        - data/raw
      outputs:
        - results/{thresh}.csv

Automatic stage and environment detection

The calkit xr command, which stands for "execute and record," can be used to automatically generate pipeline stages and environments from scripts (Python, MATLAB, Julia, R, and shell), notebooks, LaTeX source files, or shell commands.

For example, if you have a Python script in scripts/run.py, you can call:

calkit xr scripts/run.py

Calkit will attempt to detect the environment in which this script should run, creating one if necessary (it can also be specified with the -e flag). Calkit will then try to detect inputs and outputs and attempt to run the stage it created. If successful, the stage will be added to the pipeline and kept reproducible from that point onward. That is, calling calkit run again will detect whether the script, environment, or any input files have changed, and rerun the stage if so.

What commands work best with xr

xr works best when your command has a clear executable and arguments, or when the first argument is a recognized file type (for example .py, .ipynb, .tex, .jl, .R, .m, .sh).

For Docker commands:

  • docker run commands are supported.
  • For some CLI-style images (for example Mermaid CLI), Calkit converts the command into a command stage and configures Docker entrypoint mode.
  • For other images, Calkit keeps a shell-command stage, infers a Docker environment from the image, and stores the inner command (the command run inside the container) as the stage command.

What I/O xr can usually detect

I/O detection is heuristic and depends on stage kind. It is strongest for:

  • Python/R/Julia scripts with common file read/write APIs.
  • Notebooks with straightforward file reads/writes.
  • LaTeX includes and bibliography references.
  • Shell commands that use redirection (<, >, >>) and common file operations (for example cp and mv).
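For instance, a shell command with redirection lets xr infer the input and output files. This example assumes xr accepts the command as a single quoted argument; the filenames are illustrative:

```shell
calkit xr "sort data/raw.txt > data/sorted.txt"
```

Here, data/raw.txt would be detected as an input and data/sorted.txt as an output.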

For Docker shell commands, I/O detection is applied to the inner command inside docker run, not the outer Docker wrapper.

I/O detection is less reliable when paths are dynamic (constructed at runtime, read from environment variables, generated in loops, or hidden behind custom wrappers).

When needed, provide explicit paths with:

  • --input (repeatable)
  • --output (repeatable)
  • --no-detect-io to disable automatic detection completely
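For example, to declare I/O explicitly for a script whose paths are constructed at runtime (the paths and flag placement here are illustrative; both flags are repeatable):

```shell
calkit xr scripts/run.py --input data/raw.csv --output results/summary.csv
```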

How environment detection works

At a high level, xr chooses environments in this order:

  1. Use --environment if provided.
  2. Reuse an existing matching stage environment when possible.
  3. Infer from stage language and dependencies:
       • Python: typically pyproject.toml, requirements.txt, environment.yml, or a generated Python environment spec.
       • R: typically DESCRIPTION or a generated renv spec.
       • Julia: typically Project.toml or a generated Julia project spec.
       • LaTeX: typically a Docker LaTeX environment.
  4. For shell commands:
       • docker run ... can infer a Docker environment from the image.
       • Non-Docker shell commands default to _system unless explicitly set.

If you want to inspect what xr would do without changing project files, use the --dry-run option.
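For example (the script path is illustrative):

```shell
calkit xr --dry-run scripts/run.py
```

This should report the stage and environment xr would create without writing to calkit.yaml.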