The pipeline
The pipeline
defines and ties together the processes that produce
the project's important assets or artifacts, such as datasets,
figures, tables, and publications.
It is saved in the pipeline section of the calkit.yaml file,
and is compiled to a DVC pipeline (saved in dvc.yaml)
when calkit run is called.
A pipeline is composed of stages,
each of which has a specific type or "kind."
Each stage must specify the environment in which it runs to ensure it's
reproducible.
Calkit automatically generates an "environment lock file"
at the start of a run,
so it can detect when an environment has changed
and the affected stages need to be rerun.
Stages can also define inputs and outputs,
and you can decide how you'd like outputs to be stored, i.e., with Git or DVC.
Any stages that have not changed since they were last run will be skipped, since their results will have been cached.
In the calkit.yaml file, you can define a pipeline
(and environments) like:
# Define environments
environments:
  main:
    kind: uv-venv
    path: requirements.txt
    python: "3.13"
  texlive:
    kind: docker
    image: texlive/texlive:latest-full

# Define the pipeline
pipeline:
  stages:
    collect-data:
      kind: python-script
      script_path: scripts/collect-data.py
      environment: main
      outputs:
        - data/raw.csv
        - path: data/meta.json
          storage: git
          delete_before_run: false
    process-data:
      kind: jupyter-notebook
      notebook_path: notebooks/process.ipynb
      environment: main
      inputs:
        - data/raw.csv
      outputs:
        - data/processed.csv
        - figures/fig1.png
    build-paper:
      kind: latex
      target_path: paper/paper.tex
      environment: texlive
      inputs:
        - figures/fig1.png
        - references.bib
Stage types and unique attributes
All stage declarations require a kind and an environment,
and can specify inputs and outputs.
The different kinds of stages and their unique attributes are listed below.
For more details, see calkit.models.pipeline.
python-script
- script_path
- args (list, optional)
shell-command
- command
- shell (optional, e.g., bash, sh, zsh; default: bash)
shell-script
- script_path
- shell (optional, e.g., bash, sh, zsh; default: bash)
- args (list, optional)
matlab-script
- script_path
latex
- target_path
docker-command
- command
r-script
- script_path
- args (list, optional)
julia-script
- script_path
- args
julia-command
- command
sbatch
- script_path
- args
- sbatch_options
This stage type runs a script with sbatch, which is a common way to run
jobs on a high performance computing (HPC) cluster that uses the SLURM
job scheduler.
Iteration
Over a list of values
pipeline:
  stages:
    my-iter-stage:
      kind: python-script
      script_path: scripts/my-script.py
      args:
        - "--model={var}"
      iterate_over:
        - arg_name: var
          values:
            - some-model
            - some-other-model
      inputs:
        - data/raw
      outputs:
        - models/{var}.h5
Over a table (or list of lists)
pipeline:
  stages:
    my-iter-stage:
      kind: python-script
      script_path: scripts/my-script.py
      args:
        - "--model={var1}"
        - "--n_estimators={var2}"
      iterate_over:
        - arg_name: [var1, var2]
          values:
            - [some-model, 5]
            - [some-other-model, 7]
      inputs:
        - data/raw
      outputs:
        - models/{var1}-{var2}.h5
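To make the placeholder substitution concrete, here is a minimal Python sketch of how one row of values could be bound to the named arguments and filled into the args and outputs templates. It is an illustration only, not Calkit's implementation; the function name expand_iteration is hypothetical, and the sketch assumes plain {name} substitution as shown in the examples above.

```python
from typing import Any


def expand_iteration(
    arg_names: list[str], rows: list[list[Any]], templates: list[str]
) -> list[list[str]]:
    """Fill every template once per row, binding arg_names to that row's values.

    Hypothetical helper for illustration; not part of Calkit.
    """
    expanded = []
    for row in rows:
        if len(row) != len(arg_names):
            raise ValueError(f"Row {row!r} does not match names {arg_names!r}")
        bindings = dict(zip(arg_names, row))
        expanded.append([t.format(**bindings) for t in templates])
    return expanded


# Templates and values copied from the table example above
templates = ["--model={var1}", "--n_estimators={var2}", "models/{var1}-{var2}.h5"]
rows = [["some-model", 5], ["some-other-model", 7]]
for filled in expand_iteration(["var1", "var2"], rows, templates):
    print(filled)
```

Under these assumptions, each row yields one concrete set of arguments and one output path, so the two rows above correspond to two runs of the script.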
Over ranges of numbers
pipeline:
  stages:
    my-iter-stage:
      kind: python-script
      script_path: scripts/my-script.py
      args:
        - "--thresh={thresh}"
      iterate_over:
        - arg_name: thresh
          values:
            - range:
                start: 0
                stop: 20
                step: 0.5
            - range:
                start: 30
                stop: 35
                step: 1
            - 41
      inputs:
        - data/raw
      outputs:
        - results/{thresh}.csv
Automatic stage and environment detection
The calkit xr command, which stands for "execute and record,"
can be used to automatically generate pipeline stages and environments from
scripts (Python, MATLAB, Julia, R, and shell),
notebooks, LaTeX source files, or shell commands.
For example, if you have a Python script in scripts/run.py, you can
call:
calkit xr scripts/run.py
Calkit will attempt to detect the environment in which this script should run,
creating one if necessary (it can also be specified with the -e flag).
Calkit will then try to detect inputs and outputs
and attempt to run the stage it created.
If successful, it will be added to the pipeline and kept reproducible from
that point onwards.
That is, calling calkit run again will detect if the script, environment,
or any input files have changed, and rerun if so.
What commands work best with xr
xr works best when your command has a clear executable and arguments,
or when the first argument is a recognized file type (for example .py,
.ipynb, .tex, .jl, .R, .m, .sh).
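As a rough illustration of why a recognized file type helps, the sketch below maps the extensions listed above to the stage kinds named on this page. The mapping and the function name guess_stage_kind are assumptions made for this example (including the case-insensitive match), not Calkit's actual lookup.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping assembled from the file types and stage kinds on this page
EXTENSION_TO_KIND = {
    ".py": "python-script",
    ".ipynb": "jupyter-notebook",
    ".tex": "latex",
    ".jl": "julia-script",
    ".r": "r-script",
    ".m": "matlab-script",
    ".sh": "shell-script",
}


def guess_stage_kind(first_arg: str) -> Optional[str]:
    """Return a stage kind for a recognized file extension, or None otherwise."""
    # Assumption: extensions are matched case-insensitively, so .R and .r agree
    suffix = PurePosixPath(first_arg).suffix.lower()
    return EXTENSION_TO_KIND.get(suffix)


print(guess_stage_kind("scripts/run.py"))
print(guess_stage_kind("paper/paper.tex"))
```

A first argument with no recognized extension returns None here, which corresponds to the case where the command needs a clear executable and arguments instead.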
For Docker commands:
- docker run commands are supported.
- For some CLI-style images (for example Mermaid CLI), Calkit converts the command into a command stage and configures Docker entrypoint mode.
- For other images, Calkit keeps a shell-command stage, infers a Docker environment from the image, and stores the inner command (the command run inside the container) as the stage command.
What I/O xr can usually detect
I/O detection is heuristic and depends on stage kind. It is strongest for:
- Python/R/Julia scripts with common file read/write APIs.
- Notebooks with straightforward file reads/writes.
- LaTeX includes and bibliography references.
- Shell commands that use redirection (<, >, >>) and common file operations (for example cp and mv).
For Docker shell commands, I/O detection is applied to the inner command
inside docker run, not the outer Docker wrapper.
I/O detection is less reliable when paths are dynamic (constructed at runtime, read from environment variables, generated in loops, or hidden behind custom wrappers).
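The following Python sketch shows, under simplifying assumptions, why redirection and cp/mv are easy to analyze statically while runtime-built paths are not. It is not Calkit's detection code: detect_shell_io is a hypothetical name, and the sketch only handles a single simple command whose redirection operators are separated by whitespace.

```python
import shlex


def detect_shell_io(command: str) -> tuple[list[str], list[str]]:
    """Guess (inputs, outputs) from redirection and cp/mv in one simple command.

    Hypothetical helper for illustration; not part of Calkit.
    """
    tokens = shlex.split(command)
    inputs: list[str] = []
    outputs: list[str] = []
    words: list[str] = []  # tokens that are not part of a redirection
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("<", ">", ">>") and i + 1 < len(tokens):
            # "<" reads from the next token; ">" and ">>" write to it
            (inputs if tok == "<" else outputs).append(tokens[i + 1])
            i += 2
            continue
        words.append(tok)
        i += 1
    # cp/mv SRC... DEST: sources are read, the destination is written
    if words and words[0] in ("cp", "mv"):
        paths = [w for w in words[1:] if not w.startswith("-")]
        if len(paths) >= 2:
            inputs.extend(paths[:-1])
            outputs.append(paths[-1])
    return inputs, outputs


print(detect_shell_io("sort < data/raw.csv > data/sorted.csv"))
print(detect_shell_io('python run.py "$OUT_DIR"'))
```

The second call finds nothing, because the output location only exists once the environment variable is expanded at runtime. That is the kind of case where explicit paths are needed.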
When needed, provide explicit paths with:
- --input (repeatable)
- --output (repeatable)
- --no-detect-io to disable automatic detection completely
How environment detection works
At a high level, xr chooses environments in this order:
- Use --environment if provided.
- Reuse an existing matching stage environment when possible.
- Infer from stage language and dependencies:
  - Python: typically pyproject.toml, requirements.txt, environment.yml, or a generated Python environment spec.
  - R: typically DESCRIPTION or a generated renv spec.
  - Julia: typically Project.toml or a generated Julia project spec.
  - LaTeX: typically a Docker LaTeX environment.
- For shell commands:
  - docker run ... can infer a Docker environment from the image.
  - non-Docker shell commands default to _system unless explicitly set.
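The selection order above can be read as a simple priority chain. The Python sketch below is one way to picture it, not Calkit's implementation: the function name choose_environment, the returned description strings, and the priority among the Python spec files are all assumptions made for this example, and the Docker image case for shell commands is left out for brevity.

```python
from typing import Optional

# Hypothetical spec files per language, taken from the list above.
# Assumption: files are checked in the order written here.
SPEC_FILES = {
    "python": ["pyproject.toml", "requirements.txt", "environment.yml"],
    "r": ["DESCRIPTION"],
    "julia": ["Project.toml"],
}


def choose_environment(
    language: str,
    project_files: set[str],
    explicit: Optional[str] = None,
    existing_match: Optional[str] = None,
) -> str:
    """Walk the order: explicit option, existing match, then inference."""
    if explicit is not None:
        return explicit  # an explicitly named environment wins
    if existing_match is not None:
        return existing_match  # reuse a matching stage environment
    if language == "latex":
        return "docker LaTeX environment"
    if language == "shell":
        return "_system"  # non-Docker shell commands
    for spec in SPEC_FILES.get(language, []):
        if spec in project_files:
            return f"environment from {spec}"
    return f"generated {language} environment spec"


print(choose_environment("python", {"requirements.txt"}))
print(choose_environment("python", {"requirements.txt"}, explicit="main"))
```

The point of the sketch is the ordering: inference from project files only happens when neither an explicit environment nor a reusable existing one is available.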
If you want to inspect what xr would do without changing project files,
use the --dry-run option.