Version control
Version control is one of the pillars of reproducibility. However, the current de facto version control system (VCS), Git, was designed primarily for complex software development projects, and thus provides lots of control and flexibility, but with a daunting learning curve.
Since most of the value of using version control for research projects comes from simply saving checkpoints of the project files so it's clear if they have changed, or so they can be reverted if something is broken, Calkit provides a simplified interface that focuses on just that. Additionally, since Git was not designed for large and/or binary files, Calkit uses DVC to version these file types.
GitHub is currently the most popular location to back up Git repositories, or repos, in the cloud but, like Git, it is primarily designed for software development. Similar to how Calkit is a layer on top of Git, the Calkit Cloud (calkit.io) integrates with GitHub to add a more purpose-built interface for research projects. It also serves as a default DVC remote, so users are not required to provision their own.
Though Calkit adds a simplified interface on top of Git and DVC, the lower-level tools git and dvc can be used if desired, e.g., for more complex operations.
Typical workflow
In order to start working on a project, the project repository must exist on your local machine. This can be achieved either by creating a new repo or downloading, or "cloning," an existing one from the cloud. After a repo exists on your local machine, it is typical to repeat the cycle of committing new or changed files with a message describing them, and then pushing those commits to the cloud. This can be achieved with three workflow variants that trade off automation for control.
The simplest and most hands-off variant uses calkit save, which will automatically decide which files belong in Git, which belong in DVC, and which don't belong in either, then commit them and push them to the cloud, all with a single command:
graph LR
init[calkit clone
or
calkit new project]
--> edit[create or
edit files]
--> save[calkit save -am
'Your message here']
--> edit
The -a flag indicates that we want to save all relevant files, and the -m flag indicates that we are providing a message describing the changes. If the message is omitted, Calkit will prompt the user for one.
If more control is desired, the save step can be broken down into add, commit, and push steps.
graph LR
init[calkit clone
or
calkit new project]
--> edit[create or
edit files]
--> add["calkit add
{path1}
{path2}
..."]
--> commit[calkit commit -m
'Your message here']
--> push[calkit push]
--> edit
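The manual sequence in the diagram above can also be scripted. The following is a minimal sketch, not part of Calkit itself: the helper name `manual_save_commands` and the example file paths are hypothetical, and the function only assembles the `calkit add`, `calkit commit -m`, and `calkit push` invocations so they can be inspected or handed to `subprocess.run`.

```python
import shlex
from typing import List, Sequence


def manual_save_commands(
    paths: Sequence[str], message: str, push: bool = True
) -> List[List[str]]:
    """Build the add -> commit -> push command sequence.

    Hypothetical helper for illustration; it does not execute anything.
    """
    if not paths:
        raise ValueError("at least one path is required for 'calkit add'")
    commands = [
        ["calkit", "add", *paths],
        ["calkit", "commit", "-m", message],
    ]
    if push:
        # Pushing is a separate step, so it can be deferred if desired.
        commands.append(["calkit", "push"])
    return commands


if __name__ == "__main__":
    # Example paths and message are placeholders.
    for cmd in manual_save_commands(
        ["paper.tex", "data/raw.csv"], "Add draft and raw data"
    ):
        print(shlex.join(cmd))
```

Keeping the commands as lists of arguments avoids shell-quoting problems with commit messages that contain spaces or quotes.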
The add step can be skipped for files that have been previously committed by adding the -a flag:
graph LR
init[calkit clone
or
calkit new project]
--> edit[create or
edit files]
--> commit[calkit commit -am
'Your message here']
--> push[calkit push]
--> edit
If you have a collaborator working on the same project, any time they have pushed commits, you will need to pull before being able to push yourself.
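To see why a pull is needed first, the toy model below reduces each repo to an ordered list of commit IDs. It is a deliberately simplified sketch, assuming a push is accepted only when the local history extends the remote's history; the `Repo`, `push`, and `pull` names are hypothetical, and real Git merges or rebases divergent histories rather than simply replaying commits.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Repo:
    """A repo reduced to an ordered list of commit IDs (toy model)."""

    commits: List[str] = field(default_factory=list)


def push(local: Repo, remote: Repo) -> bool:
    """Accept the push only if local history extends the remote's history."""
    n = len(remote.commits)
    if local.commits[:n] != remote.commits:
        return False  # the remote has commits we have not pulled yet
    remote.commits = list(local.commits)
    return True


def pull(local: Repo, remote: Repo) -> None:
    """Take the remote's commits, then replay local-only commits on top."""
    local_only = [c for c in local.commits if c not in remote.commits]
    local.commits = list(remote.commits) + local_only


if __name__ == "__main__":
    remote = Repo(["c1"])
    you = Repo(["c1"])
    collaborator = Repo(["c1"])

    collaborator.commits.append("c2-theirs")
    push(collaborator, remote)  # accepted: nothing new on the remote

    you.commits.append("c2-yours")
    print(push(you, remote))  # False: the remote is ahead of you
    pull(you, remote)
    print(push(you, remote))  # True: your history now extends the remote's
```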
Command reference
To view the help for any of these commands, execute calkit {command} --help.
clone
calkit clone will download and create a local copy of the project, set up the default Calkit DVC remote, and pull any files versioned with DVC.
The multi-step equivalent would be:
git clone
calkit config remote
dvc pull
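The three manual steps can be laid out with their working directories, as in the sketch below. Two things here are assumptions rather than documented behavior: that `calkit config remote` and `dvc pull` are run from inside the freshly cloned directory, and that the directory name follows Git's default of the last URL component minus any `.git` suffix. The helper names and the example URL are hypothetical.

```python
from typing import List, Tuple


def repo_dir_from_url(url: str) -> str:
    """Return the directory name `git clone` picks by default (common URL forms)."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    name = name.rsplit(":", 1)[-1]  # handle scp-style "host:repo.git"
    return name[:-4] if name.endswith(".git") else name


def clone_steps(url: str) -> List[Tuple[List[str], str]]:
    """Return (command, working directory) pairs for the manual clone."""
    repo_dir = repo_dir_from_url(url)
    return [
        (["git", "clone", url], "."),
        # Assumed to run inside the cloned project directory.
        (["calkit", "config", "remote"], repo_dir),
        (["dvc", "pull"], repo_dir),
    ]


if __name__ == "__main__":
    # Placeholder URL, not a real project.
    for cmd, cwd in clone_steps("https://github.com/someuser/some-project.git"):
        print(f"(in {cwd}) {' '.join(cmd)}")
```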
If the project is hosted on the Calkit Cloud, it can be referenced by name rather than Git repo URL. For example:
status
calkit status will show the combined status from both Git and DVC. For example:
$ calkit status
--------------------------- Code (Git) ---------------------------
On branch main
nothing to commit, working tree clean
--------------------------- Data (DVC) ---------------------------
No changes.
------------------------- Pipeline (DVC) -------------------------
Data and pipelines are up to date.
save
calkit save will create a commit and push to the remotes in one step. It will automatically ignore any files it deems inappropriate to save in version control. This provides the most automated and hands-off experience but gives the least control.
Options:
[PATHS]...: Specify a list of paths to save. Not required if --all is specified.
--all, -a: Save all paths.
--to, -t: Manually specify git or dvc as the tracking mechanism.
--message, -m: Specify a commit message. If omitted, the user will be prompted for one.
--no-push: Do not push after committing.
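To illustrate how these options combine, including the shorthand -am used earlier, the sketch below mirrors them in a stand-in parser built with Python's argparse. This is only an approximation for illustration: Calkit's actual command-line implementation may be built differently, and the function name `build_save_parser` and attribute name `save_all` are hypothetical.

```python
import argparse


def build_save_parser() -> argparse.ArgumentParser:
    """Mirror the documented `calkit save` options in a stand-in parser."""
    parser = argparse.ArgumentParser(prog="calkit save")
    parser.add_argument("paths", nargs="*", metavar="PATHS")
    parser.add_argument("--all", "-a", dest="save_all", action="store_true")
    parser.add_argument("--to", "-t", choices=["git", "dvc"])
    parser.add_argument("--message", "-m")
    parser.add_argument("--no-push", action="store_true")
    return parser


if __name__ == "__main__":
    parser = build_save_parser()
    # "-am" is "-a" followed by "-m", with the next argument as the message.
    args = parser.parse_args(["-am", "Your message here"])
    print(args.save_all, args.message, args.no_push)
```

Short flags can be joined this way only when, at most, the last one takes a value, which is why -am works while -ma would treat "a" as the message.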
add
calkit add will add a file to the repo "staging area," which sets it up to be committed. Calkit will determine, based on the file's type and size, whether it should be tracked with Git or DVC and act accordingly.
Options:
--to, -t: Manually specify git or dvc as the tracking mechanism.
--commit-message, -m: Create a commit after adding and use the provided message.
--auto-commit-message, -M: Commit with an automatically generated message. Only compatible when adding one path.
--push: Push to the Git or DVC remote after committing.
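The type-and-size decision made by calkit add can be pictured with a sketch like the one below. The size threshold, the extension list, and the function name `choose_tracker` are all assumptions chosen for illustration; Calkit's actual rules are not specified here and may differ. When the automatic choice is not the desired one, the --to option overrides it.

```python
from pathlib import Path

# Assumed values for illustration only; Calkit's real rules may differ.
SIZE_THRESHOLD_BYTES = 1_000_000
BINARY_EXTENSIONS = {".png", ".jpg", ".pdf", ".h5", ".parquet", ".zip"}


def choose_tracker(path: str, size_bytes: int) -> str:
    """Return "dvc" for large or binary-looking files, otherwise "git"."""
    if Path(path).suffix.lower() in BINARY_EXTENSIONS:
        return "dvc"  # binary formats do not diff or merge well in Git
    if size_bytes > SIZE_THRESHOLD_BYTES:
        return "dvc"  # large files bloat Git history
    return "git"


if __name__ == "__main__":
    # Hypothetical files and sizes.
    for path, size in [
        ("analysis.py", 4_000),
        ("figures/plot.png", 200_000),
        ("data/big.csv", 50_000_000),
    ]:
        print(path, "->", choose_tracker(path, size))
```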