Running and logging

When the pipeline is run with the calkit run command, Calkit will first compile the pipeline into a DVC pipeline with some additional stages to handle environment checking.

Calkit also collects important system information such as foundational dependency versions to log for the sake of traceability and saves to JSON files in .calkit/systems.

While the run is executing, DVC logs will be sent into .calkit/logs and after it's complete, run metadata will be saved to a JSON file in .calkit/runs. Again, these can be helpful for traceability and diagnosing reproducibility issues down the road, e.g., if the project is being run on multiple machines and the results are different between them.

The run metadata can be queried and analyzed, for example, with DuckDB:

import duckdb

duckdb.sql(
    """
        select
            system.node_id,
            system.calkit_version,
            cast(end_time as timestamp)
                - cast(start_time as timestamp) duration,
            status
            from '.calkit/runs/*.json' run
            left join '.calkit/systems/*.json' system
                on run.system_id = system.id
            where run.dvc_args = '[]'
    """
)

┌─────────────────┬────────────────┬─────────────────┬─────────┐
│     node_id     │ calkit_version │    duration     │ status  │
│      int64      │    varchar     │    interval     │ varchar │
├─────────────────┼────────────────┼─────────────────┼─────────┤
│ 138587250590302 │ 0.26.0         │ 00:00:02.481729 │ success │
│ 138587250590302 │ 0.26.0         │ 00:00:02.57566  │ success │
│ 138587250590302 │ 0.26.0         │ 00:00:04.728486 │ success │
│ 138587250590302 │ 0.26.0         │ 00:00:08.676203 │ success │
└─────────────────┴────────────────┴─────────────────┴─────────┘