Declarative, structured prompting to make AI agents work reproducibly

AI chatbots and coding agents have sped up the production of sophisticated research artifacts like figures. However, without proper instruction, or prompting, they will readily generate irreproducible outputs, much like a human can with an interactive tool. In this tutorial we'll show how to use a calkit.yaml file as a kind of declarative, structured prompt that provides the skeleton on which AI coding agents can generate artifacts, while keeping the process easy to verify and iterate on.

If you don't have a Calkit project already, you can create one with calkit new project. If you've already been working in an AI chat session, you can instead ask it to export a Calkit project for the work it's done so far, with a prompt as short as:

Export a Calkit project that produces this figure.

After that, open up the project in, e.g., VS Code, and start iterating on it with a coding agent like Claude Code, GitHub Copilot, or Codex.

However, there is one important thing to check about the exported project: that the bot provides all of the primary artifacts, not just the secondary or derived ones. This can get a little tricky. An exported script is a primary artifact, but if it hard-codes data the bot collected from elsewhere, that data is a derived artifact whose origin is not captured in the project. This can be seen in this Gemini chat and the code it shared:

import matplotlib.pyplot as plt
import numpy as np

# Data representing a decade of cumulative growth trends in Los Angeles County (Indexed to 2012 = 100%)
# Based on California EDD (Employment) and US Census/HUD/SCAG housing production data tracking.
years = np.array([2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022])

# Cumulative Job Growth (%)
job_growth = np.array([0.0, 2.3, 4.8, 7.5, 9.8, 11.4, 13.1, 14.5, 6.2, 10.1, 13.8])

# Cumulative Housing Unit Growth (%)
housing_growth = np.array([0.0, 0.4, 0.9, 1.5, 2.2, 3.0, 3.8, 4.6, 5.3, 6.1, 6.8])

plt.figure(figsize=(10, 6))
plt.plot(years, job_growth, label='Cumulative Job Growth (%)', color='#1f77b4', linewidth=2.5, marker='o')
plt.plot(years, housing_growth, label='Cumulative Housing Growth (%)', color='#ff7f0e', linewidth=2.5, marker='s')

plt.title('The Jobs-Housing Mismatch in Los Angeles County\n(Cumulative Growth % Baseline: 2012)', fontsize=14, fontweight='bold', pad=15)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Cumulative Growth (%)', fontsize=12)
plt.xticks(years)
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend(fontsize=11, loc='upper left')

# Annotating the visual divergence point
plt.annotate('COVID-19 Job Shocks', xy=(2020, 6.2), xytext=(2017, 7.5),
             arrowprops=dict(facecolor='black', shrink=0.05, width=1, headwidth=6),
             fontsize=10, bbox=dict(boxstyle="round,pad=0.3", fc="yellow", alpha=0.3))

plt.tight_layout()
plt.savefig('la_jobs_vs_housing.png', dpi=300)
print("Plot successfully created and saved as la_jobs_vs_housing.png")

Gemini cited a source for the data, but checking that the hard-coded numbers actually match that source would take quite a bit of work. By our quantitative measure of reproducibility, then, this project could be better.

The problem can be even more subtle: the bot may tell you it saved a CSV file of data it collected from various sources, as seen in this chat with Claude. Just as when reviewing an output generated by a human, it's much better to be given something runnable that produces the output, i.e., "don't tell me, show me." We should therefore ask the bot to create a pipeline stage that collects the data, so we know its full provenance in a deterministic sense.
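As a minimal sketch of what such a collection stage could run, the script below downloads a raw table and derives the cumulative growth series from it instead of hard-coding the numbers. The URL, the column names (year, jobs), the output path, and all function names are hypothetical placeholders rather than anything taken from the chats above; the real source the bot cited would need to be substituted.

```python
import csv
import io
import urllib.request
from pathlib import Path

# Hypothetical placeholder; replace with the actual source the bot cited.
SOURCE_URL = "https://example.com/la-county-employment.csv"


def fetch_text(url: str) -> str:
    """Download the raw CSV so the pipeline records where the data came from."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def cumulative_growth(raw_csv: str, value_column: str) -> list[dict]:
    """Return percent growth relative to the first row, one entry per year."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    base = float(rows[0][value_column])
    return [
        {
            "year": int(row["year"]),
            "growth_pct": round(100.0 * (float(row[value_column]) / base - 1.0), 1),
        }
        for row in rows
    ]


def write_csv(rows: list[dict], path: str) -> None:
    """Write the derived series so downstream stages read it from disk."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["year", "growth_pct"])
        writer.writeheader()
        writer.writerows(rows)


def collect(url: str, out_path: str) -> None:
    """Fetch, derive, and save: the whole stage in one call."""
    write_csv(cumulative_growth(fetch_text(url), "jobs"), out_path)
```

A stage script would end with a call such as collect(SOURCE_URL, "data/job-growth.csv"). The plotting script can then read that file, so every number in the figure traces back to a step the pipeline can rerun.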

After you have a Calkit project started, you can prompt an agent by, e.g., declaring a new figure in calkit.yaml, like:

figures:
  - path: figures/housing-units-over-time.png
    title: Housing production over time
    description: A plot of the number of housing units produced each year
      in every major US city since 1930.
    stage: plot-housing-over-time

Then, tell the agent to create the stage that produces that figure. If you want it to export a derived dataset, you can declare that in the datasets list. You can tell it to create and run the stage in one go as well. With the agent skills installed, it will know what to do and can iterate on your pipeline quite efficiently.

If desired, you can fill in as much detail yourself as you'd like. For example, if you know you want the plotting done in a Jupyter notebook:

pipeline:
  stages:
    plot-housing-over-time:
      kind: jupyter-notebook
      notebook_path: notebooks/housing.ipynb
      description: Plots housing production over time.
      outputs:
        - figures/housing-units-over-time.png

Then, simply tell the agent:

Finish the plot-housing-over-time stage and run it.

This way of working is analogous to a declarative, rather than imperative, programming style. Instead of telling the agent all of the steps to take, we declare the desired end result in a structured way and let the agent fill in the details. Since this happens in the context of a Calkit project, once the agent is done creating and running the pipeline, we can try rerunning it or check its status. All of the steps taken, including the creation of computational environments, are embedded in the project, so there is no ambiguity. That is, we don't need to trust that the AI agent did what it said it did, just as when a human ships their outputs along with a Calkit project.
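Because the declaration is structured, it can also be checked mechanically. The following is a rough sketch and not part of Calkit itself: a hypothetical helper, find_figure_problems, takes the parsed contents of calkit.yaml as a plain dictionary (loaded with whatever YAML library you prefer) and reports figures whose stage is missing or does not list the figure among its outputs. It assumes only the keys that appear in the examples above, and that outputs are plain path strings.

```python
def find_figure_problems(config: dict) -> list[str]:
    """List figures whose declared stage is missing or doesn't output them."""
    stages = (config.get("pipeline") or {}).get("stages") or {}
    problems = []
    for fig in config.get("figures") or []:
        path = fig.get("path")
        stage_name = fig.get("stage")
        if stage_name is None:
            problems.append(f"{path}: no stage declared")
        elif stage_name not in stages:
            problems.append(f"{path}: stage '{stage_name}' is not defined")
        elif path not in (stages[stage_name].get("outputs") or []):
            problems.append(f"{path}: not listed in outputs of '{stage_name}'")
    return problems


# The same declarations as in the YAML examples, written as a dictionary.
config = {
    "figures": [
        {
            "path": "figures/housing-units-over-time.png",
            "title": "Housing production over time",
            "stage": "plot-housing-over-time",
        }
    ],
    "pipeline": {
        "stages": {
            "plot-housing-over-time": {
                "kind": "jupyter-notebook",
                "notebook_path": "notebooks/housing.ipynb",
                "outputs": ["figures/housing-units-over-time.png"],
            }
        }
    },
}

print(find_figure_problems(config))  # prints []
```

A check like this only confirms that the declarations agree with each other. Whether the outputs are actually up to date is still a question for the pipeline's own status check.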

After the pipeline is built, iteration is easy. If you don't like how a plot looks, tell the agent how to change it. It will change the code and run the pipeline in one go. You can then look at the output and run calkit status to be sure the agent did what it claimed.