Harness Engineering

BSW.DevSpark's harness runtime is an optional CLI execution layer for repeatable engineering workflows. It is additive: the prompt-first slash-command workflow remains unchanged, while the CLI adds a way to validate, execute, and inspect declarative workflow specs.

This page documents what is currently implemented in the repository.

CLI required — Everything on this page requires the BSW.DevSpark CLI. Install it once with:

uv tool install devspark-cli --force --find-links https://bsw-devspark.bswhive.com/dist/cli/

Prompt-first slash commands (/devspark.*) work without the CLI and are documented in the Implementation Lifecycle Guide.


devspark.run — Development Workflow Aliases

devspark run <alias> is the fastest path through the spec-driven development cycle when the CLI is installed. It chains atomic prompts into a single terminal command, pauses at review checkpoints where the alias defines them, and produces structured artifact output.

These are CLI-only commands. There is no /devspark.run slash command and no backing file exists in .claude/commands/. Without the CLI, run the atomic commands manually in sequence (see Without the CLI below).

Available Aliases

Alias | Chains | Pause point | Output
create-spec | specify → plan → tasks → analyze | After analyze (review before implementing) | Reviewable spec artifact
execute-plan | implement → create-pr → pr-review | After create-pr (confirm the PR before review runs) | Pull request
suggest-improvement | capture-context → classify-improvement → create-issue → (assign-agent) → (implement) | None by default; pass --yes to skip the confirmation prompt | GitHub issue link
full-cycle | specify → plan → tasks → critic → analyze → tasks (remediate) → implement → create-pr → pr-review | None; relies on autonomy.level: autonomous with guardrails instead | Pull request

full-cycle is intentionally different from the other three: it has no pause_after/review_after steps at all. Safety comes from guardrails (file/line/path limits — see Autonomy Model), which hard-block rather than pause under autonomy.level: autonomous. Like the others, devspark run full-cycle only sequences steps and tracks telemetry — it expects an agent already driving the conversation to execute each one. For execution with no agent watching, see Full Unattended Lifecycle below.
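The chain-and-pause behavior in the table can be pictured as a small sequencer. This is a minimal sketch under stated assumptions: the ALIASES structure and run_until_pause helper are hypothetical illustrations, not the CLI's internal representation; suggest-improvement is left out because its bracketed steps are optional; and "running" a step is reduced to recording its name, since the real alias hands each step to the driving agent.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Alias:
    steps: list[str]
    # Hypothetical: the step after which the run pauses (None = never pauses).
    pause_after: str | None = None


# Hypothetical encoding of the alias table above.
ALIASES = {
    "create-spec": Alias(["specify", "plan", "tasks", "analyze"], pause_after="analyze"),
    "execute-plan": Alias(["implement", "create-pr", "pr-review"], pause_after="create-pr"),
    "full-cycle": Alias(
        ["specify", "plan", "tasks", "critic", "analyze",
         "tasks", "implement", "create-pr", "pr-review"],
        pause_after=None,  # guardrails hard-block instead of pausing
    ),
}


def run_until_pause(alias: str) -> tuple[list[str], list[str]]:
    """Return (steps sequenced now, steps left for a later resume)."""
    spec = ALIASES[alias]
    done: list[str] = []
    for index, step in enumerate(spec.steps):
        done.append(step)  # a real run hands this step to the agent
        if step == spec.pause_after:
            return done, spec.steps[index + 1:]
    return done, []
```

For example, run_until_pause("execute-plan") stops after create-pr and leaves pr-review for the resumed run, while full-cycle sequences all nine steps in one go.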

Usage

# Start a new feature from scratch
devspark run create-spec

# Execute an existing plan through to a reviewed PR
devspark run execute-plan

# File a workflow improvement against bswhealth/bsw.devspark
devspark run suggest-improvement
devspark run suggest-improvement --yes    # skip confirmation prompt

Pause and Resume

create-spec and execute-plan pause at defined checkpoints so a human can review before the workflow continues. When a pause fires, the CLI prints the exact resume command:

devspark resume <run_id>

Pause state is saved at .documentation/telemetry/runs/<run_id>.json. On resume, BSW.DevSpark verifies the persisted schema_version, workflow_id, and context_checksum — any mismatch exits with code 25 (EXIT_RESUME_FAILED).
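The resume check can be sketched as a field-by-field comparison. The three compared fields and exit code 25 come from the description above; the sha256-over-sorted-JSON checksum and the names context_checksum and verify_resume are assumptions for illustration only, not the CLI's actual code.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path

EXIT_RESUME_FAILED = 25  # documented exit code for a rejected resume


def context_checksum(context: dict) -> str:
    # Assumption: checksum is sha256 over canonical (sorted-key) JSON.
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_resume(run_file: Path, schema_version: str, workflow_id: str,
                  current_context: dict) -> int:
    """Return 0 if the paused run may resume, else EXIT_RESUME_FAILED."""
    saved = json.loads(run_file.read_text(encoding="utf-8"))
    expected = {
        "schema_version": schema_version,
        "workflow_id": workflow_id,
        "context_checksum": context_checksum(current_context),
    }
    for key, value in expected.items():
        if saved.get(key) != value:
            print(f"resume rejected: {key} mismatch")
            return EXIT_RESUME_FAILED
    return 0
```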

Active paused runs can be listed with:

devspark runs list

Full Development Cycle with devspark.run

The create-spec and execute-plan aliases cover the entire feature lifecycle when used in sequence:

devspark run create-spec
# → review analyze output, then resume or continue:
devspark run execute-plan
# → review and merge PR, then release:
devspark release <version>

For the full command order including release, see the Full Development → Release Cycle on the home page.

Without the CLI

Use the atomic slash commands directly in your agent:

devspark.run alias | Manual equivalent (no CLI required)
create-spec | /devspark.specify → /devspark.plan → /devspark.tasks → /devspark.analyze
execute-plan | /devspark.implement → /devspark.create-pr → /devspark.pr-review
suggest-improvement | /devspark.specify with improvement framing, then file the issue manually
full-cycle | /devspark.specify → /devspark.plan → /devspark.tasks → /devspark.critic → /devspark.analyze → /devspark.tasks (re-run, merges gate findings) → /devspark.implement → /devspark.create-pr → /devspark.pr-review

When to Use It

Use the harness runtime when you need terminal-driven execution, repeatable local validation, or a structured audit trail for a workflow that should run the same way more than once.

Good fits:

  • validate a harness spec before using it in a repeatable workflow
  • run a repo- or app-scoped engineering sequence and capture its artifacts
  • inspect why a prior run failed, retried, or aborted
  • verify adapter availability on a new machine

Less suitable fits:

  • ad hoc product work that already fits the prompt-first /devspark.* flow
  • one-off changes where a full execution spec would add more overhead than value

Command Surface

These are CLI commands, not slash commands.

devspark doctor
devspark harness validate sample.harness.yaml
devspark harness run sample.harness.yaml --dry-run
devspark harness trace latest
devspark adapter list
devspark adapter doctor
devspark adapter default claude_code

devspark doctor

Checks whether the current environment is ready for harness workflows.

Current checks include:

  • Python 3.11+
  • pydantic importability
  • compatible project layout
  • readable and valid agents-registry.json
  • git availability
  • required local CLIs for agent integrations that declare requires_cli

The command accepts both installed-project layouts with .devspark/ and compatible source checkouts with .documentation/, pyproject.toml, and src/devspark_cli/.
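The layout and tool checks above can be approximated in a few lines. This is a minimal sketch, assuming the markers named in the paragraph above are enough to classify a directory; detect_layout and basic_checks are hypothetical names, and the real doctor additionally validates agents-registry.json, pydantic, and per-agent CLIs.

```python
from __future__ import annotations

import shutil
import sys
from pathlib import Path


def detect_layout(root: Path) -> str | None:
    """Classify a directory as an installed project, a source checkout, or neither."""
    if (root / ".devspark").is_dir():
        return "installed"
    source_markers = [root / ".documentation", root / "pyproject.toml",
                      root / "src" / "devspark_cli"]
    if all(marker.exists() for marker in source_markers):
        return "source"
    return None


def basic_checks(root: Path) -> dict[str, bool]:
    # Each value is True when the corresponding doctor-style check passes.
    return {
        "python>=3.11": sys.version_info >= (3, 11),
        "git on PATH": shutil.which("git") is not None,
        "compatible layout": detect_layout(root) is not None,
    }
```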

devspark harness validate

Loads a YAML or JSON harness spec, validates it against the current Pydantic model and schema expectations, and exits without executing any steps.

Use it before committing a new spec or before a real run.

devspark harness run

Executes a harness spec sequentially, evaluates validations after each step, persists artifacts, and returns structured exit codes.

Important current behavior:

  • exit codes: 0 = complete, 1 = failed, 2 = aborted, 3 = validation error
  • --dry-run writes a run record without executing step actions
  • --adapter overrides the adapter for executable steps
  • --adapter-default uses the saved user default adapter when present
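When scripting around devspark harness run, the exit codes above can drive the caller's next action. A minimal sketch: the code-to-meaning mapping is the documented one, while describe_exit, run_harness, and their messages are illustrative assumptions.

```python
import subprocess

# Documented exit codes for `devspark harness run`.
HARNESS_EXIT_CODES = {
    0: "complete",
    1: "failed",
    2: "aborted",
    3: "validation error",
}


def describe_exit(code: int) -> str:
    return HARNESS_EXIT_CODES.get(code, f"unexpected exit code {code}")


def run_harness(spec: str, dry_run: bool = True) -> str:
    """Invoke the CLI and translate its exit code (requires devspark on PATH)."""
    cmd = ["devspark", "harness", "run", spec]
    if dry_run:
        cmd.append("--dry-run")
    result = subprocess.run(cmd, check=False)
    return describe_exit(result.returncode)
```

Usage, on a machine with the CLI installed: print(run_harness("sample.harness.yaml")).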

devspark harness trace

Reads events.jsonl from a prior run and renders the recorded event stream. Pass an explicit run ID, or latest for the most recent run.
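To post-process a trace outside the CLI, events.jsonl can be read as one JSON object per line. A minimal sketch, assuming runs live under the default .documentation/devspark/runs/ directory and that latest means the most recently modified run directory; the event fields are not documented here, so the code returns each object as-is. resolve_run_dir and read_events are hypothetical names.

```python
from __future__ import annotations

import json
from pathlib import Path


def resolve_run_dir(runs_root: Path, run_id: str) -> Path:
    if run_id != "latest":
        return runs_root / run_id
    # Assumption: "latest" = most recently modified run directory.
    candidates = [p for p in runs_root.iterdir() if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no runs under {runs_root}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


def read_events(run_dir: Path) -> list[dict]:
    events = []
    with (run_dir / "events.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                events.append(json.loads(line))
    return events
```

Usage: read_events(resolve_run_dir(Path(".documentation/devspark/runs"), "latest")).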

devspark adapter list

Lists the built-in adapters, whether each is available on the current machine, and the currently saved default.

devspark adapter doctor

Produces normalized readiness states for each adapter:

  • ready
  • write_approval_required
  • write_incompatible
  • unavailable

Use this before hands-off lifecycle runs to confirm the selected adapter can execute write-required stages without interactive approval.
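The readiness states lend themselves to a simple pre-flight gate before a hands-off run. A minimal sketch, assuming only ready is acceptable for unattended write-required stages (the paragraph above implies write_approval_required would stall on interactive approval); the Readiness enum and can_run_hands_off are hypothetical and do not parse the real adapter-doctor.json format.

```python
from __future__ import annotations

from enum import Enum


class Readiness(str, Enum):
    READY = "ready"
    WRITE_APPROVAL_REQUIRED = "write_approval_required"
    WRITE_INCOMPATIBLE = "write_incompatible"
    UNAVAILABLE = "unavailable"


def can_run_hands_off(states: dict[str, str], adapter: str) -> tuple[bool, str]:
    """Decide whether `adapter` can execute write stages without a prompt."""
    raw = states.get(adapter)
    if raw is None:
        return False, f"{adapter}: not reported by adapter doctor"
    state = Readiness(raw)  # raises ValueError on an unknown state string
    if state is Readiness.READY:
        return True, f"{adapter}: ready"
    return False, f"{adapter}: {state.value}"
```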

devspark adapter default

Persists a local default adapter in the user's config directory. This does not modify .devspark/ or .documentation/, so upgrades do not overwrite the preference.

Built-In Adapters

The current built-in adapters are:

  • noop
  • manual
  • claude_code
  • codex
  • copilot
  • cursor

noop

Safe default for contract tests, dry runs, and environments without an AI tool installed.

manual

Displays the prompt for a human operator and waits for an acknowledgement keypress. It requires a TTY. In non-interactive contexts it fails clearly instead of silently skipping the gate.
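Failing clearly in a non-interactive context amounts to checking for a terminal before prompting. A minimal sketch: manual_gate and NonInteractiveError are hypothetical names, and reading a line (Enter) stands in for the real adapter's acknowledgement keypress, whose exact handling may differ.

```python
from __future__ import annotations

import sys
from typing import TextIO


class NonInteractiveError(RuntimeError):
    """Raised instead of silently skipping a human gate."""


def manual_gate(prompt: str, stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    if not stdin.isatty():
        raise NonInteractiveError("manual adapter requires a TTY; refusing to skip the gate")
    stdout.write(prompt + "\n\nPress Enter to acknowledge... ")
    stdout.flush()
    stdin.readline()  # block until the operator acknowledges
```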

claude_code, codex, copilot, cursor

These adapters call the corresponding local CLI (claude, codex, copilot, cursor-agent) if it is installed. Prompt content is sent through standard input rather than as a command-line argument, which avoids Windows command-length issues for larger prompts. Run devspark adapter list to see which of these are actually available on the current machine.
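Passing the prompt on standard input rather than on the command line looks like this in Python. A minimal sketch: run_with_stdin_prompt is a hypothetical helper, and because the arguments each vendor CLI (claude, codex, copilot, cursor-agent) needs for non-interactive use are not documented here, the example uses the current Python interpreter as a stand-in agent.

```python
from __future__ import annotations

import subprocess
import sys


def run_with_stdin_prompt(argv: list[str], prompt: str, timeout: float = 600) -> str:
    """Send a (possibly large) prompt via stdin to avoid command-line length limits."""
    result = subprocess.run(
        argv,
        input=prompt,          # written to the child's stdin, not placed on argv
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,            # raise CalledProcessError on non-zero exit
    )
    return result.stdout


# Stand-in "agent": a Python one-liner that reports how many characters it received.
echo_len = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
```

A 100,000-character prompt passes through cleanly this way: run_with_stdin_prompt(echo_len, "x" * 100_000) returns "100000" followed by a newline.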

Spec Model

Harness specs are YAML or JSON documents with:

  • apiVersion: devspark.ai/v1
  • kind: HarnessSpec
  • name
  • scope
  • defaults
  • steps
  • telemetry
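A quick pre-check of the top-level shape can catch obvious mistakes before devspark harness validate applies the full Pydantic model. A minimal sketch on an already-loaded dictionary (for example, from PyYAML's yaml.safe_load). It assumes all seven keys are required and that steps is a list, which the list above implies but does not state; check_spec_shape is hypothetical and does not model the contents of scope, defaults, steps, or telemetry.

```python
from __future__ import annotations

REQUIRED_KEYS = ("apiVersion", "kind", "name", "scope", "defaults", "steps", "telemetry")


def check_spec_shape(spec: dict) -> list[str]:
    """Return a list of top-level problems (empty list = shape looks OK)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in spec]
    if spec.get("apiVersion") not in (None, "devspark.ai/v1"):
        problems.append(f"unexpected apiVersion: {spec['apiVersion']!r}")
    if spec.get("kind") not in (None, "HarnessSpec"):
        problems.append(f"unexpected kind: {spec['kind']!r}")
    if "steps" in spec and not isinstance(spec["steps"], list):
        problems.append("steps must be a list")  # assumption: steps is a sequence
    return problems
```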

The checked-in examples are sample.harness.yaml (minimal, demonstrates each validation rule type) and full-cycle.harness.yaml (the full specify-through-pr-review lifecycle as agent_task steps — see Full Unattended Lifecycle below).

Step types currently implemented:

  • agent_task
  • validation
  • human_gate

Validation rule types currently implemented:

  • always.pass
  • file.exists
  • file.contains
  • command.exit_code
  • json.schema
  • git.clean
  • regex.match
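To make the rule names concrete, here is how four of them could be evaluated. A minimal sketch: the function names and parameters are hypothetical (the real rule fields are defined by the spec model and are not shown on this page), and it illustrates only the check each rule name implies.

```python
from __future__ import annotations

import re
import subprocess
from pathlib import Path


def file_exists(path: str) -> bool:
    return Path(path).is_file()


def file_contains(path: str, needle: str) -> bool:
    p = Path(path)
    return p.is_file() and needle in p.read_text(encoding="utf-8")


def regex_match(text: str, pattern: str) -> bool:
    # MULTILINE so ^ and $ anchor at line boundaries inside documents.
    return re.search(pattern, text, flags=re.MULTILINE) is not None


def command_exit_code(command: str, expected: int = 0) -> tuple[bool, str]:
    """Run a shell command; return (passed, captured stdout), akin to stdout.txt."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.returncode == expected, result.stdout
```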

Scope Resolution

Harness runs support repository scope and application scope.

  • scope.type: repo writes under the repository's .documentation/devspark/runs/
  • scope.type: app requires a valid multi-app registry and resolves the documentation root through the existing scope system

Current guardrails:

  • the repository root is derived from the spec path, not the caller's current working directory
  • malformed or path-invalid multi-app registries fail clearly instead of being treated as missing
  • ambiguous scope resolution is surfaced as a harness spec error
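Deriving the repository root from the spec's location, rather than from the caller's working directory, can be sketched as an upward search. A minimal sketch, assuming a .git or .devspark entry marks the root; find_repo_root is hypothetical, and the real resolver also consults the multi-app registry for scope.type: app.

```python
from __future__ import annotations

from pathlib import Path

ROOT_MARKERS = (".git", ".devspark")  # assumption: what identifies a repo root


def find_repo_root(spec_path: str | Path) -> Path:
    """Walk up from the spec file's directory; the caller's CWD is never consulted."""
    start = Path(spec_path).resolve().parent
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in ROOT_MARKERS):
            return candidate
    raise FileNotFoundError(f"no repository root above {spec_path}")
```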

Run Artifacts

By default, telemetry writes to .documentation/devspark/runs/<run-id>/.

Current artifact layout includes:

  • spec.resolved.yaml
  • context.json
  • events.jsonl
  • result.json
  • adapter-doctor.json
  • decision-packet.json
  • steps/<step-id>/prompt.md when a prompt was materialized
  • steps/<step-id>/output.txt when an adapter produced output
  • steps/<step-id>/stdout.txt for command.exit_code validation output

Conditional artifacts:

  • no-change-explainer.md when workflow completed but delivery evidence was unmet
  • max-pass-failure-report.md when hands-off convergence reaches max passes without resolution

Retained runs are capped by a user-configurable limit, which defaults to 20.
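Retention can be pictured as keeping the newest N run directories and deleting the rest. A minimal sketch, assuming age is judged by directory modification time and that the limit counts run directories; prune_runs is hypothetical, and the real implementation's ordering and deletion policy may differ.

```python
from __future__ import annotations

import shutil
from pathlib import Path

DEFAULT_RETENTION = 20  # documented default


def prune_runs(runs_root: Path, keep: int = DEFAULT_RETENTION) -> list[Path]:
    """Delete all but the `keep` most recent run directories; return what was removed."""
    runs = sorted(
        (p for p in runs_root.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = runs[keep:]
    for run_dir in removed:
        shutil.rmtree(run_dir)
    return removed
```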

Retry and Validation Behavior

After each executable step, the runner evaluates the declared validations.

  • error-severity failures block success
  • warning-severity failures are recorded but do not block the run
  • retry policies can request another attempt on validation failure
  • retry repair prompts append a ## Validation Errors section to the next adapter prompt
  • requireHumanAfter can force a manual pause after a configured attempt count
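The retry rules above combine into a loop like the one below. A minimal sketch under stated assumptions: run_step, the Failure shape, and the repair-prompt bullet format are hypothetical; only the ## Validation Errors heading, the error/warning split, and the requireHumanAfter idea come from the list.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Failure:
    rule: str
    severity: str  # "error" or "warning"
    message: str


def repair_prompt(prompt: str, errors: list[Failure]) -> str:
    lines = [prompt, "", "## Validation Errors"]
    lines += [f"- {f.rule}: {f.message}" for f in errors]
    return "\n".join(lines)


def run_step(prompt: str,
             execute: Callable[[str], None],
             validate: Callable[[], list[Failure]],
             max_attempts: int = 3,
             require_human_after: int | None = None) -> str:
    """Return "passed", "failed", or "needs_human"."""
    current = prompt
    for attempt in range(1, max_attempts + 1):
        execute(current)
        # Warnings would be recorded but never block; only errors count here.
        errors = [f for f in validate() if f.severity == "error"]
        if not errors:
            return "passed"
        if require_human_after is not None and attempt >= require_human_after:
            return "needs_human"
        current = repair_prompt(prompt, errors)  # feed errors into the next attempt
    return "failed"
```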

If a run is interrupted, the current implementation preserves the artifacts already written and records the run as aborted.

Operator Guidance

Recommended flow for a new spec:

  1. Run devspark doctor on the target machine.
  2. Validate the spec with devspark harness validate <spec.yaml>.
  3. Run a dry run first with devspark harness run <spec.yaml> --dry-run.
  4. Inspect the generated artifacts and the resolved spec.
  5. Execute a real run only after the adapter and validation behavior are what you expect.

For adapter-driven runs, prefer explicit adapters in the spec when reproducibility matters across machines. Use a saved adapter default when you want a machine-local convenience setting.

Full Unattended Lifecycle

full-cycle.harness.yaml chains all nine lifecycle steps (specify → plan → tasks → critic → analyze → tasks (remediate) → implement → create-pr → pr-review) as agent_task steps, with one human_gate before create-pr that --hands-off bypasses automatically. Unlike devspark run full-cycle (see Available Aliases above), this actually executes each step non-interactively via the chosen adapter's CLI.

devspark adapter doctor                                              # confirm write-capable adapter
devspark harness validate full-cycle.harness.yaml
devspark harness run full-cycle.harness.yaml --dry-run                # resolve without executing
devspark harness run full-cycle.harness.yaml --adapter claude_code --hands-off

Validation rules deliberately avoid hardcoding a feature directory path — none exists yet when the spec is authored — and instead resolve the most recently modified .documentation/specs/*/ directory at validation time (command.exit_code rules using $(ls -td .documentation/specs/*/ | head -1)).
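The shell idiom $(ls -td .documentation/specs/*/ | head -1) selects the most recently modified non-hidden directory under .documentation/specs/. A Python equivalent, offered as a sketch (latest_spec_dir is a hypothetical name), is:

```python
from __future__ import annotations

from pathlib import Path


def latest_spec_dir(specs_root: str = ".documentation/specs") -> Path | None:
    """Mirror `ls -td <root>/*/ | head -1`: newest visible directory by mtime, or None."""
    root = Path(specs_root)
    if not root.is_dir():
        return None
    # The shell glob `*` skips dot-directories, so do the same here.
    dirs = [p for p in root.iterdir() if p.is_dir() and not p.name.startswith(".")]
    return max(dirs, key=lambda p: p.stat().st_mtime, default=None)
```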

Convergence-loop caveat: --hands-off automatically runs run_stage_revalidation_loop (max 3 passes) after the step loop completes. As currently implemented, this loop only re-checks run.findings bookkeeping; it does not re-invoke the critic/analyze prompts to regenerate fresh findings. In practice, full-cycle.harness.yaml gets one remediation pass per run: the remediate-gates step, which re-runs /devspark.tasks to merge gates/critic.md and gates/analyze.md into a ## Gate Remediation task phase (see Gate Remediation Merge below). If findings remain open after that pass, the delivery-gate check should block create-pr readiness rather than silently proceeding. If a run reports unmet delivery status, check no-change-explainer.md and max-pass-failure-report.md.
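The caveat is easier to see in code. This minimal sketch mirrors the behavior described above, a loop that only re-reads finding statuses. The function name revalidate and its return shape are hypothetical; the three-pass default and the open/resolved/deferred statuses come from this page. Because nothing inside the loop regenerates findings, extra passes cannot change the outcome.

```python
from __future__ import annotations


def revalidate(findings: list[dict], max_passes: int = 3) -> tuple[bool, int]:
    """Return (converged, passes_used) by re-checking finding statuses only."""
    for pass_number in range(1, max_passes + 1):
        still_open = [f for f in findings if f.get("status") == "open"]
        if not still_open:
            return True, pass_number
        # Nothing here re-runs critic/analyze, so `findings` is unchanged and
        # the next pass sees exactly the same open items.
    return False, max_passes
```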

Gate Remediation Merge

/devspark.tasks detects whether tasks.md already exists. On a re-run (tasks.md already exists), it reads the findings: YAML from gates/critic.md and gates/analyze.md (both already in the Shared Review Resolution Contract shape: finding_id, severity, description, recommended_action, execution_mode, status, outcome), dedupes overlapping findings, sorts them by severity, and appends a ## Gate Remediation task phase whose tasks are tagged (resolves: <finding_id>). As each task completes, /devspark.implement flips the referenced finding's status to resolved in the originating gate file, so a subsequent /devspark.critic or /devspark.analyze re-run reports fewer open findings instead of repeating the same ones. This is prompt-level behavior (in templates/commands/tasks.md and templates/commands/implement.md), not harness-runtime code, so it works whether /devspark.tasks is invoked as a slash command, via devspark run, or via devspark harness run.
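The merge (dedupe, sort by severity, emit tagged tasks) can be sketched as follows. Several assumptions are not taken from the templates: duplicates are keyed on finding_id, severities rank critical > high > medium > low, only open findings become tasks, and the task-line format (including the R-numbered task IDs) is invented for illustration. In the real workflow this logic lives in prompt templates, not Python.

```python
from __future__ import annotations

SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}  # assumed scale


def merge_gate_findings(*gate_findings: list[dict]) -> list[dict]:
    """Combine findings from several gate files, dropping duplicate IDs."""
    merged: dict[str, dict] = {}
    for findings in gate_findings:
        for finding in findings:
            merged.setdefault(finding["finding_id"], finding)  # first one wins
    open_items = [f for f in merged.values() if f.get("status", "open") == "open"]
    # Unknown severities sort last; sorted() keeps input order among equals.
    return sorted(open_items, key=lambda f: SEVERITY_RANK.get(f["severity"], len(SEVERITY_RANK)))


def remediation_phase(findings: list[dict]) -> str:
    lines = ["## Gate Remediation", ""]
    for n, f in enumerate(findings, start=1):
        lines.append(f"- [ ] R{n:03d} {f['recommended_action']} (resolves: {f['finding_id']})")
    return "\n".join(lines)
```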

Hands-Off Troubleshooting

  • If run fails with write_incompatible_adapter, switch to a write-capable non-interactive adapter and rerun devspark adapter doctor.
  • If delivery_status is unmet, review no-change-explainer.md and ensure changes exist under src/ or test/.
  • If convergence fails after max passes, inspect max-pass-failure-report.md and resolve remaining findings manually before retrying — see the convergence-loop caveat under Full Unattended Lifecycle.
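The delivery gate referenced above (unmet when nothing changed under src/ or test/) can be approximated from git's list of changed files. A minimal sketch: delivery_status and changed_paths_since are hypothetical helpers, and using git diff --name-only against a base commit is an assumption about how changes are detected, not the runtime's documented mechanism (note that it ignores untracked files).

```python
from __future__ import annotations

import subprocess

DELIVERY_PREFIXES = ("src/", "test/")  # directories named on this page


def delivery_status(changed_paths: list[str]) -> str:
    """'met' if any changed path lives under src/ or test/, else 'unmet'."""
    normalized = [p.replace("\\", "/") for p in changed_paths]
    return "met" if any(p.startswith(DELIVERY_PREFIXES) for p in normalized) else "unmet"


def changed_paths_since(base: str = "HEAD") -> list[str]:
    # Assumption: compare the working tree to `base`; must run inside a git repo.
    out = subprocess.run(["git", "diff", "--name-only", base],
                         capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line]
```

Inside a repository, delivery_status(changed_paths_since()) gives a quick local read before rerunning.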

Relationship to the Prompt Workflow

The harness runtime does not replace BSW.DevSpark's prompt-first lifecycle.

  • use slash commands to define, plan, implement, review, and release work
  • use the harness runtime when you need repeatable terminal-driven execution and traceable run artifacts

That separation is intentional: prompt workflows manage human and agent collaboration, while the harness runtime executes declarative engineering flows.

Test Coverage

The harness runtime is covered by two kinds of tests, both under tests/:

pytest test modules (run via pytest tests/)

These use standard def test_* functions and are picked up automatically by the test runner.

File | Tests | What it covers
test_delivery_status_contract.py | 2 | Delivery gate logic: unmet when there are no src/ or test/ changes; met when src/ changes are present
test_convergence_loop_contract.py | 2 | Finding state transitions (open, resolved, deferred); stage iteration record structure

Run: pytest tests/ -v

Runnable contract scripts (run directly via python)

These use a main() entry point and validate end-to-end CLI behavior through typer.testing.CliRunner or subprocess. CI runs them in the contract-validation job.

File | What it covers
test_harness_validation_contract.py | devspark harness validate: loads and validates a spec YAML against the schema
test_harness_spec_contract.py | Spec model parsing, field validation, and constraint checking
test_harness_runner_contract.py | Full harness run lifecycle: artifacts written, exit codes, retry and abort paths
test_harness_adapters_contract.py | Adapter routing via agents-registry.json, step-level adapter resolution
test_adapter_doctor_contract.py | devspark adapter doctor: normalized readiness states (ready, write_approval_required, write_incompatible, unavailable)
test_hands_off_lifecycle_contract.py | --hands-off flag: a write-incompatible adapter triggers an abort; decision-packet.json and result.json artifacts are created

Run individually: python tests/test_harness_runner_contract.py

Run all:

python tests/test_harness_validation_contract.py \
  && python tests/test_harness_spec_contract.py \
  && python tests/test_harness_runner_contract.py \
  && python tests/test_harness_adapters_contract.py \
  && python tests/test_adapter_doctor_contract.py \
  && python tests/test_hands_off_lifecycle_contract.py
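The same fail-fast sequence can be driven from Python, which helps on shells without && chaining (for example, Windows PowerShell 5.1). A minimal sketch; run_contract_scripts is a hypothetical helper that mirrors the chain by stopping at the first non-zero exit and returning that code.

```python
from __future__ import annotations

import subprocess
import sys

CONTRACT_SCRIPTS = [
    "tests/test_harness_validation_contract.py",
    "tests/test_harness_spec_contract.py",
    "tests/test_harness_runner_contract.py",
    "tests/test_harness_adapters_contract.py",
    "tests/test_adapter_doctor_contract.py",
    "tests/test_hands_off_lifecycle_contract.py",
]


def run_contract_scripts(scripts: list[str]) -> int:
    """Run each script with the current interpreter; stop at the first failure."""
    for script in scripts:
        code = subprocess.run([sys.executable, script]).returncode
        if code != 0:
            print(f"FAILED: {script} (exit {code})")
            return code
        print(f"ok: {script}")
    return 0
```

From the repository root, sys.exit(run_contract_scripts(CONTRACT_SCRIPTS)) reproduces the && chain's overall exit status.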