Setting up a Hive experiment requires three things:
- A target — which files (or lines) the Hive is allowed to modify
- An evaluator — a script that scores each candidate solution
- A sandbox — the environment where evaluations run
Before setting one up, it is worth asking:
- Is the code you want to improve clearly separable from the surrounding harness?
- Do you have a quantifiable metric? If a higher score doesn’t mean a better algorithm, you’re measuring the wrong thing.
- Would domain knowledge or directional hints help narrow the search?
Define the target
The Hive needs to know which files in your codebase it is allowed to modify. This is specified in the configuration YAML using the `repo.target_code` field:
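A minimal sketch of the field (the file paths are hypothetical, and listing paths directly under `repo.target_code` is an assumed structure):

```yaml
repo:
  target_code:
    - src/solver.py        # hypothetical: a file the Hive may rewrite
    - src/heuristics.py    # hypothetical: a second modifiable file
```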
Write the evaluator
One of the most important parts of setting up a Hive experiment is defining an evaluator. Concretely, this is a Python script, specified in the configuration YAML file as a local path relative to the codebase root. It is run as `python path/to/evaluation.py` and must print a single-line JSON object to stdout; the score is read from
`output.fitness`. It is also possible to define an evaluator with multiple objectives to optimize simultaneously, e.g., by returning a dict of named objectives in `output.fitness`.
The Hive always assumes that objectives should be maximized. If you have a quantity you want to minimize, simply return the negative of that value in the fitness.
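Putting this together, a minimal single-objective evaluator might look like the following sketch. The workload and correctness check are placeholders; only the output shape follows the fields described here (`output.fitness`, `output.feedback_summary`, `metainfo`):

```python
import json
import time

def evaluate():
    # Placeholder workload standing in for the candidate code under test.
    start = time.perf_counter()
    result = sum(range(1000))
    elapsed = time.perf_counter() - start

    # Correctness gate: fail regardless of performance if the result is wrong.
    if result != 499500:
        return {"output": {"fitness": 0.0}, "metainfo": "Incorrect result"}

    return {
        "output": {
            # Runtime should be minimized, so report its negative
            # (the Hive always maximizes fitness).
            "fitness": -elapsed,
            "feedback_summary": f"Ran in {elapsed:.6f}s",
        },
        "metainfo": "Success",
    }

if __name__ == "__main__":
    # The Hive reads a single-line JSON object from stdout.
    print(json.dumps(evaluate()))
```

Because `json.dumps` emits compact single-line JSON by default, the stdout contract is satisfied without extra formatting work.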
| Field | Type | Description |
|---|---|---|
| `output.fitness` | number or object | Scalar value, or a dict for multi-objective |
| `output.feedback_summary` | string (optional) | Summary passed back to the agent |
| `metainfo` | string | `"Success"` on success; any other value indicates failure |
Tips and tricks
The Hive optimizes for the score your evaluator returns — make sure that score faithfully represents what you actually want to improve.
- Enforce correctness explicitly. Never assume the candidate code's output is correct. Always add tests or checks that guarantee the code produces correct results before scoring performance, and return a failure if it does not, regardless of how fast it runs.
- Guard against reward hacking. The Hive will exploit any shortcut that inflates the score — caching results, short-circuiting logic, or producing hardcoded outputs. Build validation checks into your evaluator and treat unexpectedly large score jumps with skepticism.
- Make evaluations deterministic. If your evaluator has stochasticity, run it multiple times and report the mean or median. Noisy scores make it harder for the Hive to distinguish genuine improvements.
- Keep evaluations fast. The Hive iterates faster with quick feedback. Balance evaluation thoroughness with speed — consider if a lighter test suite provides sufficient signal compared to an exhaustive one.
- Use relative scoring for timed benchmarks. Hardware performance can vary between sandboxes. If optimizing for runtime, instead of reporting raw times, run a baseline alongside each candidate and report the speedup factor.
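The relative-scoring tip can be sketched as follows (function names and the repeat count are my own; median-of-runs also follows the determinism tip above):

```python
import time

def time_fn(fn, repeats=5):
    # Median of several runs to dampen timing noise.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]

def speedup(baseline_fn, candidate_fn):
    # > 1.0 means the candidate is faster than the baseline run
    # on the same hardware, so the score is portable across sandboxes.
    return time_fn(baseline_fn) / max(time_fn(candidate_fn), 1e-12)
```

Reporting `speedup(...)` as the fitness keeps scores comparable even when different sandboxes land on different hardware.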
Configure the sandbox
Once the evaluator is implemented, consider what environment the code should run in. Each candidate solution is evaluated inside a sandboxed environment, which can be customized in the configuration YAML as follows:
- `base_image`: The Docker image used as the base environment.
- `setup_script`: Shell commands that run once when the sandbox is first created. Use this to install dependencies or download data.
- `resources`: CPU, RAM, and shared memory allocated to the sandbox. GPUs can be added with the format `<accelerator-name>:<num-gpus>`.

| Accelerator | GPU |
|---|---|
| `a100-80gb` | NVIDIA A100 80GB |
| `a100-40gb` | NVIDIA A100 40GB |
| `h100` | NVIDIA H100 |
| `h200` | NVIDIA H200 |
| `b200` | NVIDIA B200 |
| `a10` | NVIDIA A10 |
| `t4` | NVIDIA T4 |
| `l4` | NVIDIA L4 |
| `l40s` | NVIDIA L40S |

- `evaluation_timeout`: Maximum time in seconds before an evaluation is terminated.
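A sketch combining these fields (the field names follow the list above, but the image, resource values, and resource-key names are illustrative assumptions):

```yaml
base_image: python:3.11-slim     # assumed example image
setup_script: |
  pip install numpy              # runs once when the sandbox is created
resources:
  cpu: 4                         # assumed resource-key names
  ram: 8Gi
  accelerators: h100:1           # <accelerator-name>:<num-gpus>
evaluation_timeout: 600          # seconds before an evaluation is terminated
```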
Provide context
You can provide natural language context to guide the Hive’s search. This is specified in the `prompt.context` field:
Specific ideas or directional hints for the Hive to explore can be listed in the `prompt.ideas` field:
Additional context about the repository can be supplied via the `repo.additional_context` field.
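A sketch of the three context fields together (the values and their structure are illustrative assumptions; only the field names come from this page):

```yaml
prompt:
  context: |
    The target implements a routing heuristic; runtime on the
    benchmark set is the metric that matters.   # hypothetical description
  ideas:
    - Try caching intermediate distance computations   # hypothetical hint
repo:
  additional_context: docs/architecture.md   # hypothetical path; assumed scalar form
```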