Troubleshooting Performance

This chapter is meant to be navigated. Start at the first question and follow the links.

Before you start:

use the LAT, BW, and MEM tabs from Adding the Console Monitor
if you need offline evidence, generate stats.json with log-stats

Start Here

Q1: Are you missing your target rate?

Q2: Are you actually rate-limited by config?

Check runtime.rate_target_hz.

Q3: Is one task clearly dominating the latency tab?

Q4: Can that task finish later without blocking every cycle?

Q5: Do you have a long CPU-bound pipeline?

Think: several pure compute stages back to back, little blocking I/O, little waiting on hardware.

Q6: Is the problem mostly end-of-CopperList overhead or late collapse?

Large end-of-cycle overhead right away, while task latencies are acceptable: Enable async-cl-io
The system runs well at first, then degrades over time: Enable mmap-fsync
Neither: Continue to Q7

Q7: Are you pushing too much data?

Check the BW tab and log-stats:

large serialized CopperLists
high disk write rate
one edge with large avg_raw_bytes or throughput_bytes_per_sec
Yes
No

Q8: Are you running out of buffers or in-flight CopperLists?

Pool pressure in MEM tab: Tune memory pools
Too few in-flight CopperLists for async or parallel execution: Increase CopperList slots
Neither: Reduce thread oversubscription

Outcomes

Rate Target Is the Limiter

If runtime.rate_target_hz is lower than the rate you want, Copper is doing exactly what you asked.

runtime: (
    rate_target_hz: 100,
),

If you need a higher rate:

raise rate_target_hz
or remove it and run at best effort

If the system starts missing deadlines after that change, go back to Q3 or Q5.

Enable `async-cl-io`

Use this when the task path is fine, but the end of the CopperList is too expensive.

What it does:

keeps the DAG execution on the main loop
queues the completed CopperList to a dedicated serializer thread
recycles the slot later when serialization finishes

Minimal feature forwarding:

[features]
async-cl-io = ["cu29/async-cl-io"]

cargo run --features async-cl-io

Good fit:

src -> task_a -> task_b -> sink

The LAT tab looks acceptable, but BW shows large serialized CL cost.

Bad fit:

src -> huge_cpu_task -> sink

The hot spot is inside huge_cpu_task itself. Moving CL serialization off-thread does not fix that.

Two checks after enabling it:

if the benefit is small, your real bottleneck is elsewhere
if the serializer now waits for free CopperLists, raise logging.copperlist_count

Enable `parallel-rt`

Use this when you have several CPU-bound stages back to back and want higher global throughput.

What it does:

keeps deterministic FIFO order per generated process stage
lets multiple CopperLists be in flight at the same time
turns the runtime into a stage pipeline instead of a strictly one-CL-at-a-time loop

Minimal feature forwarding:

[features]
parallel-rt = ["cu29/parallel-rt"]

cargo run --features parallel-rt

Good fit:

src -> cpu_a -> cpu_b -> cpu_c -> cpu_d -> sink

Why:

many stages can work on different CopperLists at the same time
no single stage is overwhelmingly larger than the others

Bad fit:

src -> giant_cpu_task -> sink

Why:

the giant task is still the throughput limiter
you usually get more by optimizing or parallelizing that task directly

Also a poor fit:

camera -> wait_for_io -> bridge -> sink

Why:

the graph is dominated by waiting and I/O, not by a CPU pipeline

Two checks after enabling it:

if workers look idle, you may need a larger logging.copperlist_count
if throughput does not move, the graph probably lacks enough useful overlap

Enable `mmap-fsync`

Use this when the system runs well for a while, then starts degrading, and you suspect the OS is accumulating too many dirty pages from the memory-mapped logger.

What it does:

keeps the memory-mapped logger
adds synchronous file sync_all() on section flush
trades peak throughput for more explicit writeback

Minimal feature forwarding:

[features]
mmap-fsync = ["cu29/mmap-fsync"]

cargo run --features mmap-fsync

Good fit:

The robot behaves well for a while, BW looks reasonable, then the machine starts stalling.

Bad fit:

The loop is already compute-bound from the first second.

After enabling it, measure the cost. If the throughput penalty is too large:

reduce what you log
experiment with section_size_mib and slab_size_mib
or accept the default async writeback and provision the storage path better

Background One Task

Use this when one isolated source or task is too slow, but it does not need to block the whole loop.

(
    id: "heavy",
    type: "tasks::HeavyTask",
    background: true,
),

What it means semantically:

Copper runs that source or task on the background threadpool
while it is still busy, process() returns None
downstream tasks therefore see missing output for some cycles

Good fit:

src -> heavy_optional_stage -> sink

Why:

it is acceptable for the stage to sample or skip intermediate cycles

Bad fit:

src -> estimator -> controller -> actuator

Why:

the controller path usually needs one coherent output per cycle

Copper will ensure a threadpool resource bundle exists when background sources or tasks are present. If you need a specific sizing, provide that bundle explicitly instead of relying on the default.

Parallelize Inside One Task

Use this when the real bottleneck is inside one task, not in the DAG structure.

Typical cases:

per-pixel or per-point computation
large reduction or map kernels
CPU-heavy loops that can be split cleanly

Preferred approach:

keep the DAG simple
add a thread pool in resources or use a controlled parallel loop inside the task
measure the task again in the LAT tab

Good fit:

src -> expensive_image_task -> sink

Bad fit:

The real issue is serialized logging or OS writeback.

Reduce Logging Volume

Use this when BW and log-stats show that the logger is simply carrying too much data.

The first lever is per-task logging:

(
    id: "fast-sensor",
    type: "tasks::FastSensor",
    logging: (enabled: false),
),

The second lever is global task logging:

logging: (
    enable_task_logging: false,
),

Good fit:

One or two high-rate edges dominate avg_raw_bytes and throughput_bytes_per_sec.

Bad fit:

The hot spot is a CPU-heavy task, but the logged payloads are small.

The practical rule is simple:

keep logging on for the edges you actually need for replay or diagnosis
turn it off for bulky intermediate traffic first

Increase CopperList Slots

Use this when async or parallel execution is underfilled because there are not enough preallocated CopperLists available.

logging: (
    copperlist_count: 8,
),

This value is consumed by the generated runtime, so it needs to be compiled into the binary.

Good fit:

async-cl-io or parallel-rt is enabled, but the runtime still behaves as if only a tiny number
of CopperLists can be active.

Bad fit:

The graph has no useful overlap to exploit.

More slots help only when there is real work to overlap.

Tune Memory Pools

Use this when the MEM tab shows pressure on pooled buffers or handles.

Good fit:

handle-backed payloads stay alive across several stages and the pool remains near full.

Actions:

increase pool size
reduce payload lifetime
reduce the number of simultaneously retained CopperLists if that is what keeps the buffers alive

Bad fit:

The pool is healthy and the real issue is CPU or disk.

Reduce Thread Oversubscription

Use this when you stacked too many concurrency mechanisms at once:

parallel-rt
several background: true tasks
task-local Rayon pools or custom worker pools

Symptoms:

throughput does not improve
jitter gets worse
the machine is busy, but the useful rate barely moves

The fix is usually to simplify first:

use parallel-rt for DAG-level overlap
use background: true for one isolated non-blocking stage
use task-local parallelism for one hot algorithm

Do not enable all three everywhere by default.

Keyboard shortcuts

Learning copper-rs