Every solo-serving peer now runs three synthetic chats back-to-back against its own llama-server with no mesh layers in the path, and gossips the median. The catalog shows the through-mesh number, the native number, and the ratio between them as a coloured "mesh efficiency" percentage — making the cost of the entry tunnel, auth gateway, and routing layer measurable on every row.
Phase 3.0 publishes a second throughput figure for every solo-serving peer: the rate that same peer's llama-server reaches when called directly, with no mesh layers between the prompt and the model. The catalog now carries both numbers for the same peer-model pair, plus the ratio between them, so the cost of routing a request through the mesh is measurable on its own terms instead of inferred from a single value.
When a peer's llama-server reports Ready on the solo path, the runtime now spawns a background collector. After a 30-second settle delay it issues a single deterministic streaming completion (temperature=0, seed=42, max_tokens=128) directly to 127.0.0.1:llama_port — no entry tunnel, no auth gateway, no routing layer. The result is timed using the same TTFT and decode-rate logic that records through-mesh samples (including the same wall-clock fallback Phase 1 installed when decode windows collapse near zero), persisted at ~/.closedmesh/native-baselines.json keyed by model file mtime, and gossiped via a new repeated field on the peer announcement. It refreshes every 12 hours or when the model file changes.
On the catalog at closedmesh.com/status, every model row now carries up to three throughput stats: the median through-mesh tok/s (from Phase 1), the median native tok/s (new), and a coloured "mesh efficiency" percentage — green at 80%+, amber 50–80%, red below 50%. The math is through ÷ native: 1.00 means the mesh path matches the peer's local stack, and the gap below 1.00 is the budget available for optimising the entry tunnel, auth gateway, and routing layer. Each percentage point reclaimed there will show up on the same catalog row that exposed it.
Scope: the collector runs on the solo-launch path. Pipeline-host and MoE-shard peers reach their local llama-server through iroh tunnels to remote rpc-servers, so the same single-port measurement would already include network overhead and stop being a native baseline. A second collector that captures the rpc-tunnel cost separately is queued as a follow-up once the solo ratios surface a gap worth attributing in those modes. Peers running pre-v0.66.49 advertise an empty baselines field; the catalog renders those rows as "no measurement yet," keeping the missing-data state visibly distinct from measured-and-slow.
What the new column immediately surfaced was a runtime bug, not a routing one. v0.66.49 had been launching llama-server with a fitter argument (-fitt) set to 70% of the device's VRAM. The runtime treated that value as a ceiling on llama.cpp's GPU usage, but llama.cpp treats it as the amount of memory it must leave free on the device. On an 18 GB M3 Pro (Metal pool ~12 GiB) the runtime was therefore telling the fitter to keep ~9.7 GiB free — leaving ~2.6 GiB for the model, forcing 22 of 36 layers of an 8 B Q4_K_M model to CPU repack, and pinning native decode at 0.74 tok/s. v0.66.50 (a same-day hotfix) fixed an unrelated gossip-refresh path that was clearing the new metric on legacy peers. v0.66.51 replaced the fitter formula with a small absolute margin (1–2 GiB regardless of device size), and the same M3 Pro now reports 8.95 tok/s native and 2528 ms TTFT — a 12.1× decode and 2.9× TTFT improvement on a single config change, on hardware nothing else changed about.
v0.66.52 then surfaced a second problem on top of the first: the published native number was being decided by single-shot variance. A controlled config sweep on the same M3 Pro ran the identical (model, args, prompt) three times back-to-back and returned 15.44 → 21.08 → 37.09 tok/s — a 2.4× spread on identical input, with the slowest sample tracking GPU command-queue and Metal-clock warmup rather than sustained capability. The 8.95 tok/s value the catalog had been publishing was an artefact of which 10-second window the collector happened to land in, not what the hardware could do. v0.66.52 raised the per-refresh sample count from 1 to 3, runs the samples back-to-back with a 1 s pause, and publishes the per-axis median; the on-disk cache and the gossiped `samples` field both record the actual N. The same M3 Pro now reports 21.07 tok/s native and 143 ms TTFT — a 2.4× decode and 18× TTFT correction on the published number, with no change to the underlying hardware or model.
Correction (2026-06-12): the coloured "mesh efficiency" percentage described above was removed from the catalog, because the ratio measured the wrong thing. For a solo serve the through-mesh and native numbers come from the same llama-server on the same hardware, so the decode rate is identical by construction and the gap between them was sampling noise, not a routing cost. The network's actual cost is first-token latency — the tunnel and routing layer add to TTFT, not to the per-token decode rate — so a throughput ratio that read "60% efficient" implied a generation-speed penalty that does not exist. A TTFT-based overhead figure would have been just as misleading: the native baseline is a short synthetic prompt while through-mesh samples come from real, longer prompts, so most of that gap is prefill length rather than mesh overhead. The catalog now shows the raw numbers — through-mesh tok/s, native tok/s, and fastest first token — and lets them stand without a derived ratio. Pooled-split serves remain the one case where through-mesh throughput is genuinely lower, because the model is split across peers over WAN; the row's topology label says so.