Scratchpads and Recursive Decomposition: Local LLM Context Management for C2 Agent Generation

Adam Chester at SpecterOps recently published “Disposable Tooling: Building LLM-Generated Mythic Agents from Prompt to Deployment”. Great post, and it validated a lot of what we have been building independently at CyberDagger over the past several months. We have been working on the same core problem, using LLMs to generate C2 agents, but under a different set of constraints. Everything runs locally: no cloud APIs, no frontier models, no API keys, just open-source coding models in the 30-70B parameter range running on our own hardware. And we are not limited to one framework. Our coding agent targets both Mythic and Adaptix C2, with profiles for implant development, C2 profiles, payload containers, and general coding.

This post is about the scaffolding we built to make that work.

Why Local

Three reasons:

OPSEC. C2 agent source code going through a cloud API is a risk calculus we would rather skip. The tooling, the TTPs, the target-specific customizations: we prefer those stay on hardware we control.
Cost. Generating a Mythic agent takes dozens of LLM calls. Once the hardware is paid for, local inference has near-zero marginal cost. No rate limits, no quotas, no billing surprises mid-engagement.
Speed of iteration. We can burn through 50 attempts at a command implementation without waiting on API round-trips or worrying about token budgets. When you are experimenting with prompt structures and tool configurations, that velocity matters.

The tradeoff is capability. A 30B local model at 64K effective context does not hold a candle to a frontier cloud model at 200K. So we have to be smarter about how the model uses its window.

The Problem

Our C2 development agent runs local LLMs against a knowledge base of agent source code: Mythic agents (Poseidon, Medusa, Apollo), Adaptix beacons and gopher agents, C2 profiles, and extender plugins. The agent reads reference implementations, extracts patterns, and writes new commands, implant features, or entire agent skeletons.

Most models we run support 128K context natively, and the inference layer will serve them at full width with appropriate context configuration. But bigger windows do not solve everything. Quality degrades as context grows: attention diffuses, the model loses track of patterns established 40K tokens ago, and throughput tanks as KV cache eats VRAM. On commodity local hardware running a 30B model, 128K context is technically possible but practically painful.

A typical task, “build a portscan2 command”, requires:

1. Search the KB for the existing portscan implementation
2. Read 4-5 reference files (server-side definition, agent-side code, shared utilities, type definitions)
3. Extract patterns: import paths, type names, registration boilerplate, response patterns
4. Write 3 files: server-side command definition, agent-side implementation, build stub

Steps 1-2 consume ~20K tokens of context. At 64K, that is a third of the window gone before generation starts. The model can still generate, but it starts confusing patterns from file 1 with patterns from file 4. Import paths get hallucinated. Type names drift. The files it writes do not compile together because it mixed up details that were correct individually but combined wrong.

Throwing a bigger context window at this is like giving someone a bigger desk. It helps with clutter, but it does not help them remember what was in the first document they read an hour ago.

The agent framework already has two tiers of memory management:

Long-term memory: cross-session facts about the project, persisted to disk
Context compaction: when history approaches the window limit, old tool results are elided and older messages are LLM-summarized

What is missing is the middle tier: working memory that survives compaction. The model reads a reference file, but when compaction fires, the file content is gone, summarized into “read portscan.go (7137 chars)”. The patterns the model needed from that file are lost.
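To make the failure mode concrete, here is a minimal sketch of the elision step, assuming a simple message shape. The Message type, the stub and elideToolResults functions, and the keepRecent parameter are all hypothetical names for illustration, not the framework's actual API.

```typescript
// Hypothetical message shape; the framework's real types are not shown here.
type Message = {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  toolName?: string; // e.g. "read_file"
  toolArg?: string; // e.g. the path that was read
};

// One-line stand-in for an elided tool result.
function stub(m: Message): string {
  return `${m.toolName ?? "tool"} ${m.toolArg ?? ""} (${m.content.length} chars)`;
}

// Keep the newest `keepRecent` tool results verbatim and stub out the older ones.
function elideToolResults(history: Message[], keepRecent: number): Message[] {
  const toolIndexes: number[] = [];
  history.forEach((m, i) => {
    if (m.role === "tool") toolIndexes.push(i);
  });
  const cutoff = Math.max(0, toolIndexes.length - keepRecent);
  const toElide = new Set(toolIndexes.slice(0, cutoff));
  // Return new objects for elided entries so the input history is not mutated.
  return history.map((m, i) => (toElide.has(i) ? { ...m, content: stub(m) } : m));
}
```

Once the stub replaces the file body, nothing in the history carries the import paths or type names anymore, which is exactly the gap a working-memory tier has to fill.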

Approach 1: Scratchpad (Working Memory)

The simplest fix: give the model a place to write notes that compaction cannot touch.

scratch_set("portscan_imports", `
  "github.com/MythicAgents/poseidon/.../taskRegistrar"
  "github.com/MythicAgents/poseidon/.../structs"
`)

scratch_set("response_pattern", `
  msg := task.NewResponse()
  task.Job.SendResponses <- msg
`)

Four tools: scratch_set(key, value), scratch_get(key), scratch_list(), scratch_delete(key). Notes persist to .forge/scratchpad.json and are re-injected into the conversation after every compaction cycle.

The implementation is ~150 lines of TypeScript. The key design decisions:

Size limits: 4KB per note, 12KB total. The scratchpad has to fit inside the context window alongside the model’s current work. Unlimited notes would just recreate the original problem.
Re-injection point: Notes are injected as a user message immediately after compaction, so the model sees them before generating its next response.
Ephemeral scope: Cleared on /reset. This is working memory, not long-term storage.
Persistence to disk: Notes survive process crashes. The model can pick up where it left off.
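The design decisions above can be sketched roughly as follows. This is not our production code: the Scratchpad class, its method names, and the persist callback are illustrative assumptions, and size is measured in string length as a simplification rather than in true bytes.

```typescript
// Limits from the post: 4KB per note, 12KB total.
const MAX_NOTE_SIZE = 4 * 1024;
const MAX_TOTAL_SIZE = 12 * 1024;

// Simplification: counts UTF-16 code units, which equals bytes only for ASCII text.
const sizeOf = (s: string): number => s.length;

class Scratchpad {
  private notes = new Map<string, string>();
  private persist: (json: string) => void;

  // `persist` is a hypothetical hook; a real version would write .forge/scratchpad.json.
  constructor(persist: (json: string) => void = () => {}) {
    this.persist = persist;
  }

  set(key: string, value: string): string {
    if (sizeOf(value) > MAX_NOTE_SIZE) return "error: note too large";
    // Total size if this note is stored, not double-counting a note it overwrites.
    let total = sizeOf(value);
    this.notes.forEach((v, k) => {
      if (k !== key) total += sizeOf(v);
    });
    if (total > MAX_TOTAL_SIZE) return "error: scratchpad full";
    this.notes.set(key, value);
    this.save();
    return "ok";
  }

  get(key: string): string | undefined {
    return this.notes.get(key);
  }

  list(): string[] {
    return Array.from(this.notes.keys());
  }

  delete(key: string): boolean {
    const existed = this.notes.delete(key);
    if (existed) this.save();
    return existed;
  }

  // Called on /reset: working memory is discarded.
  reset(): void {
    this.notes.clear();
    this.save();
  }

  // Text to inject right after compaction, or null when there are no notes.
  render(): string | null {
    if (this.notes.size === 0) return null;
    const parts: string[] = [];
    this.notes.forEach((v, k) => {
      parts.push(`[${k}]\n${v}`);
    });
    return "Scratchpad notes:\n\n" + parts.join("\n\n");
  }

  private save(): void {
    const obj: Record<string, string> = {};
    this.notes.forEach((v, k) => {
      obj[k] = v;
    });
    this.persist(JSON.stringify(obj));
  }
}
```

Rejecting an oversized note with an error string, rather than truncating it, lets the model see the failure and save a shorter note. Silent truncation could cut an import path in half.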

Prompting the Model to Use It

The system prompt includes a workflow guide:

When you read a reference file, immediately scratch_set the patterns you’ll need later (imports, types, function signatures)
Continue reading other files. Compaction may fire and summarize earlier reads.
When you start writing code, scratch_get your saved patterns
The scratchpad survives compaction; the raw file contents don’t.

This is surprisingly effective. A 35B-class local model reliably uses scratch_set after reading reference files, and scratch_get when writing output. The model treats it like a clipboard, exactly the mental model we want.

Approach 2: RLM Sidecar (Recursive Decomposition)

MIT’s Recursive Language Models framework takes a fundamentally different approach: instead of managing context within one window, give each sub-task its own fresh window.

The model writes Python code in a REPL. Variables persist between iterations. When a sub-task is too large, it calls rlm_query(), which spawns a fresh LLM context with the full tool suite, runs the sub-task, and returns the result as a string.

import re

# Read reference code (stored in a Python variable - it persists)
ref = read_file("kb/.../portscan.go")

# Extract the body of each import ( ... ) block and join them into one string for the prompt
imports = "\n".join(re.findall(r'import \(([^)]+)\)', ref))

# Sub-task: write server-side file in a fresh context window
server_code = rlm_query(f"""
Write server-side Go for 'portscan2'.
Use these imports: {imports}
Registration: agentstructs.AllPayloadData.Get("poseidon").AddCommand(...)
Return ONLY the Go source code.
""")

write_file("agentfunctions/portscan2.go", server_code)

We built this as a Python sidecar (rlm/sidecar.py) that the TypeScript agent spawns as a subprocess. JSON-line protocol on stdin/stdout. The sidecar wraps MIT’s RLM library with coding tools (read/write/edit/glob/grep/kb_search) sandboxed to the project directory.

The key insight: Python variables are the working memory. The model reads a file into ref, extracts imports with regex, and passes them to rlm_query(). The sub-call gets a fresh context window containing only the extracted patterns, not the roughly 7K characters of raw source. Variables solve the context problem without any framework-level memory management.
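For the agent side of the JSON-line protocol, a minimal framing sketch looks something like this. The SidecarRequest and SidecarResponse shapes, encodeLine, and LineDecoder are assumptions made for illustration; the sidecar's real message schema is not described in this post.

```typescript
// Hypothetical message shapes; the real schema may differ.
type SidecarRequest = { id: number; task: string };
type SidecarResponse = { id: number; ok: boolean; result: string };

// One JSON document per line. JSON.stringify escapes embedded newlines,
// so a raw "\n" is a safe message delimiter.
function encodeLine(msg: SidecarRequest): string {
  return JSON.stringify(msg) + "\n";
}

// stdout arrives in arbitrary chunks, so buffer until a full line is available.
class LineDecoder {
  private buffer = "";

  push(chunk: string): SidecarResponse[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or "" if the chunk ended on a newline).
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as SidecarResponse);
  }
}
```

In a Node process, the subprocess's stdout "data" events would feed push(), and encodeLine() output would be written to its stdin. Buffering matters because a single generated Go file can easily span several stdout chunks.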

Head-to-Head: portscan2

We tested both approaches on the same task: build a Mythic Poseidon agent command called portscan2 with CIDR expansion, port range parsing, concurrent scanning, and banner grabbing. Both runs used the same 35B-class local model.

Metric	Scratchpad	RLM Sidecar
Time	~5 min	52 sec
Iterations	25 tool calls	16 REPL iterations
Files written	4	2
Context usage	Hit compaction twice	13% of threshold
Server-side	Correct registration + types	Correct registration + types
Agent-side core	Correct (taskRegistrar, response pattern)	Correct (taskRegistrar, response pattern)
CIDR expansion	Yes	Yes
Port range parsing	Yes (with bounds check)	Yes
Concurrency	chan struct{} semaphore	chan bool throttle (256 hardcoded)
Banner grabbing	Yes	No
Streaming results	Yes (partial responses)	No
Extra parameters	timeout_seconds, max_concurrent, grab_banners	None (hardcoded)
Compile errors	None	1 (Python ternary syntax in Go)

What the Scratchpad Got Right

The scratchpad version was more complete and more correct. It included all 5 requested parameters (hosts, ports, timeout_seconds, max_concurrent, grab_banners), implemented banner grabbing, produced streaming results, and compiled clean. The model stayed in control of requirements because it was working within a familiar tool-calling flow.

The downside: it took 5 minutes and hit the reflect assessor repeatedly. The assessor (a second LLM pass that evaluates output quality) criticized the model for not including full code in response text, even though the code was correctly written to files. Those wasted reflect rounds accounted for ~60% of wall-clock time.

What RLM Got Right

RLM was roughly 6x faster (52 seconds versus about 5 minutes) and used only 13% of available context. The recursive decomposition worked exactly as designed. The model read references into Python variables, extracted patterns, and generated code without context pressure.

The downside: it missed requirements. The model read the existing portscan implementation and essentially cloned it instead of building the enhanced spec. It did not use rlm_query() for sub-decomposition (the task fit in one window), hardcoded the concurrency limit, skipped banner grabbing entirely, and produced one line of Python syntax in Go code (if len > 0 else). The model’s Python REPL context leaked into its Go output.

What We Learned

Scratchpad wins on quality. RLM wins on speed. They solve different bottlenecks:

Scratchpad solves the retention problem: “I read 5 files and forgot the first one.” It is a simple, reliable mechanism that works with the model’s existing tool-calling behavior. Models naturally use it because scratch_set feels like saving a note, an intuitive action.
RLM solves the capacity problem: “This task needs 60K tokens of context but I only have 32K.” When each sub-task gets a fresh window, the aggregate context is no longer capped by a single window. But the model has to write competent Python to orchestrate the decomposition, and it has to faithfully translate requirements through the sub-call boundary.

The real win is combining them. For tasks within a single window (most commands), use scratchpad. For tasks that genuinely need recursive decomposition (multi-file refactors, large feature implementations), use RLM. The agent exposes both as tools: scratch_set/scratch_get for quick notes, rlm_task for heavy decomposition.
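One way to picture that split is a size-based heuristic. We have not built automatic routing yet, so treat this purely as a sketch: chooseRoute, the 4-characters-per-token estimate, and the 50% headroom default are all assumptions, not measured values.

```typescript
type Route = "scratchpad" | "rlm_task";

// Rough heuristic: about 4 characters per token. An assumption, not a measured ratio.
const estimateTokens = (chars: number): number => Math.ceil(chars / 4);

// Pick scratchpad when references plus expected output fit within a fraction
// (`headroom`) of the window; otherwise hand the task to recursive decomposition.
function chooseRoute(
  referenceChars: number[],
  expectedOutputTokens: number,
  windowTokens: number,
  headroom = 0.5
): Route {
  const inputTokens = referenceChars.reduce((sum, c) => sum + estimateTokens(c), 0);
  const needed = inputTokens + expectedOutputTokens;
  return needed <= windowTokens * headroom ? "scratchpad" : "rlm_task";
}
```

The headroom factor leaves room for the system prompt, tool schemas, and conversation history, all of which compete for the same window as the reference files.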

Model quality still dominates. Both systems are scaffolding. A model that misses requirements will miss them regardless of context management. The Python-in-Go syntax error from RLM is not a framework bug. It is a model bug. The scratchpad version’s completeness came from the model being more careful in tool-calling mode, not from the scratchpad itself.

Prompt engineering transfers poorly to tool selection. We told the model “Use rlm_task for this” and it ignored us and did direct tool calls instead. Local models default to their trained tool-calling behavior. Making a model choose an unfamiliar meta-tool over familiar direct tools requires either removing the direct tools or much stronger prompting. This is a known issue with tool routing in agentic systems.
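The “remove the direct tools” option can be as simple as filtering the tool list before each request. A minimal sketch, assuming tools are plain name-and-description records; ToolDef and restrictTools are hypothetical names, not the framework's API.

```typescript
type ToolDef = { name: string; description: string };

// Return only the tools named in `allowed`, preserving registration order.
function restrictTools(all: ToolDef[], allowed: string[]): ToolDef[] {
  const allow = new Set(allowed);
  const kept = all.filter((t) => allow.has(t.name));
  // Fail loudly instead of sending the model a request with no tools at all.
  if (kept.length === 0) throw new Error("restriction would leave the model with no tools");
  return kept;
}
```

For a task that should go through decomposition, the agent would expose only rlm_task plus the scratchpad tools, so the model has no familiar direct tools to fall back on.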

Validation is non-negotiable. Right now our validation is “does it compile and does the structure match the reference.” That is not enough. Multi-tier validation (mock server protocol checks, live execution testing, independent QA pass with a clean context) is on the roadmap. Adam’s three-tier approach in the SpecterOps post is a good reference architecture for this.

The Anti-Spin Fix

A small but important bug fix came out of this work. Our loop detector tracks tool call signatures to catch models that repeat the same action forever. After writing a file, the model naturally wants to read it back to verify. But if it had read that file earlier (to get reference patterns), the read signature was already in the “seen” set, and the detector flagged it as a repeated action, killing the loop.

The fix: after any write_file or edit_file, clear all read_file signatures from the seen set. A write changes the file, so reading it afterward is a new action, not a spin.

// A write changes the file, so earlier reads no longer count as repeats.
if (tc.name === "write_file" || tc.name === "edit_file") {
  for (const sig of executedSigs) {
    if (sig.startsWith("read_file ")) executedSigs.delete(sig);
  }
}

Five lines of code. Cost us an hour of debugging.
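For context, here is a self-contained sketch of a detector built around that fix. The ToolCall shape, the signature format, and the LoopDetector class are illustrative assumptions; only the write-clears-reads rule comes from our actual code.

```typescript
type ToolCall = { name: string; args: Record<string, string> };

// Signature = tool name plus its arguments in sorted key order,
// so the same call always produces the same string.
function signature(tc: ToolCall): string {
  const args = Object.keys(tc.args)
    .sort()
    .map((k) => `${k}=${tc.args[k]}`)
    .join(",");
  return `${tc.name} ${args}`;
}

class LoopDetector {
  private executedSigs = new Set<string>();

  // Returns true when this exact call was already seen (a likely spin).
  isRepeat(tc: ToolCall): boolean {
    const sig = signature(tc);
    if (this.executedSigs.has(sig)) return true;
    this.executedSigs.add(sig);
    if (tc.name === "write_file" || tc.name === "edit_file") {
      // A write changes the file, so a later read-back is a new action.
      Array.from(this.executedSigs).forEach((s) => {
        if (s.startsWith("read_file ")) this.executedSigs.delete(s);
      });
    }
    return false;
  }
}
```

Clearing every read signature is deliberately coarse. Tracking which path was written would be more precise, but a write can change what other reads mean too (for example, a shared types file), and an occasional extra read is cheap.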

Architecture

[Figure: architecture diagram]

What’s Next

Multi-tier validation: mock Mythic server for protocol checks, live execution testing, independent QA pass with clean context
Tuning the RLM decomposition prompt so the model actually uses rlm_query() for sub-tasks instead of doing everything in one window
Benchmarking across model families: a range of open-source coding models in the 30-70B parameter class. Which models benefit most from each approach?
Hybrid routing: automatically choose scratchpad vs RLM based on estimated task complexity
Full agent generation: scaling from single commands to full implant packages (builder, tasking handler, Dockerfile, the works) across both Mythic and Adaptix, all local

Bigger context windows help, but they are not free. KV cache costs VRAM, generation slows down, and attention quality degrades with distance. These techniques let local models work smarter with whatever context they have. The scratchpad costs almost nothing, and the RLM sidecar turns one overloaded window into many focused ones.

Thanks to Adam Chester (@xpn) at SpecterOps for the Disposable Tooling post. Good to see the community converging on these patterns from different angles.

