On-Host Inference: Autonomous AI Decision-Making Inside the Implant

At the SANS HackFest Summit in Hollywood in 2023, an audience member asked a question that stayed with us: could offensive tooling make intelligent, autonomous decisions directly on the endpoint, without waiting for operator input or C2 round-trips?

This post is our answer.

Most C2 implants are operationally blind. They beacon on a fixed schedule, use a hardcoded channel, and apply the same evasion configuration whether they are running on a fully patched enterprise endpoint with leading commercial EDR or on an unmonitored lab VM. The operator compensates through manual tuning, adjusting sleep timers and swapping profiles to make decisions that the implant itself could make if it could observe the host.

We built an autonomous AI inference engine to close that gap: a compact inference stack that runs inside the implant with zero runtime dependencies, makes environment-aware operational decisions without a C2 round-trip, and remains fully observable and overridable from the operator console. We validated it across Windows, macOS, and Linux on a live cyber range.

The Problem: Implants Operate Without Environmental Awareness

A beacon interval appropriate for a monitored enterprise endpoint will create inactivity-pattern anomalies on an unmonitored range VM if the operator does not tune it accordingly. Channel selection presents the same challenge: HTTPS is the correct default in environments with TLS inspection, but in environments without it, the overhead and traffic signature are unnecessary. Host triage is the more fundamental problem. Without intelligence about the value of the current host, the implant has no basis for deciding whether to persist, enumerate, or move on.

Defensive complexity compounds this further. Organizations calibrate their security stack to their risk appetite: a regulated enterprise may layer multiple endpoint detection solutions alongside network traffic analysis and user behavior analytics, while a development environment may rely on platform-native controls alone. The same static implant configuration cannot be simultaneously appropriate for both, and the gap between them is not predictable from outside the network.

The conventional response is manual operator intervention. Operators adjust sleep timers, swap channel profiles, and modify evasion configurations based on what they observe after the fact. This approach scales poorly under time pressure and across multiple simultaneous beacons. The round-trip to the operator introduces latency that matters operationally. Decisions that should take milliseconds take minutes.

The question we set out to answer was whether an AI model small enough to embed in an assessment payload, running on the host CPU with no GPU and no external calls, could make these decisions correctly.

Compiled Expertise, Not General Intelligence

The question that shaped this work was not how to add AI to an implant but what kind of AI is appropriate for a task like this.

Gary Klein’s research on expert decision-making under pressure established that experienced practitioners do not reason analytically when time and context are constrained. They recognize a situation based on accumulated experience and retrieve a prepared response. A senior red teamer observing a host with elevated monitoring activity during business hours does not enumerate options and compute expected values. They recognize the pattern and act. The computation takes milliseconds and draws on internalized experience, not conscious deliberation.

This is the cognitive model the inference layer implements. The decision logic is not a general reasoning engine. It is a mathematical encoding of specific expert judgment: when to beacon cautiously, which channel suits which environment, which host warrants persistent access. Operator expertise is distilled into the inference layer at build time, and at runtime the implant retrieves the right response for the observed situation rather than deliberating its way to one.

Small is a correct design choice for this kind of task, not a constraint to work around. The decision complexity of beacon timing is bounded: a handful of defensive posture signals map to a handful of possible actions, and a compact model has ample capacity for that decision space. A general-purpose language model operating on the same task brings orders of magnitude more capacity, but that excess buys language reasoning that an on-host implant must never perform. The latency would be measured in seconds, the memory footprint in gigabytes, and any network call would risk a detection event.

The procedural knowledge encoded here requires none of that. It requires mapping a feature vector to an action in well under a millisecond, with no external communication.

The cross-platform transfer result, covered later in this post, is consistent with this model. Abstract judgment transfers across operating systems without retraining because it is genuinely abstract: elevated monitoring activity is a threat signal regardless of which platform generates it. Platform-specific technique selection does not transfer because it encodes concrete procedures tied to specific operating system internals. An experienced operator moving from Windows to macOS retains their abstract judgment about when to be cautious but must relearn platform-specific execution. The inference layer mirrors this split: abstract decisions transfer, while capability selection is per-platform.

Architecture: Capability Layer with Operator Override

The inference layer observes the host’s current defensive posture and produces decisions in the categories where adaptive behavior matters most: how aggressively to beacon, which channel to use, how to triage the current host, which evasion behaviors are appropriate, and whether the environment looks like an analysis sandbox. The observations are abstract security signals (presence of monitoring tooling, host role, privilege context, process environment, time of day) rather than OS-specific artifacts, which is what allows the same decisions to make sense across Windows, macOS, and Linux.

A doctrine layer sits above the inference layer and enforces phase-appropriate constraints. As the assessment progresses through operational phases, the doctrine layer gates which inference outputs are valid at each stage. The inference layer advises; the doctrine layer determines whether that advice is appropriate given the current phase.
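The gating relationship described above can be illustrated with a small sketch. This is not the engine's actual implementation: the phase names, the action labels, and the `gate` function are all hypothetical placeholders, chosen only to show how an advisory output might be checked against a per-phase allowlist and replaced with a conservative fallback when it is not permitted.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical phase names; the post does not enumerate the real ones.
    INITIAL = "initial"
    ESTABLISHED = "established"
    WRAP_UP = "wrap_up"


# Hypothetical allowlist: which advisory outputs are valid in each phase.
ALLOWED = {
    Phase.INITIAL: {"observe", "hold"},
    Phase.ESTABLISHED: {"observe", "hold", "proceed"},
    Phase.WRAP_UP: {"hold"},
}


def gate(phase: Phase, recommendation: str, fallback: str = "hold") -> str:
    """Return the recommendation if the phase permits it, else the fallback."""
    if recommendation in ALLOWED[phase]:
        return recommendation
    return fallback


if __name__ == "__main__":
    # The inference layer advises "proceed", but the initial phase forbids it.
    print(gate(Phase.INITIAL, "proceed"))
    print(gate(Phase.ESTABLISHED, "proceed"))
```

The point of the sketch is the direction of authority: the inference output is only ever an input to the gate, never a way around it.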

The operator remains in command throughout. Override controls allow operators to set the operational phase, force specific outputs, or suppress any inference recommendation in real time. The inference layer informs operational behavior; it does not replace operator judgment. Every decision is visible in the operator console and can be countermanded at any point in the engagement.
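One way to picture the override precedence described above is a resolver that always prefers operator input and records where each decision came from. The `Override` type, the `resolve` function, and the source labels below are assumptions made for illustration; the post does not describe the console's real data model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Override:
    # Hypothetical operator controls: force a specific output, or
    # suppress the inference recommendation entirely.
    forced: Optional[str] = None
    suppressed: bool = False


def resolve(recommendation: str, override: Override,
            default: str = "hold") -> Tuple[str, str]:
    """Return (decision, source) so every decision stays attributable."""
    if override.forced is not None:
        return override.forced, "operator-forced"
    if override.suppressed:
        return default, "operator-suppressed"
    return recommendation, "inference"


if __name__ == "__main__":
    print(resolve("proceed", Override()))
    print(resolve("proceed", Override(suppressed=True)))
    print(resolve("proceed", Override(forced="observe")))
```

Returning the source alongside the decision is a design choice in this sketch: it keeps the visibility property explicit, since a console can show whether a given behavior came from inference or from the operator.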

The inference stack is embedded at build time. There is no model loading at runtime, no file I/O, and no external network call to make a decision. Inference completes in well under a millisecond on the host CPU, with no GPU dependency.

Training: Distilling Operator Expertise

The inference layer is trained by distilling operator expertise into a form the implant can carry. Experienced red-team judgment about defensive posture (what a fully patched commercial-EDR endpoint warrants versus what an unmonitored lab VM warrants, what a domain controller justifies versus what a developer laptop does not) is encoded as a deterministic training signal. The compact inference layer learns that signal directly, so at runtime it produces the operator’s own decision rather than an approximation of one.

The design principle is consistent across every decision the inference layer makes: encode the judgment in the training signal, then compress it to a form small enough to ship inside the payload with no runtime dependency.

Cross-Platform AI Generalization

The inference layer observes abstract defensive posture signals rather than OS-specific artifacts. A platform-specific collection layer translates the security mechanisms present on each operating system into a common representation, mapping equivalent controls to the same abstract dimensions regardless of which platform implements them.

Because the inference layer operates on this common representation without seeing OS-specific identifiers, abstract decisions deploy across all three platforms without retraining. Judgment trained against Windows defensive scenarios remains correct on macOS and Linux because the underlying security posture signals have valid equivalents on both.

Platform-specific capability selection (deciding which evasion behaviors are safe in a given operating environment) does need per-platform training. Technique semantics differ meaningfully by operating system, so capability selection is trained for each platform independently.

The architecture cleanly separates what to decide (abstract, cross-platform) from how to implement that decision (platform-specific, per-platform trained).

Temporal Awareness

Point-in-time decisions cannot see patterns that only emerge over time. A host exhibiting a sustained increase in monitoring activity across multiple observation intervals is likely under active investigation, a pattern that warrants more caution than a single snapshot would indicate.

A temporal awareness layer adds sequence reasoning above the point-in-time inference. Its only authority is to pull beacon decisions toward more caution; it cannot make the implant more aggressive than the point-in-time inference alone would. That asymmetry is the safety property that matters: if the temporal layer is wrong, the cost is unnecessary caution rather than a detection event. Until sufficient context has accumulated, the point-in-time decision passes through unchanged.
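The asymmetry described above reduces to a simple invariant: the adjusted caution level is never lower than the point-in-time level. The sketch below illustrates that invariant only. It swaps in a naive "strictly rising" trend check for whatever sequence reasoning the real layer uses, and the integer caution scale, the `TemporalGuard` class, and the `min_history` parameter are all hypothetical.

```python
from collections import deque


class TemporalGuard:
    """Caution-only adjustment over a short window of observations.

    Caution is an integer where higher means more cautious (an assumed
    scale). The trend check is a stand-in for real sequence reasoning.
    """

    def __init__(self, min_history: int = 3):
        self.min_history = min_history
        self.history = deque(maxlen=min_history)

    def adjust(self, point_caution: int, monitoring_level: int) -> int:
        self.history.append(monitoring_level)
        # Not enough context yet: pass the point-in-time decision through.
        if len(self.history) < self.min_history:
            return point_caution
        window = list(self.history)
        rising = all(b > a for a, b in zip(window, window[1:]))
        proposed = point_caution + 1 if rising else point_caution
        # The max() enforces the asymmetry: never less cautious than
        # the point-in-time decision, whatever the trend logic proposes.
        return max(point_caution, proposed)


if __name__ == "__main__":
    guard = TemporalGuard(min_history=3)
    for level in (0, 1, 2, 2):
        print(guard.adjust(point_caution=1, monitoring_level=level))
```

Because the final `max()` is applied unconditionally, a bug in the trend logic can only cost unnecessary caution, which is the failure mode the text identifies as acceptable.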

Range Validation

We validated the inference stack across Windows, macOS, and Linux hosts spanning a variety of security configurations: fully patched systems running leading commercial EDR, domain infrastructure, and hosts with platform-native security controls only. Behavioral differentiation matched expected outcomes across every tested host profile. The implant correctly calibrated beacon posture and evasion behavior to the observed threat level on each host, and triage decisions reflected the relative value of the infrastructure encountered.

Live temporal behavior was observed during the deployment. The temporal awareness layer pulled beacon decisions toward more caution in response to accumulated context, then relaxed once the environment was assessed as stable, with a second tightening event later on one host. The point-in-time decisions were stable throughout; the behavioral shifts were driven entirely by the temporal layer’s assessment of decision history. That is exactly the property the architecture was designed to produce.

For breach and attack simulation customers, on-host AI inference is available as an opt-in capability. Rather than executing static attack patterns, the on-host agent adapts its decisions in real time to the defensive environment it encounters, providing a more accurate measure of whether your security stack detects adaptive, intelligent tradecraft or only known signatures. Operators retain full visibility and override authority at every decision point. The AI augments the engagement; it does not run it.

Reach out to learn more or enable it for your next engagement.

Tags: research red-team AI machine-learning implant on-host-inference evasion CATM
