Evaluation Framework
Deterministic test harnesses, DDIL simulation, red-team fault injection, HITL review, and the 10 evaluation dimensions
OMEN evaluation is inspired by the deterministic, multi-dimensional assessment methodology of DaScient ARES-E. The core principle: mission software must be continuously and verifiably trustworthy, not just tested once before release.
10 Evaluation Dimensions
| # | Dimension | Key Question |
|---|---|---|
| 1 | Functional Correctness | Does the system do what it is supposed to do? |
| 2 | Mission Utility | Does it improve aircrew decision-making under operational conditions? |
| 3 | Performance | Does it meet startup, render, and response time targets? |
| 4 | DDIL Resilience | Does it remain useful when connectivity is degraded or absent? |
| 5 | Usability | Can aircrew operate it under realistic cognitive load? |
| 6 | Security | Does it resist adversarial inputs, unauthorized access, and tamper? |
| 7 | Interoperability | Does it correctly ingest and emit data across all supported protocols? |
| 8 | Maintainability | Can it be updated, debugged, and extended by a small team? |
| 9 | Auditability | Can every action, decision, and data transformation be traced? |
| 10 | Energy Efficiency | Does it stay within compute and thermal budgets on constrained hardware? |
Test Harnesses
Location: evaluation/harnesses/
Deterministic, reproducible test harnesses run with fixed random seeds and synthetic data only.
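The fixed-seed principle can be sketched as follows. This is an illustrative helper, not OMEN's actual fixture generator: the function name and track fields are assumptions, but the pattern — a locally seeded RNG so every harness run produces identical synthetic data — is the one described above.

```python
import random

def synthetic_tracks(n: int, seed: int = 42) -> list[dict]:
    """Generate n synthetic track entities deterministically.

    Hypothetical fixture generator: a fixed seed makes every
    harness run byte-for-byte reproducible.
    """
    rng = random.Random(seed)  # local RNG; no global state touched
    return [
        {
            "track_id": f"TRK-{i:04d}",
            "lat": rng.uniform(-90.0, 90.0),
            "lon": rng.uniform(-180.0, 180.0),
        }
        for i in range(n)
    ]

# Two runs with the same seed yield identical fixtures.
assert synthetic_tracks(5) == synthetic_tracks(5)
```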
Adapter Harness — evaluation/harnesses/adapter_harness.py
The AdapterHarness validates any BaseAdapter implementation against the adapter contract:
| Test | What It Checks |
|---|---|
| _test_adapter_id() | Attribute exists and is a non-empty string |
| _test_supported_formats() | Returns a non-empty list of format strings |
| _test_health_returns_data() | health() returns a valid AdapterHealth object |
| _test_schema_returns_dict() | schema() returns a dictionary (JSON Schema) |
| _test_validate_valid_samples() | Valid test inputs pass validation |
| _test_validate_invalid_samples() | Invalid test inputs are rejected with error details |
| _test_ingest_valid_samples() | ingest() returns a list of canonical entities |
| _test_provenance_populated() | Every ingested entity carries a provenance chain |
Output: A ComplianceReport with adapter_id, pass/fail verdict, passed/failed test counts, failure details, and duration in milliseconds.
```python
from evaluation.harnesses.adapter_harness import AdapterHarness

harness = AdapterHarness(
    adapter=my_adapter,
    valid_samples=[valid_geojson],
    invalid_samples=[invalid_geojson],
)
report = harness.run_compliance_tests()
print(f"Passed: {report.passed} ({report.passed_tests}/{report.total_tests})")
```
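The report fields listed above might be shaped roughly like this dataclass. This is a sketch for orientation only; the real class in evaluation/harnesses/adapter_harness.py may use different field names or types.

```python
from dataclasses import dataclass, field

@dataclass
class ComplianceReport:
    """Sketch of the report shape described above (field names assumed)."""
    adapter_id: str
    passed: bool        # overall pass/fail verdict
    passed_tests: int
    total_tests: int
    failures: list = field(default_factory=list)  # per-test failure details
    duration_ms: float = 0.0

report = ComplianceReport("geojson", True, 8, 8, duration_ms=12.5)
print(f"Passed: {report.passed} ({report.passed_tests}/{report.total_tests})")
```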
Engine Harness — evaluation/harnesses/engine_harness.py
The TestEngineHarness validates the Mission Engine’s core runtime:
| Test | What It Checks |
|---|---|
| test_plugin_start_stop_lifecycle() | Register → start → verify running → stop → verify stopped |
| test_event_dispatch_reaches_plugin() | Events dispatched to started plugins are received |
| test_fault_isolation() | A crashing plugin does not block other plugins from receiving events |
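The fault-isolation property in the last row can be sketched as follows. The plugin interface and dispatch function here are assumptions (the Mission Engine's real API is not shown in this section), but the pattern is standard: wrap each plugin callback so one exception cannot stop delivery to the rest.

```python
# Hypothetical plugins used to exercise the isolation boundary.
class CrashingPlugin:
    def on_event(self, event):
        raise RuntimeError("boom")

class RecordingPlugin:
    def __init__(self):
        self.events = []
    def on_event(self, event):
        self.events.append(event)

def dispatch(plugins, event):
    """Deliver event to every plugin, isolating per-plugin failures."""
    errors = []
    for p in plugins:
        try:
            p.on_event(event)
        except Exception as exc:  # fault isolation boundary
            errors.append((p, exc))
    return errors

recorder = RecordingPlugin()
errors = dispatch([CrashingPlugin(), recorder], {"type": "track_update"})
assert recorder.events      # the crash did not block delivery
assert len(errors) == 1     # but it was captured, not swallowed
```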
DDIL Simulation
Location: evaluation/ddil/test_ddil_scenarios.py
Network Impairment Profiles
| Profile | Bandwidth | Latency | Packet Loss | Description |
|---|---|---|---|---|
| full | Unlimited | < 5 ms | 0% | Full connectivity baseline |
| degraded | 256 kbps | 200 ms | 2% | Degraded tactical link |
| intermittent | 64 kbps | 500 ms | 15% | Intermittent SATCOM |
| near-offline | 8 kbps | 2000 ms | 40% | Near-disconnected |
| offline | 0 | N/A | 100% | Fully disconnected |
Network impairment is applied using Linux tc netem (traffic control) in containerized test environments.
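One way to map the profiles above onto tc netem arguments is sketched below. The profile values come from the table; the helper itself is illustrative — the real harness may invoke tc differently (e.g. inside a container network namespace), and the interface name is an assumption.

```python
# Impairment profiles from the table above, expressed as netem options.
PROFILES = {
    "degraded":     {"rate": "256kbit", "delay": "200ms",  "loss": "2%"},
    "intermittent": {"rate": "64kbit",  "delay": "500ms",  "loss": "15%"},
    "near-offline": {"rate": "8kbit",   "delay": "2000ms", "loss": "40%"},
    "offline":      {"loss": "100%"},
}

def netem_cmd(profile: str, dev: str = "eth0") -> list[str]:
    """Build the tc command for a named impairment profile."""
    p = PROFILES[profile]
    cmd = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if "delay" in p:
        cmd += ["delay", p["delay"]]
    if "loss" in p:
        cmd += ["loss", p["loss"]]
    if "rate" in p:
        cmd += ["rate", p["rate"]]
    return cmd

print(" ".join(netem_cmd("degraded")))
# tc qdisc add dev eth0 root netem delay 200ms loss 2% rate 256kbit
```

Running the resulting command requires root (or CAP_NET_ADMIN), which is why these scenarios run only in containerized test environments.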
DDIL Test Scenarios
| Scenario | What It Tests | Docker Required |
|---|---|---|
| test_offline_mission_package_loads() | Mission package structure loads without network | No |
| test_staleness_detection() | Entity older than TTL is marked stale | No |
| test_fresh_entity_not_stale() | Recently received entity is not marked stale | No |
| test_network_impairment_degraded() | Operation at 256 kbps / 200 ms / 2% loss | Yes |
| test_link_recovery_delta_sync() | Delta sync after link recovery produces correct state | Yes |
Red-Team & Fault Injection
Location: evaluation/red-team/test_malformed_inputs.py
Adversarial tests that probe the system’s boundaries:
Malformed Input Tests
| Test | Input | Expected Behavior |
|---|---|---|
| test_cot_invalid_xml_rejected() | Non-XML string | Validation rejects with error |
| test_cot_wrong_root_element_rejected() | <data> instead of <event> | Wrong root element caught |
| test_cot_missing_point_rejected() | CoT XML without <point> | Missing point child rejected |
| test_geojson_wrong_type_rejected() | Single Feature (not FeatureCollection) | Type validation fails |
| test_geojson_invalid_json_string_rejected() | Malformed JSON string | Parse error caught |
| test_geojson_out_of_bounds_coordinates() | lat > 90, lon > 180 | Pydantic constraints fire; adapter skips defensively |
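The out-of-bounds case in the last row can be sketched without dependencies. The real adapter enforces the bounds via Pydantic field constraints (e.g. `Field(ge=-90, le=90)` on latitude); this dependency-free version shows the same defensive-skip behavior — an invalid feature is set aside rather than crashing the whole ingest batch. The helper names are illustrative:

```python
def valid_coords(lon: float, lat: float) -> bool:
    """WGS-84 bounds check: lon in [-180, 180], lat in [-90, 90]."""
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0

def ingest_features(features: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split features into accepted and defensively skipped."""
    accepted, skipped = [], []
    for f in features:
        lon, lat = f["geometry"]["coordinates"][:2]
        (accepted if valid_coords(lon, lat) else skipped).append(f)
    return accepted, skipped

good = {"geometry": {"coordinates": [10.0, 45.0]}}
bad = {"geometry": {"coordinates": [200.0, 95.0]}}  # lon > 180, lat > 90
accepted, skipped = ingest_features([good, bad])
assert accepted == [good] and skipped == [bad]
```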
Fault Categories
| Category | Examples |
|---|---|
| Malformed inputs | Invalid XML, truncated CoT, out-of-bounds coordinates |
| Data poisoning | Spoofed track positions, forged threat overlays |
| Resource exhaustion | High-rate message injection, large payload flooding |
| Protocol abuse | Replay attacks, out-of-sequence messages |
| Schema drift | Future-version schemas the current adapter has not seen |
| AI adversarial | Adversarial prompts to AI summarization services |
Human-in-the-Loop (HITL) Review
Review Gate Triggers
- AI-generated route recommendations
- Threat summarization outputs
- Conflict resolution suggestions from the CAL
- Any AI action with mission-level consequence
HITL Workflow
```
AI Service → [Generate Recommendation] → [HITL Queue]
                                             │
                              Reviewer receives recommendation
                              with provenance, confidence, and
                              supporting evidence
                                             │
                                   ┌─────────┴───────────┐
                                   ▼                     ▼
                              [Approve]          [Reject / Modify]
                                   │                     │
                            Action proceeds       Action blocked;
                            with HITL stamp       feedback logged
```
Scenario Replay
Operational scenarios can be recorded and replayed deterministically:
```json
{
  "scenario_id": "alpha-01",
  "description": "Single aircraft route with one NOTAM and two threat updates",
  "events": [
    {"t": 0, "type": "track_update", "source": "cot", "payload": "..."},
    {"t": 5, "type": "notam_received", "source": "notam", "payload": "..."},
    {"t": 12, "type": "threat_update", "source": "cot", "payload": "..."}
  ],
  "expected_state": { "...": "..." },
  "pass_criteria": ["track_visible", "notam_overlay_active", "threat_corridor_rendered"]
}
```
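A minimal replay runner for this format might look like the following. The `apply_event` callback is a placeholder; the real harness would dispatch each event through the Mission Engine and then check the scenario's `pass_criteria` against the resulting state.

```python
def replay(scenario: dict, apply_event) -> dict:
    """Apply recorded events in timestamp order; return the final state."""
    state = {}
    for event in sorted(scenario["events"], key=lambda e: e["t"]):
        apply_event(state, event)
    return state

# Events deliberately out of order to show deterministic t-ordering.
scenario = {
    "scenario_id": "alpha-01",
    "events": [
        {"t": 5, "type": "notam_received"},
        {"t": 0, "type": "track_update"},
    ],
}

order = []
replay(scenario, lambda state, e: order.append(e["type"]))
assert order == ["track_update", "notam_received"]
```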
Pass/Fail Criteria
| Criterion | Threshold |
|---|---|
| No critical mission workflow breaks | 0 failures in engine, CAL, map harnesses |
| Stable offline operation | All offline DDIL scenarios pass within resource budget |
| Data provenance preserved | 100% of canonical entities carry complete provenance chain |
| UI usable under constrained conditions | Render time SLO met in degraded and intermittent profiles |
| No unapproved AI action | 0 AI actions executed without HITL approval in governance-gated scenarios |
| Logging and telemetry intact | 100% of auditable events produce a log entry |
Running the Evaluation Suite
```bash
cd evaluation
pip install -r requirements.txt

# Run adapter and engine harnesses (no Docker required)
PYTHONPATH=.. pytest harnesses/ -v

# Run DDIL scenarios (requires Docker for network shim)
PYTHONPATH=.. pytest ddil/ -v --docker

# Run red-team tests (isolated environment)
PYTHONPATH=.. pytest red-team/ -v --isolated

# Run all tests except Docker-dependent ones (from the repository root)
cd ..
PYTHONPATH=. pytest engine/core/ examples/ evaluation/ -v -k "not docker"
```