Evaluation Framework
Deterministic test harnesses, DDIL simulation, red-team fault injection, HITL review, and the 10 evaluation dimensions
OMEN evaluation is inspired by the deterministic, multi-dimensional assessment methodology of DaScient ARES-E. The core principle: mission software must be continuously and verifiably trustworthy, not just tested once before release.
10 Evaluation Dimensions
| # | Dimension | Key Question |
|---|---|---|
| 1 | Functional Correctness | Does the system do what it is supposed to do? |
| 2 | Mission Utility | Does it improve aircrew decision-making under operational conditions? |
| 3 | Performance | Does it meet startup, render, and response time targets? |
| 4 | DDIL Resilience | Does it remain useful when connectivity is degraded or absent? |
| 5 | Usability | Can aircrew operate it under realistic cognitive load? |
| 6 | Security | Does it resist adversarial inputs, unauthorized access, and tamper? |
| 7 | Interoperability | Does it correctly ingest and emit data across all supported protocols? |
| 8 | Maintainability | Can it be updated, debugged, and extended by a small team? |
| 9 | Auditability | Can every action, decision, and data transformation be traced? |
| 10 | Energy Efficiency | Does it stay within compute and thermal budgets on constrained hardware? |
Test Harnesses
Location: evaluation/harnesses/
Deterministic, reproducible test harnesses run with fixed random seeds and synthetic data only.
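The fixed-seed principle can be sketched as follows. This is an illustrative helper, not OMEN's actual fixture generator: the function name and track fields are assumptions, but the pattern — a locally seeded RNG so every harness run produces identical synthetic data — is the one described above.

```python
import random

def synthetic_tracks(n: int, seed: int = 42) -> list[dict]:
    """Generate n synthetic track entities deterministically.

    Hypothetical fixture generator: a fixed seed makes every
    harness run byte-for-byte reproducible.
    """
    rng = random.Random(seed)  # local RNG; no global state touched
    return [
        {
            "track_id": f"TRK-{i:04d}",
            "lat": rng.uniform(-90.0, 90.0),
            "lon": rng.uniform(-180.0, 180.0),
        }
        for i in range(n)
    ]

# Two runs with the same seed yield identical fixtures.
assert synthetic_tracks(5) == synthetic_tracks(5)
```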
Adapter Harness — evaluation/harnesses/adapter_harness.py
The AdapterHarness validates any BaseAdapter implementation against the adapter contract:
| Test | What It Checks |
|---|---|
| _test_adapter_id() | Attribute exists and is a non-empty string |
| _test_supported_formats() | Returns a non-empty list of format strings |
| _test_health_returns_data() | health() returns a valid AdapterHealth object |
| _test_schema_returns_dict() | schema() returns a dictionary (JSON Schema) |
| _test_validate_valid_samples() | Valid test inputs pass validation |
| _test_validate_invalid_samples() | Invalid test inputs are rejected with error details |
| _test_ingest_valid_samples() | ingest() returns a list of canonical entities |
| _test_provenance_populated() | Every ingested entity carries a provenance chain |
Output: A ComplianceReport with adapter_id, pass/fail verdict, passed/failed test counts, failure details, and duration in milliseconds.
```python
from evaluation.harnesses.adapter_harness import AdapterHarness

harness = AdapterHarness(
    adapter=my_adapter,
    valid_samples=[valid_geojson],
    invalid_samples=[invalid_geojson],
)
report = harness.run_compliance_tests()
print(f"Passed: {report.passed} ({report.passed_tests}/{report.total_tests})")
```
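The report fields listed above might be shaped roughly like this dataclass. This is a sketch for orientation only; the real class in evaluation/harnesses/adapter_harness.py may use different field names or types.

```python
from dataclasses import dataclass, field

@dataclass
class ComplianceReport:
    """Sketch of the report shape described above (field names assumed)."""
    adapter_id: str
    passed: bool        # overall pass/fail verdict
    passed_tests: int
    total_tests: int
    failures: list = field(default_factory=list)  # per-test failure details
    duration_ms: float = 0.0

report = ComplianceReport("geojson", True, 8, 8, duration_ms=12.5)
print(f"Passed: {report.passed} ({report.passed_tests}/{report.total_tests})")
```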
Engine Harness — evaluation/harnesses/engine_harness.py
The TestEngineHarness validates the Mission Engine’s core runtime:
| Test | What It Checks |
|---|---|
| test_plugin_start_stop_lifecycle() | Register → start → verify running → stop → verify stopped |
| test_event_dispatch_reaches_plugin() | Events dispatched to started plugins are received |
| test_fault_isolation() | A crashing plugin does not block other plugins from receiving events |
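The fault-isolation property in the last row can be sketched as follows. The plugin interface and dispatch function here are assumptions (the Mission Engine's real API is not shown in this section), but the pattern is standard: wrap each plugin callback so one exception cannot stop delivery to the rest.

```python
# Hypothetical plugins used to exercise the isolation boundary.
class CrashingPlugin:
    def on_event(self, event):
        raise RuntimeError("boom")

class RecordingPlugin:
    def __init__(self):
        self.events = []
    def on_event(self, event):
        self.events.append(event)

def dispatch(plugins, event):
    """Deliver event to every plugin, isolating per-plugin failures."""
    errors = []
    for p in plugins:
        try:
            p.on_event(event)
        except Exception as exc:  # fault isolation boundary
            errors.append((p, exc))
    return errors

recorder = RecordingPlugin()
errors = dispatch([CrashingPlugin(), recorder], {"type": "track_update"})
assert recorder.events      # the crash did not block delivery
assert len(errors) == 1     # but it was captured, not swallowed
```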
DDIL Simulation
Location: evaluation/ddil/test_ddil_scenarios.py
Network Impairment Profiles
| Profile | Bandwidth | Latency | Packet Loss | Description |
|---|---|---|---|---|
| full | Unlimited | < 5 ms | 0% | Full connectivity baseline |
| degraded | 256 kbps | 200 ms | 2% | Degraded tactical link |
| intermittent | 64 kbps | 500 ms | 15% | Intermittent SATCOM |
| near-offline | 8 kbps | 2000 ms | 40% | Near-disconnected |
| offline | 0 | N/A | 100% | Fully disconnected |
Network impairment is applied using Linux tc netem (traffic control) in containerized test environments.
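One way to map the profiles above onto tc netem arguments is sketched below. The profile values come from the table; the helper itself is illustrative — the real harness may invoke tc differently (e.g. inside a container network namespace), and the interface name is an assumption.

```python
# Impairment profiles from the table above, expressed as netem options.
PROFILES = {
    "degraded":     {"rate": "256kbit", "delay": "200ms",  "loss": "2%"},
    "intermittent": {"rate": "64kbit",  "delay": "500ms",  "loss": "15%"},
    "near-offline": {"rate": "8kbit",   "delay": "2000ms", "loss": "40%"},
    "offline":      {"loss": "100%"},
}

def netem_cmd(profile: str, dev: str = "eth0") -> list[str]:
    """Build the tc command for a named impairment profile."""
    p = PROFILES[profile]
    cmd = ["tc", "qdisc", "add", "dev", dev, "root", "netem"]
    if "delay" in p:
        cmd += ["delay", p["delay"]]
    if "loss" in p:
        cmd += ["loss", p["loss"]]
    if "rate" in p:
        cmd += ["rate", p["rate"]]
    return cmd

print(" ".join(netem_cmd("degraded")))
# tc qdisc add dev eth0 root netem delay 200ms loss 2% rate 256kbit
```

Running the resulting command requires root (or CAP_NET_ADMIN), which is why these scenarios run only in containerized test environments.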
DDIL Test Scenarios
| Scenario | What It Tests | Docker Required |
|---|---|---|
| test_offline_mission_package_loads() | Mission package structure loads without network | No |
| test_staleness_detection() | Entity older than TTL is marked stale | No |
| test_fresh_entity_not_stale() | Recently received entity is not marked stale | No |
| test_network_impairment_degraded() | Operation at 256 kbps / 200 ms / 2% loss | Yes |
| test_link_recovery_delta_sync() | Delta sync after link recovery produces correct state | Yes |
Red-Team & Fault Injection
Location: evaluation/red-team/test_malformed_inputs.py
Adversarial tests that probe the system’s boundaries:
Malformed Input Tests
| Test | Input | Expected Behavior |
|---|---|---|
| test_cot_invalid_xml_rejected() | Non-XML string | Validation rejects with error |
| test_cot_wrong_root_element_rejected() | <data> instead of <event> | Wrong root element caught |
| test_cot_missing_point_rejected() | CoT XML without <point> | Missing point child rejected |
| test_geojson_wrong_type_rejected() | Single Feature (not FeatureCollection) | Type validation fails |
| test_geojson_invalid_json_string_rejected() | Malformed JSON string | Parse error caught |
| test_geojson_out_of_bounds_coordinates() | lat > 90, lon > 180 | Pydantic constraints fire; adapter skips defensively |
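The out-of-bounds case in the last row can be sketched without dependencies. The real adapter enforces the bounds via Pydantic field constraints (e.g. `Field(ge=-90, le=90)` on latitude); this dependency-free version shows the same defensive-skip behavior — an invalid feature is set aside rather than crashing the whole ingest batch. The helper names are illustrative:

```python
def valid_coords(lon: float, lat: float) -> bool:
    """WGS-84 bounds check: lon in [-180, 180], lat in [-90, 90]."""
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0

def ingest_features(features: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split features into accepted and defensively skipped."""
    accepted, skipped = [], []
    for f in features:
        lon, lat = f["geometry"]["coordinates"][:2]
        (accepted if valid_coords(lon, lat) else skipped).append(f)
    return accepted, skipped

good = {"geometry": {"coordinates": [10.0, 45.0]}}
bad = {"geometry": {"coordinates": [200.0, 95.0]}}  # lon > 180, lat > 90
accepted, skipped = ingest_features([good, bad])
assert accepted == [good] and skipped == [bad]
```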
Fault Categories
| Category | Examples |
|---|---|
| Malformed inputs | Invalid XML, truncated CoT, out-of-bounds coordinates |
| Data poisoning | Spoofed track positions, forged threat overlays |
| Resource exhaustion | High-rate message injection, large payload flooding |
| Protocol abuse | Replay attacks, out-of-sequence messages |
| Schema drift | Future-version schemas the current adapter has not seen |
| AI adversarial | Adversarial prompts to AI summarization services |
Human-in-the-Loop (HITL) Review
Review Gate Triggers
- AI-generated route recommendations
- Threat summarization outputs
- Conflict resolution suggestions from the CAL
- Any AI action with mission-level consequence
HITL Workflow
```
AI Service → [Generate Recommendation] → [HITL Queue]
                                             │
                              Reviewer receives recommendation
                              with provenance, confidence, and
                              supporting evidence
                                             │
                                   ┌─────────┴───────────┐
                                   ▼                     ▼
                              [Approve]          [Reject / Modify]
                                   │                     │
                            Action proceeds       Action blocked;
                            with HITL stamp       feedback logged
```
Scenario Replay
Operational scenarios can be recorded and replayed deterministically:
```json
{
  "scenario_id": "alpha-01",
  "description": "Single aircraft route with one NOTAM and two threat updates",
  "events": [
    {"t": 0, "type": "track_update", "source": "cot", "payload": "..."},
    {"t": 5, "type": "notam_received", "source": "notam", "payload": "..."},
    {"t": 12, "type": "threat_update", "source": "cot", "payload": "..."}
  ],
  "expected_state": { "...": "..." },
  "pass_criteria": ["track_visible", "notam_overlay_active", "threat_corridor_rendered"]
}
```
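A minimal replay runner for this format might look like the following. The `apply_event` callback is a placeholder; the real harness would dispatch each event through the Mission Engine and then check the scenario's `pass_criteria` against the resulting state.

```python
def replay(scenario: dict, apply_event) -> dict:
    """Apply recorded events in timestamp order; return the final state."""
    state = {}
    for event in sorted(scenario["events"], key=lambda e: e["t"]):
        apply_event(state, event)
    return state

# Events deliberately out of order to show deterministic t-ordering.
scenario = {
    "scenario_id": "alpha-01",
    "events": [
        {"t": 5, "type": "notam_received"},
        {"t": 0, "type": "track_update"},
    ],
}

order = []
replay(scenario, lambda state, e: order.append(e["type"]))
assert order == ["track_update", "notam_received"]
```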
Pass/Fail Criteria
| Criterion | Threshold |
|---|---|
| No critical mission workflow breaks | 0 failures in engine, CAL, map harnesses |
| Stable offline operation | All offline DDIL scenarios pass within resource budget |
| Data provenance preserved | 100% of canonical entities carry complete provenance chain |
| UI usable under constrained conditions | Render time SLO met in degraded and intermittent profiles |
| No unapproved AI action | 0 AI actions executed without HITL approval in governance-gated scenarios |
| Logging and telemetry intact | 100% of auditable events produce a log entry |
Running the Evaluation Suite
```bash
cd evaluation
pip install -r requirements.txt

# Run adapter and engine harnesses (no Docker required)
PYTHONPATH=.. pytest harnesses/ -v

# Run DDIL scenarios (requires Docker for network shim)
PYTHONPATH=.. pytest ddil/ -v --docker

# Run red-team tests (isolated environment)
PYTHONPATH=.. pytest red-team/ -v --isolated

# Run all tests except Docker-dependent ones (from the repository root)
cd ..
PYTHONPATH=. pytest engine/core/ examples/ evaluation/ -v -k "not docker"
```