DIU OMEN Solicitation License

Evaluation Framework

Deterministic test harnesses, DDIL simulation, red-team fault injection, HITL review, and the 10 evaluation dimensions


OMEN evaluation is inspired by the deterministic, multi-dimensional assessment methodology of DaScient ARES-E. The core principle: mission software must be continuously and verifiably trustworthy, not just tested once before release.

10 Evaluation Dimensions

| # | Dimension | Key Question |
|---|-----------|--------------|
| 1 | Functional Correctness | Does the system do what it is supposed to do? |
| 2 | Mission Utility | Does it improve aircrew decision-making under operational conditions? |
| 3 | Performance | Does it meet startup, render, and response-time targets? |
| 4 | DDIL Resilience | Does it remain useful when connectivity is degraded or absent? |
| 5 | Usability | Can aircrew operate it under realistic cognitive load? |
| 6 | Security | Does it resist adversarial inputs, unauthorized access, and tampering? |
| 7 | Interoperability | Does it correctly ingest and emit data across all supported protocols? |
| 8 | Maintainability | Can it be updated, debugged, and extended by a small team? |
| 9 | Auditability | Can every action, decision, and data transformation be traced? |
| 10 | Energy Efficiency | Does it stay within compute and thermal budgets on constrained hardware? |

Test Harnesses

Location: evaluation/harnesses/

Deterministic, reproducible test harnesses run with fixed random seeds and synthetic data only.
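Determinism in practice means every source of randomness is seeded explicitly. A minimal sketch of the idea (the helper name and seed value are illustrative, not part of the harness API):

```python
import random

def make_rng(seed: int = 1337) -> random.Random:
    # An isolated RNG instance: seeding it (rather than the global
    # random module) keeps harness runs bit-for-bit reproducible.
    return random.Random(seed)

# Two runs with the same seed produce identical synthetic data.
rng_a = make_rng()
rng_b = make_rng()
run_a = [rng_a.randint(0, 100) for _ in range(3)]
run_b = [rng_b.randint(0, 100) for _ in range(3)]
```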

Adapter Harness — evaluation/harnesses/adapter_harness.py

The AdapterHarness validates any BaseAdapter implementation against the adapter contract:

| Test | What It Checks |
|------|----------------|
| _test_adapter_id() | Attribute exists and is a non-empty string |
| _test_supported_formats() | Returns a non-empty list of format strings |
| _test_health_returns_data() | health() returns a valid AdapterHealth object |
| _test_schema_returns_dict() | schema() returns a dictionary (JSON Schema) |
| _test_validate_valid_samples() | Valid test inputs pass validation |
| _test_validate_invalid_samples() | Invalid test inputs are rejected with error details |
| _test_ingest_valid_samples() | ingest() returns a list of canonical entities |
| _test_provenance_populated() | Every ingested entity carries a complete provenance chain |

Output: A ComplianceReport with adapter_id, pass/fail verdict, passed/failed test counts, failure details, and duration in milliseconds.

```python
from evaluation.harnesses.adapter_harness import AdapterHarness

harness = AdapterHarness(
    adapter=my_adapter,
    valid_samples=[valid_geojson],
    invalid_samples=[invalid_geojson],
)
report = harness.run_compliance_tests()
print(f"Passed: {report.passed} ({report.passed_tests}/{report.total_tests})")
```

Engine Harness — evaluation/harnesses/engine_harness.py

The TestEngineHarness validates the Mission Engine’s core runtime:

| Test | What It Checks |
|------|----------------|
| test_plugin_start_stop_lifecycle() | Register → start → verify running → stop → verify stopped |
| test_event_dispatch_reaches_plugin() | Events dispatched to started plugins are received |
| test_fault_isolation() | A crashing plugin does not block other plugins from receiving events |
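The fault-isolation property can be illustrated with a toy dispatcher (MiniDispatcher is a hypothetical stand-in, not the actual Mission Engine API): a handler that raises must not prevent later handlers from seeing the event.

```python
from typing import Any, Callable

class MiniDispatcher:
    # Hypothetical stand-in for the engine's event bus, shown only to
    # illustrate fault isolation between plugins.
    def __init__(self) -> None:
        self.handlers: list[Callable[[Any], None]] = []
        self.errors: list[Exception] = []

    def dispatch(self, event: Any) -> None:
        for handler in self.handlers:
            try:
                handler(event)
            except Exception as exc:
                # Contain the crash; remaining handlers still run.
                self.errors.append(exc)

def crashing_plugin(event: Any) -> None:
    raise RuntimeError("plugin crashed")

received: list[Any] = []
bus = MiniDispatcher()
bus.handlers.append(crashing_plugin)
bus.handlers.append(received.append)
bus.dispatch("tick")
```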

DDIL Simulation

Location: evaluation/ddil/test_ddil_scenarios.py

Network Impairment Profiles

| Profile | Bandwidth | Latency | Packet Loss | Description |
|---------|-----------|---------|-------------|-------------|
| full | Unlimited | < 5 ms | 0% | Full-connectivity baseline |
| degraded | 256 kbps | 200 ms | 2% | Degraded tactical link |
| intermittent | 64 kbps | 500 ms | 15% | Intermittent SATCOM |
| near-offline | 8 kbps | 2000 ms | 40% | Near-disconnected |
| offline | 0 | N/A | 100% | Fully disconnected |

Network impairment is applied using Linux tc netem (traffic control) in containerized test environments.
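A sketch of how a profile might be translated into a tc netem invocation (the exact command shape used by the containerized shim is an assumption here; netem's delay, loss, and rate options are standard):

```python
# Impairment profiles from the table above, expressed as netem arguments.
PROFILES = {
    "degraded":     {"rate": "256kbit", "delay": "200ms",  "loss": "2%"},
    "intermittent": {"rate": "64kbit",  "delay": "500ms",  "loss": "15%"},
    "near-offline": {"rate": "8kbit",   "delay": "2000ms", "loss": "40%"},
}

def netem_cmd(profile: str, iface: str = "eth0") -> list[str]:
    # Build the tc command; inside a test container this would be run
    # via subprocess (and requires root privileges).
    p = PROFILES[profile]
    return ["tc", "qdisc", "add", "dev", iface, "root", "netem",
            "delay", p["delay"], "loss", p["loss"], "rate", p["rate"]]
```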

DDIL Test Scenarios

| Scenario | Tests | Docker Required |
|----------|-------|-----------------|
| test_offline_mission_package_loads() | Mission package structure loads without network | No |
| test_staleness_detection() | An entity older than its TTL is marked stale | No |
| test_fresh_entity_not_stale() | A recently received entity is not marked stale | No |
| test_network_impairment_degraded() | Operation at 256 kbps / 200 ms / 2% loss | Yes |
| test_link_recovery_delta_sync() | Delta sync after link recovery produces correct state | Yes |
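The staleness check at the heart of test_staleness_detection() reduces to a simple age-versus-TTL comparison. A sketch (Entity, is_stale, and the 300 s TTL are illustrative, not the actual OMEN types):

```python
from dataclasses import dataclass

@dataclass
class Entity:
    uid: str
    received_at: float  # epoch seconds when the entity was last received

def is_stale(entity: Entity, now: float, ttl_seconds: float = 300.0) -> bool:
    # An entity is stale once its age exceeds the configured TTL.
    return (now - entity.received_at) > ttl_seconds

old_track = Entity("track-1", received_at=0.0)
fresh_track = Entity("track-2", received_at=900.0)
```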

Red-Team & Fault Injection

Location: evaluation/red-team/test_malformed_inputs.py

Adversarial tests that probe the system’s boundaries:

Malformed Input Tests

| Test | Input | Expected Behavior |
|------|-------|-------------------|
| test_cot_invalid_xml_rejected() | Non-XML string | Validation rejects with an error |
| test_cot_wrong_root_element_rejected() | <data> instead of <event> | Wrong root element caught |
| test_cot_missing_point_rejected() | CoT XML without <point> | Missing <point> child rejected |
| test_geojson_wrong_type_rejected() | Single Feature (not a FeatureCollection) | Type validation fails |
| test_geojson_invalid_json_string_rejected() | Malformed JSON string | Parse error caught |
| test_geojson_out_of_bounds_coordinates() | lat > 90, lon > 180 | Pydantic constraints fire; adapter skips the record defensively |
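The out-of-bounds coordinate case boils down to range-checking GeoJSON positions before they reach the canonical model. A defensive sketch (the function name is illustrative; the actual adapter enforces this through Pydantic constraints):

```python
def valid_position(coords) -> bool:
    # GeoJSON positions are [lon, lat]; reject anything that cannot be
    # parsed or that falls outside WGS84 bounds.
    try:
        lon, lat = float(coords[0]), float(coords[1])
    except (TypeError, ValueError, IndexError):
        return False
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0
```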

Fault Categories

| Category | Examples |
|----------|----------|
| Malformed inputs | Invalid XML, truncated CoT, out-of-bounds coordinates |
| Data poisoning | Spoofed track positions, forged threat overlays |
| Resource exhaustion | High-rate message injection, large-payload flooding |
| Protocol abuse | Replay attacks, out-of-sequence messages |
| Schema drift | Future-version schemas the current adapter has not seen |
| AI adversarial | Adversarial prompts to AI summarization services |
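For the protocol-abuse category, a per-source monotonic sequence check is a cheap first line of defence against replays and reordering (SequenceGuard is a hypothetical sketch, not an OMEN component):

```python
class SequenceGuard:
    def __init__(self) -> None:
        self.last_seen: dict[str, int] = {}

    def accept(self, source: str, seq: int) -> bool:
        # Reject replayed or out-of-sequence messages: the sequence
        # number must be strictly increasing per source.
        if seq <= self.last_seen.get(source, -1):
            return False
        self.last_seen[source] = seq
        return True
```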

Human-in-the-Loop (HITL) Review

Review Gate Triggers

  • AI-generated route recommendations
  • Threat summarization outputs
  • Conflict resolution suggestions from the CAL
  • Any AI action with mission-level consequence

HITL Workflow

```
AI Service → [Generate Recommendation] → [HITL Queue]
                                              │
                                    Reviewer receives recommendation
                                    with provenance, confidence, and
                                    supporting evidence
                                              │
                                    ┌─────────┴───────────┐
                                    ▼                     ▼
                               [Approve]             [Reject / Modify]
                                    │                     │
                            Action proceeds         Action blocked;
                            with HITL stamp          feedback logged
```
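The gate above can be modelled as a pending/approved/rejected state machine in which nothing executes without an explicit reviewer stamp. A sketch (these names are illustrative, not the actual HITL service API):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Verdict(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class Recommendation:
    rec_id: str
    summary: str
    confidence: float
    verdict: Verdict = Verdict.PENDING
    reviewer: Optional[str] = None

def review(rec: Recommendation, reviewer: str, approve: bool) -> None:
    # The reviewer's decision and identity are recorded together,
    # giving every approved action its HITL stamp.
    rec.verdict = Verdict.APPROVED if approve else Verdict.REJECTED
    rec.reviewer = reviewer

def may_execute(rec: Recommendation) -> bool:
    # Fail closed: only an explicit approval with a named reviewer passes.
    return rec.verdict is Verdict.APPROVED and rec.reviewer is not None
```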

Scenario Replay

Operational scenarios can be recorded and replayed deterministically:

```json
{
  "scenario_id": "alpha-01",
  "description": "Single aircraft route with one NOTAM and two threat updates",
  "events": [
    {"t": 0, "type": "track_update", "source": "cot", "payload": "..."},
    {"t": 5, "type": "notam_received", "source": "notam", "payload": "..."},
    {"t": 12, "type": "threat_update", "source": "cot", "payload": "..."}
  ],
  "expected_state": { "..." : "..." },
  "pass_criteria": ["track_visible", "notam_overlay_active", "threat_corridor_rendered"]
}
```
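Deterministic replay then amounts to applying the recorded events in timestamp order against a clean state and comparing the result with expected_state. A minimal sketch (replay and the event-application callback are illustrative, not the actual replay engine):

```python
def replay(scenario: dict, apply_event) -> dict:
    # Start from an empty state and apply events in timestamp order;
    # with deterministic handlers, the same scenario always yields
    # the same final state.
    state: dict = {}
    for event in sorted(scenario["events"], key=lambda e: e["t"]):
        apply_event(state, event)
    return state

def count_event_types(state: dict, event: dict) -> None:
    # Trivial handler for illustration: tally events by type.
    state[event["type"]] = state.get(event["type"], 0) + 1

scenario = {"events": [
    {"t": 5,  "type": "notam_received"},
    {"t": 0,  "type": "track_update"},
    {"t": 12, "type": "threat_update"},
]}
final_state = replay(scenario, count_event_types)
```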

Pass/Fail Criteria

| Criterion | Threshold |
|-----------|-----------|
| No critical mission workflow breaks | 0 failures in engine, CAL, and map harnesses |
| Stable offline operation | All offline DDIL scenarios pass within the resource budget |
| Data provenance preserved | 100% of canonical entities carry a complete provenance chain |
| UI usable under constrained conditions | Render-time SLO met in degraded and intermittent profiles |
| No unapproved AI action | 0 AI actions executed without HITL approval in governance-gated scenarios |
| Logging and telemetry intact | 100% of auditable events produce a log entry |

Running the Evaluation Suite

```bash
cd evaluation
pip install -r requirements.txt

# Run adapter and engine harnesses (no Docker required)
PYTHONPATH=.. pytest harnesses/ -v

# Run DDIL scenarios (requires Docker for the network shim)
PYTHONPATH=.. pytest ddil/ -v --docker

# Run red-team tests (isolated environment)
PYTHONPATH=.. pytest red-team/ -v --isolated

# Run all tests from the repository root (excluding Docker-dependent tests)
PYTHONPATH=. pytest engine/core/ examples/ evaluation/ -v -k "not docker"
```