AI for Control Systems¶

Introduction¶

Large-scale infrastructures—from scientific observatories and particle accelerators to data centers, energy grids, and public transport networks—depend on control rooms that coordinate heterogeneous subsystems under strict safety and availability constraints. Operators in these environments must synthesize information from distributed monitoring tools, configuration databases, operating manuals, and institutional knowledge held by domain experts, often under severe time pressure during incidents.

In practice, the knowledge required for fast and correct decision-making is fragmented across systems, people, and formats. Documentation is frequently outdated or incomplete, critical inter-system dependencies (“hidden couplings”) are poorly mapped, and escalation procedures rely on scarce expert availability [1]. When incidents occur, organizations resort to large-scale coordination efforts—crisis conferences with dozens of participants, lasting hours to days—because no single person or tool can assemble the full picture quickly enough. This coordination overhead directly translates into extended downtimes, elevated operational risk, and, in regulated sectors, compliance exposure.

At the same time, recent advances in large language models (LLMs) [2] and agentic AI systems have demonstrated impressive capabilities in natural-language understanding, code generation, and multi-step reasoning. However, deploying general-purpose LLMs in safety-critical control rooms raises fundamental concerns: their outputs are probabilistic, difficult to audit, and prone to confabulation [3]. Emerging regulation—notably the EU AI Act [4] and the NIS 2 Directive [5]—increasingly requires traceability, human oversight, and auditable decision paths for AI systems used in critical infrastructure. This creates a clear gap: operators need AI-assisted decision support that is fast and context-aware, but also verifiable, auditable, and deployable on premise.

We address this gap by developing an AI-based assistance system that consolidates operational knowledge from manuals, experiential expertise, and real-time data, and makes it accessible through a natural-language interface. The system is built on a locally operated multi-agent architecture with a verifiable workflow: fixed feedback loops ensure that every input and output remains traceable and reproducible. By translating probabilistic model outputs into comprehensible, validated action steps—checked against deterministic rules and, where available, a digital-twin sandbox—the system bridges the gap between modern AI capabilities and the stringent requirements of safety-critical environments.

The development and initial testing take place in the context of the Cherenkov Telescope Array Observatory (CTAO) [6] and the Gammapy analysis framework [7]. These scientific control-room environments share the key characteristics of industrial counterparts—distributed data sources, complex system dependencies, heterogeneous software stacks, and the need for auditable operations—while providing a representative pilot setting with full data access.

Operational challenges across industries¶

To validate and quantify the challenges described above beyond the scientific domain, we conducted a structured field study in the first quarter of 2026 covering thirteen industry sectors, the most important and relevant among them being: research and science, industrial automation and manufacturing, public infrastructure, aerospace and railway, facility management and real estate.

The findings were grouped into recurring cross-cutting themes with inductive category formation. We summarize the most prominent ones below.

Fragmented knowledge blocks fast, correct decisions¶

Across the majority of sectors, practitioners reported that the knowledge required for time-sensitive operational decisions is scattered across systems, people, and formats. Under pressure, this fragmentation becomes a coordination bottleneck. In enterprise IT, incident resolution escalates into crisis conferences (“war rooms”) with dozens of participants; time-to-fix ranges from hours to days, and major outages can carry damage in the range of hundreds of millions of euros. In aerospace, where documentation is extensive and spans decades, the dependency on IT-based documentation systems is absolute: when those systems fail, physical maintenance work halts even when the repair itself is complete. Across sectors, the core pain is consistent: the information needed to act exists somewhere, but cannot be found, trusted, or assembled fast enough.

Trust, verification, and determinism for AI¶

A recurring concern across sectors is that AI outputs in high-stakes environments must be verifiable, auditable, and bounded. Practitioners consistently ask not for more powerful AI but for more trustworthy AI. In industrial automation, verification of AI recommendations is reported as “not handled to full satisfaction yet”; in traffic infrastructure, the question of “whether AI is allowed to act autonomously” remains open; in rail, domain experts report that current AI tools give “very often completely wrong answers” on safety-management and certification questions. In aerospace, the industry’s ability to assess fleet-wide impact within hours after incidents—enabled by standardized documentation and safety-driven governance—serves as a gold standard for audit trails that AI systems should match. Notably, current foundation models are already powerful enough for many operational assistance tasks: they offer the speed, breadth of knowledge access, and ad-hoc attention capacity that human teams cannot match under time pressure. The adoption barrier is therefore not model capability but the absence of verification layers that make these capabilities safe to use in regulated environments.

Dependency mapping and digital ground truth¶

Operational systems have complex, evolving dependencies (infrastructure to services, machines to processes, systems to systems) that are poorly mapped or maintained. When dependency information is wrong or missing, alarms misroute, simulations fail, and changes cause cascading failures. In enterprise IT, configuration management databases are often described as poorly maintained, with outdated entries, and it is noted that “without high data quality, simulations are ineffective.” In automotive embedded systems, the organizational split between software and integration teams creates hard information boundaries: downstream teams can find signal interfaces but not the functional specifications explaining why or when upstream components generate signals—“the interface can be found; the functional relationship cannot.”

Simulation and preflight validation¶

Multiple sectors expressed the desire to test proposed changes or interventions against a realistic model before real-world execution. A representative example involved a fibre-optic cable cut where the redundant backup line had been incorrectly dimensioned, resulting in a complete two-day outage—a failure that simulation combined with continuously updated dependency documentation would have caught. Some organizations already operate physics-based digital twins that validate optimization strategies before writing control setpoints. At the same time, full simulation-based digital twins “do not scale economically” across many assets, pointing to the need for lightweight validation approaches.

On-premise deployment and regulatory constraints¶

Mission-critical infrastructure environments require on-premise or hybrid deployment behind strict firewalls and VPN boundaries, subject to regulatory frameworks such as KRITIS baseline protection [8] and NIS 2. Some operators maintain internal AI specifically to avoid exposing company data to public tools. At the same time, on-premise deployment introduces its own friction: for industrial integrators, the logistics of access across customer environments—months of daily logins, different remote-access methods, expiring accounts, and repeated IT requests—represent a significant operational burden. This tension between the necessity of on-premise deployment and its operational cost underlines the need for solutions that are designed from the ground up for constrained-connectivity environments.

Summary¶

The evidence confirms that the core challenge across all sectors is not a lack of AI capability but a lack of trustworthy, accessible ground truth: the knowledge needed to act exists somewhere but cannot be found, trusted, or assembled fast enough under time pressure. Current foundation models are already sufficiently powerful for the retrieval, reasoning, and generation tasks required in operational assistance—they provide the speed and breadth of knowledge access that human coordination cannot match. However, if the underlying documentation is incomplete, outdated, or structurally inconsistent, any AI system built on top of it—whether for retrieval-augmented generation or automated verification—will inherit and amplify those errors. Documentation quality is therefore the primary gating constraint for any downstream AI capability. At the same time, verification—not raw model performance—is the key differentiator that would enable deployment in regulated environments: practitioners need to confirm that AI recommendations are grounded in correct sources before acting on them. These findings directly inform the design requirements of the proposed solution described in the following section.

Towards solution¶

Based on the identified operational challenges, we propose a verifiable, on-premise AI copilot for control rooms and engineering teams in regulated, mission-critical infrastructure. The system reduces incident-resolution and change-approval time by grounding AI-assisted recommendations in operational documentation and live signals, surfacing dependency impacts (“hidden couplings”), and preflighting every proposed action in a deterministic sandbox before execution. The architecture is organized around three functional modules described below.

Ingest & Ground¶

A retrieval-augmented generation (RAG) layer [3] indexes operational documentation, namely procedures, configuration databases, monitoring logs, and engineering specifications, into a continuously updated knowledge base. Incoming operator queries are enriched with the top-$k$ relevant context snippets before being forwarded to the reasoning layer. Beyond basic vector-similarity retrieval, the layer can incorporate graph-based and hybrid retrieval strategies to capture structural relationships between documents and assets. Open-weight foundation models (e.g., Qwen [9], GPT-OSS [10]) ensure that the system can run entirely on premise, within firewalled environments, as required by NIS 2 and KRITIS regulations.

Validate & Sandbox¶

Every AI-generated recommendation—whether a procedure, configuration change, code artifact, or analysis result—passes through a validation pipeline before it is presented to the operator. Structured-output constraints and deterministic rule checks enforce format and policy compliance. Where available, proposed actions are executed in a digital-twin preflight environment (containerized sandbox) that mirrors the target system, catching dependency violations, misconfigured parameters, and cascading side effects before they reach production. All validation outcomes are recorded in an append-only audit log.

Assist & Approve¶

Validated recommendations are surfaced to the operator through a natural-language interface that presents the proposed action, the evidence trail, and the sandbox results. Human-approval gates ensure that no action is executed without explicit operator consent, maintaining the human-in-the-loop oversight mandated by the EU AI Act [4] for high-risk AI systems. Approval decisions, together with operator feedback, are logged and can be used for future model adaptation via reinforcement learning from human feedback once the verification layer is sufficiently mature.

Summary¶

This three-module design directly addresses the five operational pain points: it consolidates fragmented knowledge, provides verifiable and auditable outputs, maps and checks dependencies before execution (digital-twin preflight), enables lightweight simulation without requiring full-scale digital twins, and operates entirely on premise. Figure 1 contrasts the current situation with the proposed solution.

Operational workflow before and after introducing the verifiable AIcopilot. Top: current situation with manual triage, fragmentedknowledge, and compliance exposure. Bottom: proposed solution withretrieval-augmented context assembly, deterministic preflightvalidation, and an auditable decisiontrail.

Figure 1. Operational workflow before and after introducing the verifiable AI copilot. Top: current situation with manual triage, fragmented knowledge, and compliance exposure. Bottom: proposed solution with retrieval-augmented context assembly, deterministic preflight validation, and an auditable decision trail.

Existing prototypes¶

The proposed architecture is being implemented in two working prototypes developed and tested in the context of ground-based gamma-ray astronomy: (i) agent-based code generation for the Gammapy data-analysis framework [7, 1, 11], and (ii) LLM-assisted engineering of data models for the CTAO Array Control and Data Acquisition software (ACADA) [6, 12]. Both prototypes implement the core principle of the proposed solution—a validation-first, multi-agent workflow with fixed feedback loops that ensures traceable and reproducible outputs.

Architecture¶

The system follows a narrow-waist design in which local components handle file ingestion, configuration, and execution, while LLM-based agents focus on a single, well-defined deliverable—either executable analysis code or a strongly typed data model. The core loop operates as follows:

Context assembly. A system prompt encodes non-negotiable rules (e.g., “return exactly one complete Python script”; “import all dependencies”; “do not call interactive functions”). Domain-specific context is injected via retrieval-augmented generation (RAG) [3]: selected tutorials and documentation are indexed into a vector store and the top-$k$ relevant snippets are appended to the user prompt.
Generation. The model produces a single artifact—a Python script or a Pydantic [13] model class—via a constrained tool call that strips commentary and enforces the expected format.
Execution and validation. The artifact is executed in a sandboxed environment with controlled filesystem access, no network connectivity, and hard time limits. Validation checks include process exit code, abstract syntax tree (AST) parsing, structural requirements (e.g., presence of a BaseModel subclass), and optional domain-specific numerical checks with explicit tolerances.
Iterative repair. On failure, a compact error summary (traceback tail, missing imports, structural violations) is fed back to a dedicated repair agent. The loop repeats until validation succeeds or a configurable attempt budget is exhausted.

All iterations are persisted: prompts, message logs, generated artifacts, stdout/stderr, and validation outcomes are stored per run, creating an audit trail that makes successful generations reusable and failures diagnosable.

Gammapy agent¶

The Gammapy agent [11] targets everyday gamma-ray analysis tasks: observation selection, spectral extraction with reflected regions, three-dimensional binned analyses, and quick-look sky maps. The agent operates on H.E.S.S. DL3 public test data [14] mounted at a configured path and exposed via an environment variable. A minimal web interface (Streamlit) and a command-line interface provide access to the same core.

A benchmarking harness executes generated scripts in isolated environments, records iteration traces (including token counts separated into input, cached, output, and reasoning tokens), and applies task-specific validators. On per-source tasks (observation listing, reflected-region significance, spectral extraction), state-of-the-art reasoning models reached 100% pass rates; the most recent models showed faster convergence in terms of attempts to pass.

CTAgent for ACADA¶

CTAgent [12] addresses a different but structurally analogous problem: synthesizing executable Pydantic data models from heterogeneous CTAO specification documents (plain text, JSON, LaTeX, PDF, DOCX). A FileIngestor normalizes inputs to UTF-8 text; per-role expert agents are instantiated with AutoGen [15] and coordinated by an orchestrator that validates outputs via AST parsing and structural checks. A dedicated CodeImprover agent receives targeted error descriptions and returns corrected code. In initial tests with telescope-structure specification documents, the agent extracted naming conventions from tutorials, generated mock values conforming to the schema, and produced data that could be uploaded to the CTAO configuration database without manual correction.

Open-weight deployment¶

Both agents are designed to be backend-agnostic. In addition to proprietary cloud APIs, we operate and evaluate open-weight models—specifically Qwen [9] and OpenAI’s GPT-OSS series [10]—on the Helmholtz Blablador platform, which exposes an OpenAI-compatible API and allows researchers to run models locally under institutional control [16]. This on-premise capability is essential for the deployment scenarios identified in the field study, where data sovereignty and regulatory compliance preclude the use of cloud-hosted models.

From scientific prototype to infrastructure assistance¶

The two prototypes demonstrate the core principles that transfer to the broader infrastructure-assistance use case:

Verifiable outputs: every generated artifact is executed, validated, and logged—translating probabilistic model outputs into deterministic, auditable results.
Iterative repair with feedback: compact error summaries drive targeted corrections, reducing the reliance on model accuracy alone and improving convergence.
Domain grounding via RAG: retrieval of up-to-date documentation mitigates the staleness and confabulation problems inherent to foundation models.
On-premise readiness: open-weight model support enables deployment within firewalled, regulated environments.

The transition from scientific analysis agents to control-room assistance requires extending the ingestion layer to operational data sources (monitoring, Supervisory Control and Data Acquisition (SCADA) systems, configuration databases, and project-management platforms), broadening the validation layer to include domain-specific safety checks and digital-twin preflight simulations, and introducing human-approval gates aligned with operational governance.

Conclusion and outlook¶

We have presented an AI-based assistance system for control rooms in large-scale infrastructures, grounded in three complementary contributions: operational evidence gathered across several industry sectors, a proposed solution architecture based on verifiable, on-premise AI copilot modules, and working prototypes validated in scientific observatory environments.

The operational evidence confirms that the core challenge is not a lack of AI capability but a lack of trustworthy, accessible ground truth: knowledge fragmentation, stale documentation, and unmapped dependencies are universal pain points that were confirmed in 79% of interviews across 12 of the 13 sectors surveyed. Current foundation models are already powerful enough for the retrieval, reasoning, and generation tasks involved—they provide the speed and breadth of knowledge access that human coordination cannot match under time pressure. However, if the documentation these models draw on is incomplete or incorrect, retrieval-augmented generation will surface wrong context and verification will validate against wrong assumptions. Documentation quality is therefore the primary gating constraint, and verification—not model power—is the key differentiator for deployment in regulated environments. These findings directly shaped the proposed three-module architecture (Ingest & Ground, Validate & Sandbox, Assist & Approve), which prioritizes validation-first workflows, full auditability, and on-premise deployment over raw model performance.

The existing prototypes—comprising agents for code generation (Gammapy) and data-model synthesis (CTAgent/ACADA)—demonstrate that domain-aware, multi-agent systems with fixed feedback loops can already produce reliable, traceable outputs for non-trivial engineering tasks. Open-weight model support on the Helmholtz Blablador platform ensures that the system can operate within the constrained-connectivity, data-sovereign environments identified as a prerequisite for adoption.

Future work will proceed along three axes. First, we will extend the prototype to ingest operational data sources (monitoring telemetry, SCADA signals, configuration databases) and integrate digital-twin preflight validation for proposed actions. Second, we will deploy the system in a live pilot using the telescope infrastructure as a realistic testbed with full on-premise data access. Third, we will explore reinforcement-learning-based adaptation from operator feedback and curated approval traces, once the automated verification layer is sufficiently mature, to improve recommendation quality over time while keeping outputs grounded via retrieval-augmented generation.

The longer-term vision is an industry-agnostic platform for verifiable AI-assisted operations: a system that synthesizes relevant information, validates proposed actions in a sandbox, and produces an auditable trail—enabling organizations to deploy AI where it is currently forbidden due to liability, safety, and compliance constraints. A major challenge on this path is isolating the reusable, productizable modules that form the truly industry-agnostic foundation from the domain-specific connectors and validation rules that must be adapted per sector.

References¶

[1] Kostunin, Dmitriy et al., “AI Agents for Ground-Based Gamma Astronomy”, arXiv preprint arXiv:2503.00821, 2025.

[2] Wayne Xin Zhao and others, “A Survey of Large Language Models”, arXiv preprint arXiv:2303.18223, 2025.

[3] Lewis, Patrick and others, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.

[4] European Parliament and Council, “Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act)”, Official Journal of the European Union, 2024.

[5] European Parliament and Council, “Directive (EU) 2022/2555 on measures for a high common level of cybersecurity (NIS2)”, Official Journal of the European Union, 2022.

[6] Oya, Igor and others, “CTAO Array Control and Data Acquisition”, Proceedings of Science (ICRC2023), 2024. doi:10.22323/1.444.0202

[7] Donath, Axel and others, “Gammapy: A Python package for gamma-ray astronomy”, Astronomy & Astrophysics, vol. 678, pp. A157, 2023. doi:10.1051/0004-6361/202346488

[8] Bundesamt fur Sicherheit in der Informationstechnik (BSI), “KRITIS-IT-Grundschutz-Profile”, 2022. https://www.bsi.bund.de/DE/Themen/Regulierte-Wirtschaft/Kritische-Infrastrukturen/Service-fuer-KRITIS-Betreiber/KRITIS-IT-Grundschutz-Profile/kritis-it-grundschutz-profile_node.html

[9] Yang, An and others, “Qwen3 Technical Report”, arXiv preprint arXiv:2505.09388, 2025.

[10] OpenAI, “GPT-OSS: Open-Source Models”, arXiv preprint arXiv:2508.10925, 2025.

[11] Kostunin, Dmitriy et al., “Agent-based code generation for the Gammapy framework”, Proceedings of Science (ICRC2025), 2025.

[12] Kostunin, Dmitriy et al., “Enhancing the development of Cherenkov Telescope Array control software with Large Language Models”, Proceedings of Science (ICRC2025), 2025.

[13] Colvin, Samuel, “Pydantic: Data validation using Python type annotations”, 2024. https://docs.pydantic.dev/

[14] H.E.S.S. Collaboration, “H.E.S.S. DL3 public test data release”, 2018. https://www.mpi-hd.mpg.de/HESS/pages/dl3-dr1/

[15] Wu, Qingyun and others, “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”, arXiv preprint arXiv:2308.08155, 2023.

[16] Curdt, Constanze et al., “Forum Helmholtz Research Data Commons: Enhancing Research Data Workflows for and with AI”, 2025. doi:10.5281/zenodo.17265958