What Infrastructure Teams Already Know About Scaling AI

There is a useful exercise that happens occasionally at technology conferences, where the people who build and maintain the systems that everything else runs on are asked to identify what actually holds things back. The answers tend to be instructive, because infrastructure teams have a particular vantage point. They see what breaks. They see what gets stuck. They see the distance between what was designed on a whiteboard and what survives contact with production.

At the India AI Impact Summit 2026, a session on "The Engines of Intelligence: Scaling AI Infrastructure and Security" conducted exactly this exercise. Rahul Bhattacharya of EY Global Delivery Services, moderating a panel with Lakshmi Das of Prophaze, Clay Hoy of Arista Networks, and Mrinal Mathur of Arista Networks, asked the audience to vote by a show of hands: what is the biggest barrier to scaling AI? Compute. Networking. Data pipelines. Security. Or organisation and operating model.

The infrastructure people chose the operating model.

This is worth sitting with. The people who build AI networks for a living, who manage GPU clusters and secure API endpoints and troubleshoot packet loss across data centres, believe that the primary obstacle to scaling AI is organisational. Who does what. Who owns the risk. How decisions get made about platforms, governance, data layers, and measurement. The technology, in their assessment, is solvable. The organisation around the technology is where deployments stall.

The Technology Is Moving Faster Than the Organisations Using It

The session provided a detailed and technically grounded tour of where AI infrastructure stands today. The picture is one of rapid capability expansion running ahead of organisational readiness.

Clay Hoy of Arista Networks described the physical reality of large-scale AI deployments. A 10,000-GPU cluster requires approximately 30,000 cables. If cable failure rates reach even 5%, effective bandwidth drops by half. The shift from single data centres to multi-data-centre architectures is happening by compulsion rather than design, driven primarily by power availability and cooling constraints. Ethernet is displacing proprietary protocols like InfiniBand, which matters because open standards reduce vendor lock-in and allow infrastructure teams to redeploy equipment across environments as needs evolve.
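
The arithmetic behind that bandwidth claim becomes intuitive with a toy model. The sketch below is an illustration rather than the panel's own calculation: it assumes a ring-style collective that runs at the speed of its slowest link, so even a small per-cable degradation rate makes it near-certain that some link in the path is impaired.

```python
import random

# Toy model, not the panel's calculation: a ring-style collective runs at the
# speed of its slowest link, so a small per-cable degradation rate produces a
# large drop in effective bandwidth. All numbers here are illustrative.

def expected_collective_bandwidth(n_links, p_degraded, degraded_factor=0.5, trials=10_000):
    total = 0.0
    for _ in range(trials):
        # each link is either healthy (1.0) or degraded (e.g. flapping optics at 0.5x)
        slowest = min(
            degraded_factor if random.random() < p_degraded else 1.0
            for _ in range(n_links)
        )
        total += slowest
    return total / trials

for p in (0.01, 0.05):
    bw = expected_collective_bandwidth(n_links=512, p_degraded=p)
    print(f"per-link degradation {p:.0%}: expected collective bandwidth ~{bw:.2f}x healthy")
```

With hundreds of links in a single collective, even a 1% degradation rate leaves almost no run unaffected, which is why small cabling problems show up as halved throughput rather than a rounding error.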

Mrinal Mathur, covering the India and SAARC market, noted that adoption in the region is slower than global benchmarks, partly because organisations are still making mistakes that other markets have already cycled through. The skills gap is significant: network engineers need to upskill for AI-specific networking, and existing data centres lack the power density that AI workloads require. States are competing with each other on incentives for data centre construction without necessarily having the power surplus to support what they are attracting.

Lakshmi Das of Prophaze brought the security perspective with operational specificity. During the 2023 state-sponsored attacks on Indian airports, Prophaze secured three of six major airports with zero downtime while competitors experienced 12- to 24-hour outages. A major financial institution, whose established international security solution could not scale, migrated to Prophaze in three hours during an active attack. After Operation Sindoor, when Pakistani hacktivist groups targeted Indian critical infrastructure, Prophaze maintained continuous protection across airports, financial institutions, and refineries.

The common thread across all three perspectives was that the technology, while complex, is advancing on known trajectories. Ethernet is getting faster. GPU clusters are getting larger. Security models are evolving from signature matching to behavioural and anomaly detection using ML. The problems that remain unsolved are not in the machinery. They are in the organisations that deploy it.

Security Cannot Be a Gate at the End

Lakshmi Das made a point that resonated well beyond cybersecurity. Security, she argued, is always treated as an afterthought. Organisations move fast to build AI systems, fast to pilot them, fast to demonstrate capability. Security gets added at the end, as a gate before production. And then it fails, because security retrofitted onto a system designed without it is fundamentally fragile.

Her framing was architectural: security should be thought of "like a ship," where every structural element is designed for integrity from the initial stage. You cannot add watertight compartments after the hull is built. The same principle applies to AI systems, where the attack surface grows exponentially with scale. API endpoints proliferate. Shadow and zombie APIs accumulate. Prompt injection and silent data exfiltration can operate undetected for extended periods if behavioural monitoring is absent.
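
A minimal sketch of what behavioural monitoring at the API layer can look like follows. The endpoint inventory, baselines, and thresholds are assumptions for illustration, not a description of Prophaze's product: unknown endpoints surface as shadow or zombie APIs, and known endpoints that suddenly exceed their baseline surface as candidates for abuse or exfiltration.

```python
from collections import Counter

# Hypothetical sketch, not Prophaze's implementation: compare observed API
# traffic against a known inventory and a simple per-endpoint baseline.
KNOWN_ENDPOINTS = {"/v1/chat", "/v1/embeddings", "/v1/health"}          # assumed inventory
BASELINE_CALLS_PER_MIN = {"/v1/chat": 1200, "/v1/embeddings": 300, "/v1/health": 60}

def review_traffic(requests_last_minute):
    """requests_last_minute: iterable of dicts like {'path': '/v1/chat', ...}."""
    counts = Counter(r["path"] for r in requests_last_minute)
    alerts = []
    for path, n in counts.items():
        if path not in KNOWN_ENDPOINTS:
            alerts.append(f"shadow/zombie API: {path} ({n} calls, not in inventory)")
        elif n > 3 * BASELINE_CALLS_PER_MIN[path]:
            alerts.append(f"anomalous volume on {path}: {n} calls/min vs baseline {BASELINE_CALLS_PER_MIN[path]}")
    return alerts
```

The point is not the specific heuristics, which real systems replace with learned behavioural models, but that this visibility has to exist from the first deployment rather than being retrofitted after an incident.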

This principle, that critical capabilities must be designed in from the foundation rather than bolted on after the fact, extends far beyond security. Governance, compliance, data sovereignty, auditability, human oversight: each of these degrades when treated as a phase-two concern. Organisations that build AI systems without these capabilities embedded in the architecture consistently find that adding them later is more expensive, more disruptive, and less effective than building them in from the start.

The panel's discussion of data sovereignty reinforced this. The conversation around sovereign AI infrastructure in India is maturing rapidly. DPDP compliance, SEBI guidelines, RBI regulations, and the Trusted Telecom Portal are creating a governance environment that demands architectural responses. Mathur's framing was pragmatic: sovereignty should mean transparent architecture, trusted supply chains, auditable network layers, security built into the fabric, and standards-based design that avoids lock-in. Das pushed further, arguing that telemetry data itself should remain within national borders and that dependence on foreign LLM providers creates structural vulnerability.

Both perspectives share an underlying insight: governance and sovereignty are architectural decisions. They are made, or fail to be made, at the design stage. Organisations that treat them as policy overlays on existing systems will find that the overlay never quite fits.
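
Read literally, an "architectural decision" can mean a check the system enforces rather than a policy document it references. The sketch below is purely illustrative, with assumed region names and telemetry sinks, but it shows the shape of the idea: residency violations fail loudly at configuration time instead of leaking quietly at runtime.

```python
# Illustrative only: treat data residency as a configuration-time check rather
# than a policy overlay. Region names and sinks here are assumptions.
ALLOWED_REGIONS = {"in-west-1", "in-south-1"}

TELEMETRY_SINKS = [
    {"name": "primary-observability", "region": "in-west-1"},
    {"name": "vendor-analytics", "region": "eu-central-1"},   # would ship telemetry abroad
]

def enforce_residency(sinks, allowed=ALLOWED_REGIONS):
    violations = [s for s in sinks if s["region"] not in allowed]
    if violations:
        names = ", ".join(f"{s['name']} ({s['region']})" for s in violations)
        raise ValueError(f"telemetry sinks outside allowed regions: {names}")
    return sinks

try:
    enforce_residency(TELEMETRY_SINKS)
except ValueError as err:
    print(f"deployment blocked: {err}")
```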

The Operating Model Gap

Bhattacharya, the moderator, offered perhaps the most consequential observation of the session. As a consultant who works with organisations going through AI scaling journeys, he identified the operating model as the persistent bottleneck. The questions that determine success or failure are not technical: what do we want AI to do? Why is it important? How will we measure value? Who has responsibility for what? What platforms are required? What governance is needed? Who owns the data layer?

These are organisational design questions. They require clarity about strategy, accountability, measurement, and decision rights. And they are precisely the questions that most AI deployments defer or distribute across too many stakeholders to be answered coherently.

The pattern is recognisable across sectors. A technology team builds a capable AI system. An IT team provides infrastructure. A legal team reviews compliance. A business unit defines the use case. Leadership approves the budget. And nobody owns the end-to-end operating model that would connect these functions into a coherent deployment. The pilot works because a small team can hold the whole picture in their heads. Production fails because the organisation cannot.

Hoy's observation about combining ML observability with network observability captures this in a microcosm. Today, he noted, these are done separately. The HPC team and the network team operate in parallel without shared visibility. As a result, when a job runs slowly, troubleshooting happens in the dark because nobody can see the full picture from GPU to GPU and everything in between. The technical solution is straightforward: combine the observability layers. The organisational solution is harder: get two teams with different reporting lines, different toolsets, and different professional identities to operate as one.
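
A joined view does not need to be elaborate to be useful. The sketch below uses an entirely hypothetical schema, not Arista's tooling: slow training steps are cross-referenced with link-level error counters so that the question "is this a model problem or a fabric problem?" is answered from one dataset rather than two.

```python
# Hypothetical schema, not Arista's tooling: join ML-side step timings with
# network-side link counters so slow steps can be traced to suspect links.

def correlate_slow_steps(step_metrics, link_metrics, slow_threshold_s=2.0, error_threshold=100):
    """step_metrics: [{'step': 41, 'duration_s': 3.7, 'ranks': [0, 1, 2, 3]}, ...]
       link_metrics: {'leaf3/port12': {'crc_errors': 450, 'ranks_served': [2, 3]}, ...}"""
    findings = []
    for step in step_metrics:
        if step["duration_s"] <= slow_threshold_s:
            continue  # only investigate steps that breached the latency budget
        suspects = [
            link for link, stats in link_metrics.items()
            if stats["crc_errors"] > error_threshold
            and set(stats["ranks_served"]) & set(step["ranks"])
        ]
        findings.append({"step": step["step"], "duration_s": step["duration_s"], "suspect_links": suspects})
    return findings
```

The code is trivial; the hard part is that the two inputs currently live with two different teams.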

This is the pattern at every level of AI scaling. The technology is available. The organisational integration is not.

How Mitochondria Thinks About This

The operating model gap is where every Mitochondria engagement begins. Before any AI system is designed, the first question is organisational: how does this workflow actually function today, who is involved, what decisions are made and by whom, what information moves between which systems, and where do things break down?

This is what our Stimuli phase maps. Not the documented process, not the org chart version of how things work, but the actual operational reality. The insight from this session, that infrastructure teams already know the barrier is organisational, aligns precisely with what we observe in every sector we work in. Manufacturing companies where evaluation knowledge lives in the heads of three senior engineers. Financial services firms where compliance review and customer interaction happen in parallel but disconnected systems. Agricultural organisations where farmer data exists across multiple backends but has never been connected into a coherent intelligence layer.

ATP, our Autonomous Task Processor framework, is designed around this understanding. The architecture is cloud-based, interfacing with client systems via API. No data is stored on our side. Processing is transient, encrypted in transit, and compliant with GDPR and DPDP frameworks. Security and governance are structural, built into the design from the first conversation, because we have seen what happens when they are added later.

But the architecture is only the second conversation. The first conversation is about the operating model. Who will use this system? Who will oversee it? How will decisions made by the AI be reviewed, challenged, and improved? What metrics define success? What happens when the system encounters something it has not seen before? How does human expertise remain central as autonomy increases?

These questions are not technical but organisational. And they determine whether a deployment scales or stalls at pilot.

Our phased deployment approach, Stimuli through to Energy, is designed to resolve the operating model question progressively rather than requiring it to be answered perfectly before work begins. The system starts with a limited scope and full human oversight. As it demonstrates reliability, autonomy expands. At each stage, the operating model crystallises: roles become clear, measurement becomes concrete, and governance becomes habitual rather than imposed. The organisation learns to work with AI by working with AI, in controlled conditions where the cost of learning is low.
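
In principle, "earning autonomy" reduces to a small, explicit rule. The thresholds and level names below are illustrative rather than ATP's actual gating logic, but they show the shape of it: autonomy expands only once a sufficient volume of reviewed decisions has cleared an agreed acceptance bar.

```python
# Illustrative gating rule, not ATP's actual mechanism: autonomy expands only
# after enough reviewed decisions have met an agreed acceptance threshold.
AUTONOMY_LEVELS = ["suggest_only", "act_with_approval", "act_and_report"]

def next_autonomy_level(current, reviewed, accepted, min_reviewed=500, min_acceptance=0.95):
    idx = AUTONOMY_LEVELS.index(current)
    if reviewed >= min_reviewed and accepted / reviewed >= min_acceptance:
        return AUTONOMY_LEVELS[min(idx + 1, len(AUTONOMY_LEVELS) - 1)]
    return current  # hold the current level until reliability is demonstrated

print(next_autonomy_level("suggest_only", reviewed=620, accepted=601))  # -> act_with_approval
```

Writing the rule down, whatever its exact thresholds, is itself an operating-model decision: it forces agreement on who reviews, what counts as acceptance, and when scope is allowed to grow.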

This is what Bhattacharya identified as the missing layer, and what the infrastructure professionals in the room already knew. The technology is ready. The question is whether the organisation is structured to let it work.

Observability as Operational Principle

The panel converged on a single investment priority: monitoring and observability. Hoy's formulation was direct: "If you can't see it, you can't troubleshoot it. And you're troubleshooting all the time." Mathur expanded this to the full architectural picture, arguing that the objective is minimum job completion time, and that observability is a prerequisite for identifying and removing bottlenecks across the system.

This principle applies well beyond GPU clusters and network fabrics. In any AI deployment, the ability to observe what the system is doing, what decisions it is making, where it is uncertain, and where it is failing, is the foundation of trust. Without observability, you cannot govern. Without governance, you cannot scale. And without scale, the non-linear returns that make AI investment worthwhile never materialise.

Every ATP deployment includes full interaction capture and structured logging, not as a compliance feature but as an operational principle. The system's decisions are visible, traceable, and auditable from day one. This serves immediate operational needs, allowing teams to identify issues and improve performance. It also serves the longer-term need for institutional confidence: the organisation can see what the AI is doing, and that visibility is what makes progressive autonomy possible.
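
At its simplest, full interaction capture is an append-only, timestamped record per decision. The field names and JSONL format below are illustrative, not the ATP schema, but they show how little is needed for every decision to be visible, traceable, and auditable from day one.

```python
import json
import time
import uuid

# Illustrative decision log, not the ATP schema: each decision is appended as a
# structured, timestamped record so it can be reviewed and audited later.

def log_decision(task, output, confidence, reviewer=None, log_path="decisions.jsonl"):
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "task": task,
        "output": output,
        "confidence": confidence,
        "reviewer": reviewer,  # None until a human has reviewed the decision
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record["id"]
```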

The infrastructure teams at the India AI Impact Summit 2026 know this intuitively. They have spent careers learning that systems you cannot observe are systems you cannot trust, and systems you cannot trust are systems you cannot scale. The same principle applies at every layer of AI deployment, from the network fabric to the application layer to the organisational operating model that connects them.

The machinery is ready. The observation that matters most came from the people who build it: the work now is organisational.

Mitochondria builds ATP — agentic AI for operations. It learns your workflows, earns autonomy in stages, and runs with governance built in. Your data stays yours. Based in Amsterdam and Pune, working with organisations across Europe and India.
