The Substrate Beneath the Substrate

Why AI re-physicalizes the platform, and what that means for the people who actually run it

There is a particular kind of slide that has been showing up in enterprise architecture decks for about ten years. It shows the cloud as an abstract layer at the bottom, with applications stacked on top, and arrows running between them. The slide is useful for explaining what platform engineering does. It is also, quietly, a lie.

The lie is not that the cloud isn't real. The cloud is extremely real. The lie is that the slide shows the cloud as a layer, when it is actually a building, in a specific place, full of specific machines, run by specific people who get specific calls at three in the morning when the air conditioning fails. The abstraction was useful enough that for a decade we let ourselves forget the buildings existed. We outsourced the buildings to Microsoft and Amazon and Google, paid the bills, and watched the architecture diagrams stop including any layer below "cloud."

The AI moment is making the buildings visible again, and most enterprise architecture functions are not ready.

This is the second piece in a series about how the AI moment is reshaping the enterprise architecture discipline. The first piece argued that the platform you built (your Paved Road, your Internal Developer Platform, your Landing Zone factory) is now one of two delivery tracks rather than the whole story, with a Modern AI Workplace track running alongside it and five horizontal fabrics shared between them. That essay made an honest gesture toward a layer underneath the fabrics, called it "the operated substrate," and then moved on. This is the piece where I stop gesturing.

What re-physicalization actually means

Bringing AI compute on-premises means choosing between two kinds of hardware, and they pose different problems. The first kind is the one the industry has been talking about for the last few years: rack-mounted GPU clusters, H100s or H200s or B200s in proper server chassis, in proper rooms, with proper power and cooling. This is a specific NVIDIA configuration in a specific rack in a specific facility, the kind of thing every enterprise IT shop knew how to talk about ten years ago and, after a decade of cloud migration, mostly stopped being asked about.

The second kind is genuinely new, and most architecture conversations haven't caught up to it yet. Devices like NVIDIA's DGX Spark are workgroup-class machines that run 70-billion-parameter models locally, but in a desktop form factor, on standard power, at workstation noise levels, sitting on or under a desk. You can chain a few of them together and put the chain in a project room. They are not racked. They are not in a controlled facility. They are not, by their physical profile, infrastructure in any traditional sense.

These two classes of hardware re-physicalize the platform in different ways, and the work of governing them is different too. The rack-mounted cluster looks familiar to anyone who has run a data center, even if the muscle memory has atrophied. The Spark-class device doesn't look familiar to anyone, because the category didn't exist before. The cloud-only narrative made the first kind invisible. The second kind is something the narrative didn't anticipate at all.

The re-physicalization isn't only about workgroup compute. It runs through three things at once.

The first is the workgroup tier itself, in both of the forms discussed above. If you go the rack-mounted route (an H100 or H200 or B200 cluster in a proper room with proper handling), the challenge is mostly one of remembering disciplines that lapsed during the cloud era. The cabinet, the cooling, the access control, the hardware refresh cycle, all of it is recoverable knowledge for any organization that ran a serious data center within the last ten years, even if that knowledge has been allowed to atrophy. The work is real, but it's familiar work.

If you go the Spark-class route, the challenge is something else. The unobtrusive physical profile is exactly what makes governance hard. There is no straightforward argument for locking a Spark in a cabinet, racking it in a controlled space, or treating it as different from the standard-issue PC three desks away, because physically it isn't different. It doesn't demand special power. It doesn't demand special cooling. The network requirements matter (a 70B-class inference workload talking to remote clients at meaningful throughput is a different shape than office traffic), but even those don't force a particular physical placement. The hardware will sit, governably or ungovernably, wherever the team decides to put it.

Which means the governance has to come from somewhere other than physical handling. Identity-bound access, ACL-as-code, lifecycle vending, audit trails, classification-aware routing, all the disciplines from the platform side, have to extend to a thing that looks like a workstation but acts like infrastructure. The teams that historically governed infrastructure governed it partly through physical control. The teams that historically governed workstations governed them partly through Intune and endpoint policy. A Spark fits neither category cleanly, and the work of figuring out which governance disciplines apply to which aspects of its operation is the work nobody has been funded to do yet.

The two options are not mutually exclusive. A serious enterprise AI strategy will probably end up with rack-mounted clusters in central facilities for the heaviest workloads, and Spark-class devices distributed to project teams for closer-to-the-work scenarios. Each demands a different kind of operational attention. The architecture function has to recognize both, and the operating teams have to be set up to handle a hybrid that didn't exist five years ago.

The second is endpoint NPUs. The current generation of laptops (the Copilot+ PCs, the M-series Macs, the AI-capable mobile workstations) have neural acceleration silicon that fundamentally changes what a corporate laptop can do. A 4-billion-parameter scout model running on an NPU is not the same animal as a Word document. It has performance characteristics, security implications, update cadences, and failure modes that the End User Computing team has not historically been responsible for. Whoever runs your endpoint estate is now operating an inference platform, whether they signed up for it or not.

The third is the network fabric between the two. A laptop in Lisbon talking to a Spark in Copenhagen has latency requirements that didn't matter when the laptop was just talking to Microsoft 365. The corporate WAN, the SD-WAN overlay, the VPN concentrator, the path through the carrier, all of it is suddenly part of the AI experience. If a remote worker's scout model takes 200 milliseconds to reach the heavy artillery in the office, the experience is fine. If it takes two seconds, the experience is broken, and the user finds a workaround that probably involves a SaaS chatbot and a copy-paste of NDA'd content.
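One way to make that latency conversation concrete is to treat the path as a list of hops measured against a budget. The sketch below is illustrative only: the hop names and millisecond figures are invented, and the 200-millisecond budget is simply the "fine" figure from the paragraph above, not a measured threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One segment of the path between the laptop and the workgroup device."""
    name: str
    latency_ms: float


def path_latency_ms(hops):
    """Total latency along the path: the sum of every hop."""
    return sum(hop.latency_ms for hop in hops)


def within_budget(hops, budget_ms):
    """True when the whole path fits inside the latency budget."""
    return path_latency_ms(hops) <= budget_ms


def worst_hop(hops):
    """The hop contributing the most latency, i.e. where to look first."""
    return max(hops, key=lambda hop: hop.latency_ms)


# Hypothetical figures for a Lisbon-to-Copenhagen path; none are measurements.
lisbon_to_copenhagen = [
    Hop("vpn_concentrator", 35),
    Hop("carrier_path", 60),
    Hop("sdwan_overlay", 25),
    Hop("office_lan", 5),
]

if __name__ == "__main__":
    print(path_latency_ms(lisbon_to_copenhagen), "ms total")
    print("within 200 ms budget:", within_budget(lisbon_to_copenhagen, 200))
    print("largest contributor:", worst_hop(lisbon_to_copenhagen).name)
```

The useful part is less the arithmetic than the list itself: every entry in it is owned by a team, and the budget is only met if all of them know they are on the path.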

None of these three are abstractions. All three are operated by specific teams who, in most organizations, were not consulted when the AI strategy was written.

The work that the slide hid

The cloud-only narrative had a particular cost that I want to name, because it has been sitting in the background of every "AI strategy" conversation I've been part of in the last year. The cost is that we trained two generations of architects to treat infrastructure as a resource you provision rather than as a substrate you operate. The platform engineering discipline, for all its real virtues, mostly inherited that framing. The Paved Road dispenses cloud subscriptions. The IDP automates pipeline configuration. The vending machine creates landing zones. All of these are provisioning activities, and they are all important, and none of them is the same thing as keeping the lights on.

The work that the slide hid is the operating work. It is the network engineer who notices that the path between two regions has degraded, and reroutes traffic before anyone files a ticket. It is the SRE who looks at the Splunk indexer cluster and realizes capacity will run out in six weeks unless something changes, and changes it. It is the storage admin who knows which array is approaching the end of its support life and quietly orchestrates the migration before it becomes a crisis. It is the identity team that, when the Entra ID outage hits a region, knows which downstream systems will fail and in what order. It is, in short, the work of making sure that the abstractions the architects rely on continue to be real.

This work has been undervalued for a long time. Some of that undervaluing was structural (the move from on-premises to cloud genuinely did automate a lot of routine operating tasks, and the people who used to do those tasks were retrained or moved on). Some of it was narrative (the platform engineering literature spent a decade celebrating the developer experience without spending much time on the operator experience that made the developer experience possible). And some of it was just generational, the way that every new layer of abstraction tends to forget the layer it sits on top of until something breaks.

The AI moment is the something that breaks.

What the AI moment specifically demands

The horizontal fabrics from the first essay (identity, data governance, observability, policy, vending) all sit on operated infrastructure that has to scale in new ways to support the second track.

The identity fabric has to handle a richer set of relationships: not just user-to-application, but device-to-device, user-to-model, project-to-compute. The IdP that used to answer "can Alice log in to Salesforce" now has to answer "can Alice's laptop reach the workgroup Spark, while running this specific model, on behalf of an agent that's calling another agent, right now". Same fabric, much more decision traffic. A NetBird-style mesh with identity-bound ACLs is a different load on the IdP than a traditional VPN concentrator. Conditional Access policies have to evaluate posture for AI-related entitlements (is this laptop allowed to run a 7B model? is this device permitted to reach the workgroup Spark?) at a far higher decision rate than the login question ever required.
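To show how much more is packed into that second question, here is a minimal sketch of the decision as a function. Everything in it is hypothetical: the field names, the entitlement shape, and the agent-depth limit are assumptions for illustration, not the data model of Entra ID, NetBird, or any other product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """Who is asking, from what device, for which model, on which target."""
    user: str
    device: str
    device_compliant: bool
    model: str
    target: str
    agent_chain: tuple = ()  # agents acting on the user's behalf, in order


@dataclass(frozen=True)
class Entitlement:
    """Hypothetical grant: a user may reach a target with certain models."""
    user: str
    target: str
    allowed_models: frozenset
    max_agent_depth: int


def decide(request, entitlements):
    """Return (allowed, reason). Posture is checked before any entitlement."""
    if not request.device_compliant:
        return (False, "device posture not compliant")
    for grant in entitlements:
        if grant.user == request.user and grant.target == request.target:
            if request.model not in grant.allowed_models:
                return (False, "model not entitled")
            if len(request.agent_chain) > grant.max_agent_depth:
                return (False, "agent delegation too deep")
            return (True, "allowed")
    return (False, "no entitlement for user and target")


# Illustrative data only; the names are made up.
grants = [Entitlement("alice", "workgroup-spark", frozenset({"example-70b"}), 1)]
request = AccessRequest("alice", "laptop-17", True, "example-70b",
                        "workgroup-spark", agent_chain=("planner",))

if __name__ == "__main__":
    print(decide(request, grants))
```

The old login question had one subject and one resource. This one has five inputs, and it is evaluated on every call rather than once per session, which is where the extra load on the IdP comes from.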

The observability fabric has to ingest from new producers. Until recently the Splunk indexers had a predictable diet: sign-in logs, Defender alerts, application traces, the usual fare. They were sized for that. Now the same indexers are being asked to handle a firehose of new event types they weren't planned for: endpoint inference logs, workgroup compute audit trails, model invocation telemetry, prompt provenance metadata. Capacity planning is a real conversation, not a slide.
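The capacity conversation can start with very plain arithmetic. The sketch below assumes a simple steady-state model (daily ingest multiplied by retention) and ignores compression, replication, and index overhead. All of the volumes, the 90-day retention, and the 20,000 GB capacity are invented numbers, not figures from any real estate.

```python
def steady_state_gb(daily_ingest_gb, retention_days):
    """Storage held once retention is full: daily volume times days kept."""
    return daily_ingest_gb * retention_days


def headroom_gb(capacity_gb, producers, retention_days):
    """Capacity left over (negative means short) for a set of producers."""
    total_daily_gb = sum(producers.values())
    return capacity_gb - steady_state_gb(total_daily_gb, retention_days)


# Hypothetical daily volumes in GB for the diet the indexers were sized for.
EXISTING = {
    "sign_in_logs": 50,
    "defender_alerts": 30,
    "application_traces": 120,
}

# Hypothetical daily volumes in GB for the new AI-related producers.
NEW_AI = {
    "endpoint_inference_logs": 40,
    "workgroup_audit_trails": 10,
    "model_invocation_telemetry": 60,
    "prompt_provenance_metadata": 20,
}

if __name__ == "__main__":
    print("before:", headroom_gb(20_000, EXISTING, 90), "GB headroom")
    print("after: ", headroom_gb(20_000, {**EXISTING, **NEW_AI}, 90), "GB headroom")
```

With these made-up inputs, a cluster with comfortable headroom goes nearly ten terabytes short once the new producers arrive. The real numbers will differ, but someone has to own producing them.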

The data governance fabric has to perform classification at a different latency. When a document is being classified for storage, classification can take seconds; nobody's waiting. When it is being classified for an inference routing decision (does this go to the endpoint scout, the workgroup heavy artillery, or the cloud LLM?), it has to take milliseconds, because the user is waiting on a response and the AI experience falls apart if the routing decision takes longer than the inference itself. The classification engine that lives comfortably in a batch pipeline now has to serve real-time queries. That's an operating problem, not a policy problem.
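A minimal sketch of what a latency-bounded routing decision might look like follows. The label names, the label-to-tier mapping, and the choice to fall back to the on-device scout when the classifier is slow or unsure are all assumptions made for illustration. Note that this version only checks the budget after the classifier returns; a real service would enforce a timeout rather than measure one.

```python
import time

# Hypothetical label-to-tier mapping; a real one belongs to data governance.
ROUTES = {
    "public": "cloud_llm",
    "confidential": "workgroup",
    "restricted": "endpoint_scout",
}

# Assumption: when in doubt, keep the request on the device.
FALLBACK_ROUTE = "endpoint_scout"


def route(document, classify, budget_ms, clock=time.perf_counter):
    """Pick an inference tier, failing closed if classification is too slow.

    `classify` is any callable returning a label for the document.
    `clock` returns seconds and is injectable so the timing can be tested.
    """
    start = clock()
    label = classify(document)
    elapsed_ms = (clock() - start) * 1000.0
    if elapsed_ms > budget_ms:
        # The answer arrived after the budget expired, so it is not trusted
        # for routing; the request stays on the most restrictive tier.
        return FALLBACK_ROUTE
    # Unknown labels also fall back rather than defaulting to the cloud.
    return ROUTES.get(label, FALLBACK_ROUTE)


if __name__ == "__main__":
    # A stand-in classifier that labels everything "public".
    print(route("quarterly newsletter", lambda doc: "public", budget_ms=20))
```

The design choice worth arguing about is the fallback. Failing closed protects the data but makes a slow classifier look like a policy restriction, which is exactly the kind of behavior the fabric owner and the substrate operator need to agree on in advance.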

The policy fabric has to cover more enforcement points. Yesterday the policy story was tidy: Azure Policy enforces against cloud resources, Intune CSPs enforce against endpoints, and that more or less covered the estate. Today it doesn't. There are network ACLs in a mesh fabric, model entitlements in an endpoint runtime, and ingress rules on a workgroup Spark. Each enforcement point is operated by a different team. The policy-as-code discipline that worked when there was one enforcement point per resource type now has to coordinate across enforcement points that are not in the same control plane.
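One way to picture the coordination problem: a single policy intent has to be rendered into a rule for each enforcement point, and then someone has to notice when one of them diverges. The sketch below uses made-up rule shapes. They are not the real formats of Azure Policy, Intune, NetBird, or any workgroup device, and the names are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """One policy decision, stated once, independent of where it is enforced."""
    project: str
    group: str
    model: str
    workgroup_host: str


def render(intent):
    """Expand one intent into a rule per enforcement point (hypothetical shapes)."""
    return {
        "mesh_acl": {
            "source_group": intent.group,
            "destination": intent.workgroup_host,
            "action": "accept",
        },
        "endpoint_runtime": {
            "group": intent.group,
            "allowed_models": [intent.model],
        },
        "workgroup_ingress": {
            "host": intent.workgroup_host,
            "project": intent.project,
            "allow_groups": [intent.group],
        },
    }


def drift(expected, deployed):
    """Enforcement points whose deployed rule is missing or differs."""
    return sorted(
        point for point, rule in expected.items()
        if deployed.get(point) != rule
    )


intent = Intent("project-x", "project-x-members", "example-70b", "workgroup-spark")
expected = render(intent)

if __name__ == "__main__":
    # Simulate the mesh team changing their rule without the others knowing.
    deployed = dict(expected)
    deployed["mesh_acl"] = {**expected["mesh_acl"], "action": "drop"}
    print(drift(expected, deployed))
```

Each key in that dictionary is a different team's control plane. The code is trivial; getting three teams to accept a shared source of intent is not.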

The vending fabric has to dispense different things. Provisioning used to mean "spin up a subscription, attach the right tags, hand it off". Now it also means handing someone a laptop that arrives with the right scout model already on it, or standing up a project tenancy on a Spark with the ACLs and data lifecycle policy baked in from day zero. The SVM dispenses cloud landing zones. The endpoint vending machine dispenses laptops with the right scout bundle, the right ZTNA profile, the right managed entitlements. The workgroup vending machine dispenses Spark project tenancy with ACLs and data lifecycle policy. All three are vending pipelines, but the operators of the underlying pipelines are different teams, and the pipelines themselves are operated services that someone has to keep running.

In each case, the architecture is recognizable. The operating load is new. And the operating load is not absorbed by the platform team. It is absorbed by the network engineers, identity admins, endpoint operators, observability engineers, and SREs who have been quietly keeping the lights on while the platform discipline was busy celebrating itself.

The architecture implication

There is a temptation, when you notice that operations matters, to add an "Operations" box to the architecture diagram and call the problem solved. This is the wrong move, for the same reason that adding a "Track 0" to a Two-Track diagram would be the wrong move. Operations is not a peer to the delivery tracks. It is the substrate the tracks depend on. Putting it in the diagram as if it were a peer flattens a layering relationship that actually matters.

The right move is to acknowledge the substrate as a layer, with a name, and to make the contracts between the substrate and the layers above it explicit. The umbrella from the first essay had principles at the top, fabrics in the middle, tracks at the bottom, and seams between the tracks. The honest version has a foundation underneath all of it: the operated infrastructure that makes the fabrics real. Network. Datacenters and on-premises compute estate. Identity stores. Observability platforms as products rather than as capabilities. Hardware lifecycle. Power and cooling. The mundane things that the cloud allowed us to forget.

The contracts between the substrate and the fabrics are where most of the interesting architectural decisions actually live. What's the SLO of the identity provider, and what does the policy fabric assume about availability? What's the maximum latency budget for a real-time classification call, and what does that mean for where the classification engine is hosted? What's the failure mode of the workgroup Spark when the office network goes down, and does the scout-and-heavy-artillery model degrade gracefully or fail catastrophically? Each of these is a contract. Each contract is a conversation between a fabric owner and a substrate operator. Each conversation is the kind of thing that, if it doesn't happen, gets discovered as an incident report eighteen months later.
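The simplest of these contracts, availability, can be written down and checked mechanically. The sketch below is an assumption about how one might record it: the contract fields and the example figures are invented, and a real contract would cover latency and failure modes as well. The one hard number in it is arithmetic: 99.9% availability over a 30-day month allows roughly 43 minutes of downtime.

```python
from dataclasses import dataclass

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


@dataclass(frozen=True)
class Contract:
    """What a substrate operator offers versus what a fabric design assumes."""
    substrate: str
    fabric: str
    offered_availability: float   # what the operator commits to
    assumed_availability: float   # what the fabric silently relies on


def allowed_downtime_minutes(availability):
    """Downtime per 30-day month implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_30_DAYS


def gaps(contracts):
    """Contracts where the fabric assumes more than the substrate offers."""
    return [c for c in contracts
            if c.assumed_availability > c.offered_availability]


# Illustrative figures only; none describe a real provider or platform.
contracts = [
    Contract("identity_provider", "policy", 0.999, 0.9999),
    Contract("splunk_indexers", "observability", 0.995, 0.99),
]

if __name__ == "__main__":
    for c in gaps(contracts):
        print(f"{c.fabric} assumes more than {c.substrate} offers")
    print(round(allowed_downtime_minutes(0.999), 1), "minutes per 30 days")
```

A gap in this list is the incident report from the paragraph above, found eighteen months early.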

The architecture diagram should make these contracts visible. The org chart should make their owners namable. And the work of operating the substrate should be funded as a peer to the work of building the tracks, because if it isn't, the tracks won't run.

What this means for the people doing the work

There is a category of professional in most enterprises who has been told, implicitly or explicitly, for the last decade, that their work is becoming less central. The network engineer who watched the WAN architecture get abstracted into SD-WAN. The identity admin who watched directory services move into the cloud. The systems administrator who watched physical servers disappear into virtualization, and then virtualization disappear into containers, and then containers disappear into serverless. The story they were told was: your work matters less now, because the abstraction has eaten it.

This was always partially false (someone has to operate the abstraction, and that someone is them) but it was emotionally true enough that the professional category mostly accepted it. The center of gravity in IT moved to platform engineering, to cloud architecture, to security architecture, to AI strategy. Operations became the thing you did to support the interesting work, rather than the interesting work itself.

The AI moment changes this. The second track, the Modern AI Workplace, is built on physical hardware in physical buildings connected by physical networks. The work of operating that estate is the work of making AI in the workplace real. The network engineer who knows how to size the office WAN for inference traffic is not legacy support. They are the person whose work determines whether the corporate AI strategy actually functions for remote workers. The identity admin who can model device-to-device authorization in NetBird is not a supporting role. They are the one who makes Zero Trust enforceable at scale. The systems administrator who can operate a workgroup Spark with proper lifecycle and audit is the one who lets project teams work with NDA'd data without leaking it to a hyperscaler.

This isn't sentimentality. It's an architectural fact. The discipline that has been treated as legacy is the one the AI moment most directly depends on. Any enterprise architecture function that doesn't recognize this and recruit accordingly will discover, in the next eighteen months, that its AI strategy is a set of slides without a substrate.

The honest parts

Three things I'm not certain about, and want to test against other people's experience.

The first is whether the framing "operations as substrate" lands the way I want it to, or whether it accidentally reinforces the hierarchical thinking that put operations under the platform in the first place. The point is not that operations is below the tracks in some pejorative sense. The point is that the tracks depend on operations the way a building depends on its foundation. Foundations are not less important than the buildings on top of them. They are the thing that lets the buildings exist. I think the substrate framing captures this, but I'm willing to be told otherwise.

The second is whether the network fabric for the AI workplace is actually a meaningfully different problem than the network fabric for cloud applications, or whether I'm overstating the change. My instinct is that the latency budgets are different, the traffic patterns are different (inference workloads burst differently than HTTP), and the trust model is different (device-to-device matters in a way it didn't), but I haven't operated this at scale and I'd rather hear from people who have.

The third is whether the cultural recovery I'm describing (operations work being recognized as central again) is actually happening, or whether I'm projecting onto the situation what I'd like to see. The platform engineering discipline has been very good at writing manifestos about its own importance. Operations has, on the whole, been worse at this. The reframing has to come from somewhere, and there's no obvious reason it has to come from the operators themselves rather than from architects who notice that the substrate is what makes their architectures real.

If you are doing this work, in any of the disciplines I've named, I'd like to hear from you. The substrate isn't going to claim its own seat at the AI strategy table. Someone has to invite it. This essay is one such invitation. The next pieces in the series go into the fabrics one at a time (identity first, because it's where everything else starts), and after that into the ZTNA decision, where the network and identity teams are going to have to figure out together what shape the future actually takes.

Christoffer Silversparre Jørgensen is a cybersecurity architect based in Helsingør, Denmark, writing about the intersection of security architecture, platform engineering, and the AI-era enterprise. The first piece in this series, The Two-Track Enterprise, introduced the umbrella frame this essay extends.
