TCG079: Why Your State File is Actually a Distributed Systems Problem
Malcolm Matalka argues that Terraform's value lies not in its HCL syntax but in its state management, which is fundamentally a distributed systems problem inadequately solved by file-based locking. He discusses how StateGraph reimagines infrastructure state as a database rather than a JSON file, enabling concurrent operations, better queryability, and solving the scalability issues that plague teams as they grow.
Summary
The podcast challenges the narrative that AI and new tools have made infrastructure-as-code obsolete, arguing instead that the real stickiness of IaC comes from state management, not the DSL syntax. Hosts and guest Malcolm Matalka explore how Terraform's single-blob state file protected by file locks creates a distributed systems coordination problem that becomes a bottleneck as teams and infrastructure scale. The current best practice of splitting state into smaller root modules is presented as a band-aid solution that partitions for availability but sacrifices unified visibility.
Matalka explains that the mismatch between file semantics and the reality of multi-actor infrastructure coordination is the root problem. When multiple teams and CI/CD pipelines need to operate simultaneously, the file-lock approach forces serialization, creating queues and delays. He argues that treating this as a proper distributed systems problem—rather than a tooling problem—reveals that the solution is moving from files to databases.
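The serialization effect described above can be illustrated with a small, deterministic scheduling sketch. It is not how Terraform or StateGraph is implemented; the `Operation` type, the actor names, and the resource addresses are all hypothetical. It only contrasts one lock over the whole state with locks scoped to the resources each operation writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """A hypothetical apply: who runs it and which resource addresses it writes."""
    actor: str
    resources: frozenset


def schedule_with_file_lock(ops):
    """One lock over the whole state: every operation gets its own round."""
    return [[op] for op in ops]


def schedule_with_resource_locks(ops):
    """Greedy scheduling: an operation joins the first round whose
    already-locked resources do not overlap with its own."""
    rounds = []  # each entry: (operations in the round, resources the round locks)
    for op in ops:
        for members, locked in rounds:
            if locked.isdisjoint(op.resources):
                members.append(op)
                locked.update(op.resources)
                break
        else:
            # No existing round is free of conflicts, so open a new one.
            rounds.append(([op], set(op.resources)))
    return [members for members, _ in rounds]


# Made-up workload: two teams on unrelated resources, plus a pipeline
# that overlaps with one of them.
ops = [
    Operation("team-network", frozenset({"aws_vpc.main"})),
    Operation("team-app", frozenset({"aws_instance.web"})),
    Operation("ci-pipeline", frozenset({"aws_instance.web", "aws_s3_bucket.logs"})),
]

file_lock_rounds = schedule_with_file_lock(ops)
resource_lock_rounds = schedule_with_resource_locks(ops)

print(len(file_lock_rounds))      # rounds needed when one lock covers everything
print(len(resource_lock_rounds))  # rounds needed when only overlapping writes conflict
```

With this workload, the two teams share a round under resource-scoped locking, and only the pipeline that overlaps with `aws_instance.web` has to wait. The greedy placement is a simplification: it ignores ordering guarantees between conflicting operations, which a real transactional backend would have to define.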
The discussion emphasizes that AI acceleration of code writing has exposed existing bottlenecks in infrastructure tooling. The friction of GitOps workflows, where developers can't test locally without separate pull requests and approvals, prevents LLMs from effectively iterating on infrastructure changes. Matalka argues that giving developers local Terraform-like experiences with a backend service managing permissions would unlock better AI integration while maintaining determinism and auditability.
StateGraph, Matalka's solution, deconstructs the monolithic state file into a relational database (PostgreSQL) where resources, instances, and state metadata live in separate tables. This enables querying across boundaries, concurrent transactions on non-overlapping resources, history tracking, and correlation between code and state changes. Instead of hard boundaries between separate state files, teams get soft, queryable boundaries. Real customer pain points are cited: waiting two days in queues because plans take 4+ hours for large infrastructure.
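As a rough illustration of what separate tables make possible, the sketch below stores two former state files in one relational store and runs a single query across both. The schema is an assumption made for this example, not StateGraph's actual schema: the table names, column names, and sample data are hypothetical, and SQLite (from Python's standard library) stands in for PostgreSQL so the code runs without a server.

```python
import json
import sqlite3

# SQLite stands in for PostgreSQL so the sketch runs with no server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE states (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE,  -- what used to be one state file
    serial  INTEGER NOT NULL       -- state metadata
);
CREATE TABLE resources (
    id        INTEGER PRIMARY KEY,
    state_id  INTEGER NOT NULL REFERENCES states(id),
    type      TEXT NOT NULL,
    name      TEXT NOT NULL
);
CREATE TABLE instances (
    id           INTEGER PRIMARY KEY,
    resource_id  INTEGER NOT NULL REFERENCES resources(id),
    attributes   TEXT NOT NULL     -- JSON-encoded attributes
);
""")


def add_resource(state_name, rtype, rname, attributes):
    """Insert one resource and its single instance under the named state."""
    state_id = conn.execute(
        "SELECT id FROM states WHERE name = ?", (state_name,)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO resources (state_id, type, name) VALUES (?, ?, ?)",
        (state_id, rtype, rname),
    )
    conn.execute(
        "INSERT INTO instances (resource_id, attributes) VALUES (?, ?)",
        (cur.lastrowid, json.dumps(attributes)),
    )


# Made-up sample data: two logical states that would have been two files.
conn.executemany(
    "INSERT INTO states (name, serial) VALUES (?, ?)",
    [("network", 12), ("app", 40)],
)
add_resource("network", "aws_vpc", "main", {"cidr_block": "10.0.0.0/16"})
add_resource("network", "aws_instance", "bastion", {"instance_type": "t3.micro"})
add_resource("app", "aws_instance", "web", {"instance_type": "t3.large"})

# One query spans both states: every compute instance, whoever owns it.
rows = conn.execute("""
    SELECT s.name, r.type || '.' || r.name, i.attributes
    FROM instances AS i
    JOIN resources AS r ON r.id = i.resource_id
    JOIN states AS s ON s.id = r.state_id
    WHERE r.type = 'aws_instance'
    ORDER BY s.name, r.name
""").fetchall()

instances_by_state = [
    (state, address, json.loads(attrs)["instance_type"])
    for state, address, attrs in rows
]
print(instances_by_state)
```

The `state_id` column plays the role of a soft boundary here: ownership is still recorded, but it is a column to filter on rather than a wall between files.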
The conversation acknowledges that this approach requires running database infrastructure, making it a natural graduation path rather than a replacement for Terraform's lightweight starting point. The architecture deliberately avoids unnecessary complexity, using standard PostgreSQL and replication patterns rather than specialized graph databases, prioritizing simplicity, reliability, and customer trust.
About this episode
Malcolm Matalka joins William and Eyvonne to challenge the narrative that Infrastructure as Code (IaC) is dead. Malcolm argues that the real value of IaC was never the syntax, but state and governance. Together they examine whether the state was a file problem at all, or a distributed systems problem in a JSON costume.
Key Insights
- Matalka argues that the value of infrastructure-as-code has never been the HCL syntax itself, but rather the state layer that maps human intent to cloud identifiers and tracks infrastructure metadata.
- The standard best practice of splitting Terraform state into smaller root modules is presented as a workaround that sacrifices unified visibility and queryability in exchange for reducing serialization bottlenecks—a trade-off that becomes inadequate at scale.
- The core problem with Terraform's file-based state is that it applies file system semantics (single writer locks) to what is fundamentally a multi-actor distributed coordination problem, creating queues and preventing concurrent operations on non-overlapping resources.
- AI-assisted infrastructure development is currently hampered not by code generation speed but by friction in the feedback loop—GitOps workflows that require pull requests and separate approval steps prevent LLMs from iterating effectively on infrastructure changes.
- Moving infrastructure state to a relational database enables soft boundaries between infrastructure domains rather than hard boundaries, allowing teams to query infrastructure as a unified system while still enforcing appropriate access controls through transactions.
- StateGraph's architecture deliberately uses standard PostgreSQL with normal SQL queries and replication rather than specialized graph databases, reflecting a philosophy that well-established database patterns are more reliable and understandable than novel systems.
- The shift from Terraform's file-based state to database-backed state represents a natural graduation path for organizations rather than a replacement, as starting teams benefit from the simplicity of downloading a CLI while scaling teams encounter unmanageable serialization delays.
- Historical state tracking and correlation between code changes and infrastructure state changes become trivial queries in a database model but require external tooling, duct-tape solutions, or manual state archaeology in file-based systems.
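The last insight, that history and code-to-state correlation become ordinary queries, can be sketched as follows. This is a minimal example under stated assumptions: the `state_changes` table, its columns, the commit hashes, and the timestamps are all invented for illustration, and SQLite again stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE state_changes (
        id          INTEGER PRIMARY KEY,
        address     TEXT NOT NULL,  -- resource address, e.g. aws_instance.web
        commit_sha  TEXT NOT NULL,  -- code revision that produced the change
        applied_at  TEXT NOT NULL,  -- ISO 8601 UTC timestamp, sorts as text
        attribute   TEXT NOT NULL,
        old_value   TEXT,
        new_value   TEXT
    )
""")

# Made-up change log: two applies, identified by made-up commit hashes.
conn.executemany(
    "INSERT INTO state_changes "
    "(address, commit_sha, applied_at, attribute, old_value, new_value) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("aws_instance.web", "a1b2c3d", "2024-01-10T09:00:00Z",
         "instance_type", None, "t3.small"),
        ("aws_s3_bucket.logs", "a1b2c3d", "2024-01-10T09:00:00Z",
         "versioning", None, "enabled"),
        ("aws_instance.web", "e4f5a6b", "2024-03-02T14:30:00Z",
         "instance_type", "t3.small", "t3.large"),
    ],
)


def history(address):
    """Every recorded change to one resource, oldest first."""
    return conn.execute(
        "SELECT applied_at, commit_sha, attribute, old_value, new_value "
        "FROM state_changes WHERE address = ? ORDER BY applied_at",
        (address,),
    ).fetchall()


def resources_changed_by(commit_sha):
    """Which resources a given code revision touched."""
    rows = conn.execute(
        "SELECT DISTINCT address FROM state_changes "
        "WHERE commit_sha = ? ORDER BY address",
        (commit_sha,),
    ).fetchall()
    return [address for (address,) in rows]


for row in history("aws_instance.web"):
    print(row)
print(resources_changed_by("a1b2c3d"))
```

In a file-based setup, answering either question typically means diffing archived state snapshots by hand, which is the "state archaeology" the insight refers to.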
Transcript
AI is here. AI is here and Terraform's dead. OpenTofu dead, dead, dead, dead things. So reading the next wave of hot takes out there might have you believe that this is actually true. Well, here's a newsflash. The value was never this DSL thing that people see that we call HCL or Terraform Configuration Language. It was the state. And for like over a very long time, over a decade, we've been storing that state in a flat JSON file protected by a single lock. And then when it tips over at like broader scale, we feign surprise and pain, like all the soccer players you see when you watch the World Cup, you know, and someone…