TCG079: Why Your State File is Actually a Distributed Systems Problem

Malcolm Matalka argues that Terraform's value lies not in its HCL syntax but in its state management, which is fundamentally a distributed systems problem inadequately solved by file-based locking. He discusses how StateGraph reimagines infrastructure state as a database rather than a JSON file, enabling concurrent operations, better queryability, and solving the scalability issues that plague teams as they grow.

Summary

The podcast challenges the narrative that AI and new tools have made infrastructure-as-code obsolete, arguing instead that the real stickiness of IaC comes from state management, not the DSL syntax. The hosts and guest Malcolm Matalka explore how Terraform's single-blob state file, protected by a file lock, creates a distributed systems coordination problem that becomes a bottleneck as teams and infrastructure scale. The current best practice of splitting state into smaller root modules is presented as a band-aid: it partitions state for availability but sacrifices unified visibility.

Matalka explains that the mismatch between file semantics and the reality of multi-actor infrastructure coordination is the root problem. When multiple teams and CI/CD pipelines need to operate simultaneously, the file-lock approach forces serialization, creating queues and delays. He argues that treating this as a proper distributed systems problem—rather than a tooling problem—reveals that the solution is moving from files to databases.
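The serialization effect described above can be illustrated with a small scheduling model. This is only a sketch under stated assumptions: the Operation type, the team names, the resource addresses, and the durations are all hypothetical, every run is assumed to be submitted at time zero, and locks are granted in submission order. It is not how Terraform or StateGraph schedules work internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str             # hypothetical team or pipeline name
    resources: frozenset  # resource addresses this run touches
    minutes: int          # how long the run holds its lock(s)


def finish_times_global_lock(ops):
    """One lock for the whole state: every run waits for all earlier runs."""
    clock = 0
    finish = {}
    for op in ops:
        clock += op.minutes
        finish[op.name] = clock
    return finish


def finish_times_resource_locks(ops):
    """Per-resource locks: a run waits only for earlier runs it overlaps with."""
    free_at = {}  # resource address -> time at which its lock is released
    finish = {}
    for op in ops:
        start = max((free_at.get(r, 0) for r in op.resources), default=0)
        end = start + op.minutes
        for r in op.resources:
            free_at[r] = end
        finish[op.name] = end
    return finish


# Illustrative workload: two runs share aws_vpc.main, one is independent.
ops = [
    Operation("network-team", frozenset({"aws_vpc.main"}), 30),
    Operation("app-team", frozenset({"aws_ecs_service.api"}), 20),
    Operation("data-team", frozenset({"aws_vpc.main", "aws_db_instance.db"}), 10),
]

print(finish_times_global_lock(ops))
print(finish_times_resource_locks(ops))
```

Under the single lock, the app-team run finishes at minute 50 even though it shares nothing with the network-team run. With per-resource locking it finishes at minute 20, and only the data-team run, which really does overlap on aws_vpc.main, has to wait.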

The discussion emphasizes that AI acceleration of code writing has exposed existing bottlenecks in infrastructure tooling. The friction of GitOps workflows, where developers can't test locally without separate pull requests and approvals, prevents LLMs from effectively iterating on infrastructure changes. Matalka argues that giving developers local Terraform-like experiences with a backend service managing permissions would unlock better AI integration while maintaining determinism and auditability.

StateGraph, Matalka's solution, deconstructs the monolithic state file into a relational database (PostgreSQL) where resources, instances, and state metadata live in separate tables. This enables querying across boundaries, concurrent transactions on non-overlapping resources, history tracking, and correlation between code and state changes. Instead of hard boundaries between separate state files, teams get soft, queryable boundaries. The episode cites real customer pain points, such as waiting two days in a queue because plans take 4+ hours on large infrastructure.
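To make the "separate tables" idea concrete, here is a minimal sketch of what a relational layout might look like. Everything in it is an assumption for illustration: the table and column names are invented, the cloud identifiers are placeholders, and SQLite (from the Python standard library) stands in for PostgreSQL so the example runs without a server. It is not StateGraph's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: state metadata, resources, and instances in separate tables.
conn.executescript("""
CREATE TABLE states (
    id     INTEGER PRIMARY KEY,
    name   TEXT UNIQUE NOT NULL,   -- what used to be one state file
    serial INTEGER NOT NULL
);
CREATE TABLE resources (
    id       INTEGER PRIMARY KEY,
    state_id INTEGER NOT NULL REFERENCES states(id),
    address  TEXT NOT NULL,        -- e.g. aws_instance.web
    type     TEXT NOT NULL,
    UNIQUE (state_id, address)
);
CREATE TABLE instances (
    id          INTEGER PRIMARY KEY,
    resource_id INTEGER NOT NULL REFERENCES resources(id),
    cloud_id    TEXT NOT NULL      -- provider-side identifier (placeholder values)
);
""")

conn.executemany("INSERT INTO states VALUES (?, ?, ?)",
                 [(1, "network", 7), (2, "app", 12)])
conn.executemany("INSERT INTO resources VALUES (?, ?, ?, ?)", [
    (1, 1, "aws_vpc.main", "aws_vpc"),
    (2, 1, "aws_instance.bastion", "aws_instance"),
    (3, 2, "aws_instance.web", "aws_instance"),
])
conn.executemany("INSERT INTO instances VALUES (?, ?, ?)", [
    (1, 1, "vpc-example"),
    (2, 2, "i-bastion-example"),
    (3, 3, "i-web-example"),
])


def instances_of_type(conn, resource_type):
    """List every instance of a resource type, regardless of which state owns it."""
    rows = conn.execute(
        """
        SELECT s.name, r.address, i.cloud_id
        FROM instances i
        JOIN resources r ON r.id = i.resource_id
        JOIN states s ON s.id = r.state_id
        WHERE r.type = ?
        ORDER BY s.name, r.address
        """,
        (resource_type,),
    )
    return rows.fetchall()


print(instances_of_type(conn, "aws_instance"))
```

The query crosses what would be two separate state files in a split-root-module setup. The state name still records ownership, which is one way to read the "soft boundary" idea: the boundary is a column you can filter on, not a wall between files.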

The conversation acknowledges that this approach requires running database infrastructure, making it a natural graduation path rather than a replacement for Terraform's lightweight starting point. The architecture deliberately avoids unnecessary complexity, using standard PostgreSQL and replication patterns rather than specialized graph databases, prioritizing simplicity, reliability, and customer trust.

About this episode

Malcolm Matalka joins William and Eyvonne to challenge the narrative that Infrastructure as Code (IaC) is dead. Malcolm argues that the real value of IaC was never the syntax, but state and governance. Together they examine whether the state was a file problem at all, or a distributed systems problem in a JSON costume.

Key Insights

  • Matalka argues that the value of infrastructure-as-code has never been the HCL syntax itself, but rather the state layer that maps human intent to cloud identifiers and tracks infrastructure metadata.
  • The standard best practice of splitting Terraform state into smaller root modules is presented as a workaround that sacrifices unified visibility and queryability in exchange for reducing serialization bottlenecks—a trade-off that becomes inadequate at scale.
  • The core problem with Terraform's file-based state is that it applies file system semantics (single writer locks) to what is fundamentally a multi-actor distributed coordination problem, creating queues and preventing concurrent operations on non-overlapping resources.
  • AI-assisted infrastructure development is currently hampered not by code generation speed but by friction in the feedback loop—GitOps workflows that require pull requests and separate approval steps prevent LLMs from iterating effectively on infrastructure changes.
  • Moving infrastructure state to a relational database enables soft boundaries between infrastructure domains rather than hard boundaries, allowing teams to query infrastructure as a unified system while still enforcing appropriate access controls through transactions.
  • StateGraph's architecture deliberately uses standard PostgreSQL with normal SQL queries and replication rather than specialized graph databases, reflecting a philosophy that well-established database patterns are more reliable and understandable than novel systems.
  • The shift from Terraform's file-based state to database-backed state represents a natural graduation path for organizations rather than a replacement, as starting teams benefit from the simplicity of downloading a CLI while scaling teams encounter unmanageable serialization delays.
  • Historical state tracking and correlation between code changes and infrastructure state changes become trivial queries in a database model but require external tooling, duct-tape solutions, or manual state archaeology in file-based systems.
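The last insight, that history and code-to-state correlation become simple queries, can be sketched as follows. This assumes a hypothetical append-only state_changes table that records the commit behind each applied change. The table name, columns, commit hashes, and timestamps are all made up for illustration, and SQLite again stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical append-only log: one row per resource change per apply.
conn.execute("""
CREATE TABLE state_changes (
    id         INTEGER PRIMARY KEY,
    address    TEXT NOT NULL,   -- resource address
    action     TEXT NOT NULL,   -- create / update / delete
    commit_sha TEXT NOT NULL,   -- code revision that produced the change
    applied_at TEXT NOT NULL    -- ISO 8601 timestamp
)
""")
conn.executemany(
    "INSERT INTO state_changes (address, action, commit_sha, applied_at) "
    "VALUES (?, ?, ?, ?)",
    [
        ("aws_instance.web", "create", "a1b2c3d", "2024-01-10T09:00:00Z"),
        ("aws_vpc.main", "create", "a1b2c3d", "2024-01-10T09:00:00Z"),
        ("aws_instance.web", "update", "e4f5a6b", "2024-02-02T14:30:00Z"),
        ("aws_instance.web", "update", "c7d8e9f", "2024-03-15T11:45:00Z"),
    ],
)


def history(conn, address):
    """Return every recorded change to one resource, oldest first."""
    return conn.execute(
        "SELECT applied_at, action, commit_sha FROM state_changes "
        "WHERE address = ? ORDER BY applied_at, id",
        (address,),
    ).fetchall()


def changed_by_commit(conn, commit_sha):
    """Return the resource addresses that a given commit changed."""
    rows = conn.execute(
        "SELECT address FROM state_changes WHERE commit_sha = ? ORDER BY address",
        (commit_sha,),
    )
    return [row[0] for row in rows]


print(history(conn, "aws_instance.web"))
print(changed_by_commit(conn, "a1b2c3d"))
```

With a file-based backend, answering the same two questions would typically mean diffing archived copies of the state file against version control history by hand.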

Topics

  • Terraform state management as a distributed systems problem
  • File-based locking and serialization bottlenecks in infrastructure tooling
  • Moving state from JSON files to database architectures
  • Impact of AI acceleration on infrastructure tooling friction
  • Concurrent operations and transaction isolation in state management
  • Historical context and queryability of infrastructure changes
  • StateGraph's database-based approach to state
  • Scalability patterns for multi-team infrastructure coordination

Transcript

AI is here. AI is here and Terraform's dead. OpenTofu dead, dead, dead, dead things. So reading the next wave of hot takes out there might have you believe that this is actually true. Well, here's a newsflash. The value was never this DSL thing that people see that we call HCL or Terraform Configuration Language. It was the state. And for like over a very long time, over a decade, we've been storing that state in a flat JSON file protected by a single lock. And then when it tips over at like broader scale, we feign surprise and pain, like all the soccer players you see when you watch the World Cup, you know, and someone…

