TNO064: The Realities of Running SONiC at Scale
Brett Likens, an infrastructure automation engineer at Amazon, discusses the realities of running SONiC (Software for Open Networking in the Cloud) at enterprise scale. The conversation covers SONiC's architecture, the business case for disaggregated networking, operational challenges, automation strategies, and lessons learned from deploying an open-source network OS in a large enterprise environment.
Summary
Host Scott interviews Brett Likens, who has nearly 20 years of networking experience spanning ISPs, healthcare, Rackspace, Network to Code, OpsMill, and now Amazon's retail infrastructure team. Brett's current role involves deploying SONiC-based disaggregated networking in an enterprise context — distinct from AWS's hyperscale operations — focused on edge/access switching across Amazon's large retail organization.
Brett explains that SONiC originated at Microsoft roughly a decade ago for hyperscale data center use and has since evolved into a viable enterprise option. Two primary business drivers are pushing enterprise adoption: cost control (reducing recurring vendor fees and per-port costs at scale) and vendor diversity (gaining negotiating leverage and controlling one's own destiny by decoupling the OS from proprietary hardware).
The hardware ecosystem now includes white-box vendors like Edgecore, Wistron, and Celestica offering enterprise-friendly form factors. Feature parity across hardware largely depends on the underlying ASIC (primarily Broadcom's Tomahawk, Trident, and Jericho families, plus Marvell) and on whether the SONiC community or a third party has implemented Switch Abstraction Interface (SAI) support for it. Brett notes that companies like Aviz Networks and PLVision have emerged to fill this development gap, particularly for organizations lacking in-house software expertise or Broadcom SDK access.
From an operational standpoint, Brett describes SONiC's management interfaces — CLI, REST API, and gNMI — as flexible enough to integrate with existing automation stacks without requiring a complete overhaul. The OS's containerized architecture enables nonstop forwarding capabilities during upgrades (warm/soft reboot), reducing planned maintenance impact. CI/CD pipelines are used for image builds, with approximately 70-80% of testing achievable in software via virtual SONiC, reserving physical lab testing for ASIC-specific behaviors.
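The virtual-versus-physical testing split described above can be sketched as a simple partition step in a pipeline. This is a minimal illustration, not Brett's team's actual tooling: the `PipelineCase` type, the `needs_asic` flag, and every test name below are hypothetical, chosen only so that the virtual share lands in the range mentioned in the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineCase:
    """One test in the pipeline (hypothetical structure for illustration)."""
    name: str
    needs_asic: bool  # True when the behavior depends on real forwarding hardware


def split_by_target(cases):
    """Partition cases into a virtual-SONiC stage and a physical-lab stage."""
    virtual = [c for c in cases if not c.needs_asic]
    physical = [c for c in cases if c.needs_asic]
    return virtual, physical


# Hypothetical test inventory; names are assumptions, not from the episode.
CASES = [
    PipelineCase("bgp_session_establishment", needs_asic=False),
    PipelineCase("config_reload_idempotency", needs_asic=False),
    PipelineCase("gnmi_interface_state", needs_asic=False),
    PipelineCase("vlan_membership_cli", needs_asic=False),
    PipelineCase("line_rate_forwarding", needs_asic=True),
]

virtual, physical = split_by_target(CASES)
share = len(virtual) / len(CASES)
print(f"virtual: {len(virtual)}, physical: {len(physical)}, virtual share: {share:.0%}")
```

In a real pipeline the flag would more likely come from test metadata (for example, a marker on each test) than from a hand-maintained list; the point is only that the cheap virtual stage runs first and the scarce lab hardware is reserved for what it alone can verify.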
Brett's team at Amazon effectively acts as an internal TAC, with direct access to source code enabling faster root cause analysis than waiting on vendor support. He acknowledges that enterprise-oriented features such as spanning tree and PoE haven't had the same battle-testing as SONiC's data center features, so some rough edges remain. The containerized architecture also enables edge compute use cases, such as collecting, aggregating, and shipping custom observability data directly from the switch.
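As a rough illustration of the aggregate-then-ship pattern, the sketch below collapses raw per-AP samples into one summary record per access point, which is the kind of work a custom container on the switch could do before sending data upstream. The sample fields (`ap`, `clients`, `retry_pct`) and the `summarize` helper are assumptions made for this example; the episode does not describe the actual data model, and the shipping step is left out.

```python
from collections import defaultdict
from statistics import mean


def summarize(samples):
    """Collapse raw per-AP samples into one summary record per AP."""
    by_ap = defaultdict(list)
    for sample in samples:
        by_ap[sample["ap"]].append(sample)
    return {
        ap: {
            "samples": len(rows),
            "avg_clients": mean(r["clients"] for r in rows),
            "max_retry_pct": max(r["retry_pct"] for r in rows),
        }
        for ap, rows in by_ap.items()
    }


# Hypothetical raw samples collected locally on the switch.
SAMPLES = [
    {"ap": "ap-01", "clients": 10, "retry_pct": 2.0},
    {"ap": "ap-01", "clients": 14, "retry_pct": 5.0},
    {"ap": "ap-02", "clients": 3, "retry_pct": 1.0},
]

summary = summarize(SAMPLES)
print(summary)
```

Shipping only the summary rather than every raw sample is what makes doing this on the switch attractive: the volume sent upstream scales with the number of APs, not the number of samples.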
On the organizational change management side, Brett stresses that transitioning to SONiC is harder than simply adding a new vendor; it represents a paradigm shift requiring software development capability, vendor relationship maturity, team training, and process re-engineering. He recommends this path primarily for organizations with 10,000+ switches, though he expects the viability threshold to decrease over time. His practical advice: start by running virtual SONiC in a lab, and acquire a compatible used switch (e.g., an old Arista) to build a real business case before committing at scale.
Key Insights
- Brett argues that SONiC's enterprise viability is a very recent development — the hardware form factors and feature sets have only recently converged to meet traditional IDF/access switching needs, making this an inflection point rather than a mature market.
- Brett claims that a significant amount of private SONiC development is happening inside large enterprises today that isn't publicly visible, suggesting broader adoption than the community's public activity implies.
- Brett explains that the SAI (Switch Abstraction Interface) creates a dependency on ASIC vendor SDK relationships, and that Broadcom SDK access is prohibitively expensive for most organizations, which is why third-party ecosystem companies like Aviz Networks and PLVision have emerged to absorb that cost across multiple customers.
- Brett argues that Amazon's internal team acts as a functional equivalent to vendor TAC, and that having direct source code access compresses time-to-resolution because they can identify exactly which developer wrote a problematic feature and engage them directly.
- Brett states that approximately 70-80% of SONiC testing can be done in software via virtual SONiC, with physical lab testing reserved for ASIC-specific behaviors — and that 90%+ of real-world bugs are caught at the software layer anyway.
- Brett contends that the containerized architecture of SONiC enables edge compute use cases — such as locally aggregating AP statistics and shipping summaries — that are impossible on traditional fixed-function switch operating systems.
- Brett argues that transitioning to SONiC is an order of magnitude harder than adding a new traditional vendor, because it requires not just retraining teams but also auditing all tooling for hard-coded behavioral assumptions about how network OSes work.
- Brett suggests the economic threshold for SONiC to make business sense is currently around 10,000+ switches, but he expects this number to decrease as the hardware ecosystem and feature sets continue to mature.