N4N057: The Art of Troubleshooting

Ethan Banks and Holly Podbilak discuss a structured methodology for network troubleshooting on the N Is for Networking podcast. They cover steps from gathering information and recreating problems to using tools like AI, logs, and packet captures, while emphasizing the human elements of staying calm, working as a team, and documenting root causes.

Summary

In this episode of N Is for Networking, Ethan Banks leads a detailed walkthrough of his troubleshooting methodology, prompted by a listener question. The episode begins with the foundational mindset principle: don't be defensive when a network problem is reported. Ethan argues that network failures are not personal failures, and that engineers must approach problems with open curiosity rather than ego-driven defensiveness. Holly Podbilak adds that even long-trusted vendors and technologies can fail, reinforcing the need for open-mindedness.

The first actionable step is to gather information thoroughly before attempting any fix. Ethan stresses that trouble tickets are often inaccurate, exaggerated, or speculative. Engineers must investigate the who, what, when, and where of a problem to understand its true scope. A common example discussed is users reporting 'the network is down' when in fact only a single application is unavailable. Holly reinforces this with her own experience of tickets describing 'network failure' where the user could still access the internet generally.

The next step is attempting to recreate the problem independently. Ethan explains that being able to reproduce the issue allows engineers to test solutions without depending on the affected user. However, he acknowledges that many problems are intermittent or environment-specific, making recreation difficult. He recommends simulating the user's environment as closely as possible, such as connecting via VPN if that is where the problem occurs.

A significant portion of the discussion focuses on discerning whether a problem is actually a network issue or an application/service issue. Ethan argues that most tickets that land on a network engineer's desk are not, in fact, network problems. He uses examples like HTTP error codes (401, 500) versus a website that never loads to illustrate how symptoms can reveal whether the problem is at the transport layer or higher up the stack. He encourages engineers to learn application-layer protocols like HTTPS to better make this determination.
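The distinction between an HTTP error code and a site that never loads can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not a tool mentioned in the episode; the function names `classify_status` and `triage_url` and the wording of the hints are assumptions made for the example.

```python
import socket
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status code to a rough triage hint (illustrative wording)."""
    if status in (401, 403):
        return "application: server reachable, authentication or authorization rejected"
    if 500 <= status <= 599:
        return "application: server reachable, but it failed while handling the request"
    if 400 <= status <= 499:
        return "application: server reachable, but it rejected the request"
    return "ok: server reachable and responding"


def triage_url(url, timeout=5.0):
    """Fetch url once and report which part of the stack the symptom points to."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as exc:
        # An HTTP error code means a full request/response exchange took place,
        # so the path to the server works. HTTPError must be caught before
        # URLError because it is a subclass of it.
        return classify_status(exc.code)
    except (urllib.error.URLError, socket.timeout, ConnectionError) as exc:
        # No HTTP response at all: look at name resolution, routing,
        # firewalls, and the TCP/TLS handshake.
        return f"transport or below: no HTTP response received ({exc})"


print(classify_status(401))
print(classify_status(500))
```

Calling `triage_url` against the affected site from the user's vantage point (for example, over the same VPN) gives a first hint about whether to keep the ticket or hand it to the application team. A single probe is only a hint, not proof.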

The conversation then turns to tools and information sources available for troubleshooting. Ethan highlights diff reports (configuration change comparisons) as a high-value starting point, especially when problems emerge after a change window. He also advocates for AI-powered network management tools, noting that they can surface anomalies in log data far more efficiently than manual review. Holly corroborates this, sharing examples where AI dashboards identified problems like bad cables that manual review would have missed. Logs, SNMP-based network management systems, flow data (NetFlow/IPFIX), and physical observation are all discussed as complementary sources of diagnostic information.
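A diff report of the kind described above can be approximated with Python's standard `difflib` module. This is a sketch only: the configuration snippets are hypothetical, the helper name `config_diff` is an assumption, and real network management platforms produce these reports in their own formats.

```python
import difflib


def config_diff(before, after, before_label="yesterday", after_label="today"):
    """Return a unified diff of two configuration snapshots as a list of lines."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=before_label,
            tofile=after_label,
            lineterm="",
        )
    )


# Hypothetical snapshots; the syntax is illustrative, not taken from the episode.
yesterday = """interface Ethernet1
 description uplink
 mtu 9000
"""
today = """interface Ethernet1
 description uplink
 mtu 1500
"""

report = config_diff(yesterday, today)
print("\n".join(report))
```

Lines starting with `-` were present only in the earlier snapshot and lines starting with `+` only in the later one, so a change made during a maintenance window stands out immediately.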

Packet captures via tools like Wireshark or tcpdump are described as a last resort after more obvious sources of insight have been exhausted, though Ethan acknowledges their value, particularly in automated capture infrastructures triggered by network events.

The episode covers the critical discipline of changing only one thing at a time when attempting fixes, and reverting changes that don't resolve the problem. Ethan argues this practice is essential for building institutional knowledge, as it allows engineers to know precisely what fixed a problem when it recurs. He also recommends using whiteboards to step back from the details and walk through a systematic packet-flow diagram, especially under high-pressure conditions. Holly introduces the concept of 'rubber ducking' — talking through a problem out loud to another person (or even to oneself) to surface solutions organically.
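The one-change-at-a-time discipline can be expressed as a small loop. The sketch below is an assumption-laden illustration rather than anything presented in the episode: `try_fixes_one_at_a_time`, the `state` dictionary, and the candidate fixes are all hypothetical stand-ins for real device changes and a real verification test.

```python
def try_fixes_one_at_a_time(candidates, problem_resolved):
    """Apply each candidate fix alone; revert it if the problem persists.

    candidates: list of (name, apply, revert) tuples of callables.
    problem_resolved: callable returning True once the symptom is gone.
    Returns (name of the fix that worked, or None) and a log of attempts.
    """
    log = []
    for name, apply, revert in candidates:
        apply()
        if problem_resolved():
            log.append((name, "kept"))
            return name, log
        # The change did not help, so undo it before trying the next one.
        revert()
        log.append((name, "reverted"))
    return None, log


# Hypothetical device state standing in for a real configuration.
state = {"mtu": 1500, "acl": "permit"}


def set_key(key, value):
    """Build a callable that sets one key, so each change is isolated."""
    return lambda: state.__setitem__(key, value)


candidates = [
    ("loosen ACL", set_key("acl", "permit any"), set_key("acl", "permit")),
    ("raise MTU", set_key("mtu", 9000), set_key("mtu", 1500)),
]

fixed_by, attempts = try_fixes_one_at_a_time(
    candidates, lambda: state["mtu"] == 9000
)
print(fixed_by, attempts)
```

Because every unsuccessful change is reverted before the next is tried, the attempt log doubles as documentation of exactly which change resolved the problem.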

The human and organizational dimensions of troubleshooting are given significant attention. Ethan discusses formal incident response structures with incident commanders, the importance of keeping managers and stakeholders informed, and the critical role of help desk staff who absorb frontline user frustration. He strongly advocates for team-based troubleshooting, arguing that collaborative environments without ego produce better outcomes. Both hosts acknowledge the emotional difficulty of troubleshooting under pressure, with Ethan sharing personal anecdotes of frustration-driven physical outbursts, ultimately advising engineers to stay calm and seek help rather than trying to be a solo hero.

The episode closes with reminders to perform root cause analysis, document findings, and make business-level recommendations for mitigating future recurrences — framing the cost of prevention against the risk of the problem happening again.
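One simple way to frame prevention cost against recurrence risk is an expected-loss calculation. The sketch below is an assumption, not a formula given in the episode, and every figure in it is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def expected_annual_loss(incidents_per_year, cost_per_incident):
    """Expected yearly cost of leaving the problem unmitigated."""
    return incidents_per_year * cost_per_incident


def net_annual_benefit(incidents_per_year, cost_per_incident,
                       mitigation_cost_per_year, risk_reduction):
    """Loss avoided by a mitigation minus what the mitigation costs per year.

    risk_reduction is the assumed fraction of incidents prevented (0.0 to 1.0).
    """
    avoided = expected_annual_loss(incidents_per_year, cost_per_incident) * risk_reduction
    return avoided - mitigation_cost_per_year


# Hypothetical figures: 2 outages a year at 40,000 each, and a 30,000-per-year
# mitigation assumed to prevent half of them.
benefit = net_annual_benefit(2, 40_000, 30_000, 0.5)
print(benefit)
```

A positive result suggests the mitigation pays for itself under the stated assumptions, and a negative one suggests the business may reasonably accept the risk. Real estimates of frequency and cost are usually rough, so the output is a conversation aid rather than a verdict.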

Key Insights

  • Ethan argues that the majority of tickets that reach network engineers are not actually network problems, but application or service issues that get routed to networking by default because no one else can identify the cause.
  • Ethan claims that trouble tickets are often inaccurate, exaggerated, or speculative, and should be treated as a starting point for investigation rather than an accurate description of the problem.
  • Ethan contends that network engineers who understand application-layer protocols like HTTPS are better equipped to demonstrate that a problem is not a network issue and hand it off appropriately to the responsible team.
  • Ethan argues that changing multiple things simultaneously during troubleshooting is counterproductive because it prevents engineers from knowing which specific change resolved the problem, degrading institutional knowledge for future incidents.
  • Ethan asserts that AI-powered network operations tools are now genuinely valuable and should not be dismissed by experienced engineers who default to CLI-based, old-school troubleshooting approaches.
  • Holly observes that by the time a problem reaches her as a vendor representative, basic triage has already been completed by multiple other teams, meaning the issues she sees tend to be genuinely esoteric or involve cross-vendor interoperability.
  • Ethan claims that diff reports — showing configuration changes between one day and the next — frequently serve as a smoking gun in troubleshooting, especially when problems emerge after a change window.
  • Ethan argues that engineers should not suspect technologies they don't understand well without specific evidence, because network components don't change on their own; blaming unfamiliar technologies without cause is an unproductive troubleshooting habit.
  • Ethan contends that troubleshooting is a team sport, and that the best outcomes occur when people with diverse experience collaborate without ego or political motivation to deflect blame.
  • Holly introduces the concept of 'rubber ducking,' arguing that simply verbalizing a problem out loud to another person — even one who doesn't understand the technology — often surfaces the solution organically during the act of explanation.
  • Ethan argues that fixing a problem is not the end of an incident, and that root cause analysis documentation is critical for reducing mean time to recovery in future recurrences — something he notes network engineers are typically poor at doing.
  • Ethan claims that maintaining emotional composure during high-pressure outages has a stabilizing effect on the broader team, and that engineers who lose their cool amplify stress across the organization rather than just experiencing it themselves.

Topics

  • Troubleshooting methodology
  • Network vs. application problem diagnosis
  • AI-powered network management tools
  • Log analysis and diff reports
  • Packet capture and Wireshark
  • Change management during troubleshooting
  • Incident response and team dynamics
  • Root cause analysis and documentation
  • Human factors in IT troubleshooting
  • OSI model as a troubleshooting framework
