When a Network Switch Goes Dark for No Reason: Here's What's Actually Happening

 Issue Reported by Customer

Device: AS9716-32D
Problem: Device became unreachable — SSH and console access failed.

Customer Action Taken

Power cycle performed

Device recovered and became reachable again

Customer Concern

  • Why did the device suddenly become unreachable?

  • Why was a critical temperature alert triggered at 66°C while the platform shows higher thresholds?

The logs told the story

During log analysis, we identified thermal warnings followed by a critical shutdown event.

Temperature Spike Logs

Feb 3 18:55:27 WARNING pmon#thermalctld: Temperature of CPU Package Temp changed too fast, from 32.0 to 66.0

Feb 3 18:55:27 WARNING pmon#thermalctld: Temperature of CPU Core 0 Temp changed too fast, from 32.0 to 66.0

Feb 3 18:55:46 CRIT - Monitor CPU Temp, temperature is 66.0. Temperature is over 66.0. Need shutdown DUT.

Immediately after this event: 

  • Device stopped responding

  • SSH and console access unavailable
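The warning and critical lines quoted above follow a regular shape, so they can be pulled out of a log capture with a short script. The following is a minimal sketch, not the tooling used in this investigation: the regular expressions are inferred from these three sample lines only, and the names `JUMP_RE`, `SHUTDOWN_RE`, and `summarize` are illustrative. Other message variants may need their own patterns.

```python
import re

# Sample lines copied from the log excerpt above.
LOG_LINES = [
    "Feb 3 18:55:27 WARNING pmon#thermalctld: Temperature of CPU Package Temp changed too fast, from 32.0 to 66.0",
    "Feb 3 18:55:27 WARNING pmon#thermalctld: Temperature of CPU Core 0 Temp changed too fast, from 32.0 to 66.0",
    "Feb 3 18:55:46 CRIT - Monitor CPU Temp, temperature is 66.0. Temperature is over 66.0. Need shutdown DUT.",
]

# Assumption: patterns are derived from the sample lines, not from a log format spec.
JUMP_RE = re.compile(
    r"Temperature of (?P<sensor>.+?) changed too fast, "
    r"from (?P<old>\d+(?:\.\d+)?) to (?P<new>\d+(?:\.\d+)?)"
)
SHUTDOWN_RE = re.compile(
    r"temperature is (?P<temp>\d+(?:\.\d+)?)\. .*Need shutdown"
)


def summarize(lines):
    """Return (jumps, shutdown_temps) extracted from the given log lines."""
    jumps = []           # (sensor, old, new, delta) per "changed too fast" warning
    shutdown_temps = []  # temperature reported in each shutdown request
    for line in lines:
        m = JUMP_RE.search(line)
        if m:
            old, new = float(m["old"]), float(m["new"])
            jumps.append((m["sensor"], old, new, new - old))
            continue
        m = SHUTDOWN_RE.search(line)
        if m:
            shutdown_temps.append(float(m["temp"]))
    return jumps, shutdown_temps


jumps, shutdown_temps = summarize(LOG_LINES)
for sensor, old, new, delta in jumps:
    print(f"{sensor}: {old} -> {new} (+{delta})")
print("shutdown requested at:", shutdown_temps)
```

Run against a full log, a summary like this makes it easy to see whether a shutdown request was preceded by a gradual climb or by a single large jump between two consecutive readings, as it was here.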

After the reboot: all clear

After the power cycle, system temperature returned to normal.

Platform Temperature Output

CPU Core 0 Temp: 31°C

CPU Core 1 Temp: 31°C

CPU Core 2 Temp: 31°C

CPU Core 3 Temp: 31°C

CPU Package Temp: 31°C

All sensors were within normal operating range.

Investigation Performed

We verified the following components:

Hardware Sensors

show environment

All temperature sensors were normal:

Component      Temperature
CPU Package    ~32°C
PSU1           ~30°C
PSU2           ~38°C
Main Board     28–34°C

No abnormal readings were detected.

Key Observation

A discrepancy was noticed between the platform temperature thresholds and the thermal shutdown policy.

Platform Output

High Threshold: 82°C

Critical Threshold: 104°C

However, the actual thermal shutdown policy defines:

CPU Thermal Shutdown Threshold = 66°C

This explains why the device initiated shutdown at 66°C.
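To make the mismatch concrete, the sketch below evaluates a reading against both sets of values quoted above. It is an illustration only, not the platform's thermal policy code: the function name `evaluate` and the use of `>=` for the comparisons are assumptions (the log reports the shutdown with the reading at exactly 66.0).

```python
# Values taken from the platform output and the shutdown policy quoted above.
DISPLAYED_HIGH_C = 82.0
DISPLAYED_CRITICAL_C = 104.0
POLICY_SHUTDOWN_C = 66.0


def evaluate(temp_c):
    """Return (what the displayed thresholds imply, what the policy does)."""
    if temp_c >= DISPLAYED_CRITICAL_C:
        displayed = "critical"
    elif temp_c >= DISPLAYED_HIGH_C:
        displayed = "high"
    else:
        displayed = "normal"
    # Assumption: shutdown triggers at or above the policy threshold.
    policy = "shutdown" if temp_c >= POLICY_SHUTDOWN_C else "run"
    return displayed, policy


for reading in (31.0, 50.0, 66.0):
    print(reading, evaluate(reading))
```

For a reading of 66.0 the displayed thresholds still classify the CPU as normal while the policy has already requested a shutdown, and the policy threshold sits 16°C below the displayed High Threshold.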

Lab Reproduction Testing

The RD team reproduced the scenario on a lab device with identical hardware.

Stress Test Results

  • CPU utilization: 99%

  • Cores under full load

Maximum temperature observed: 50°C

Even under heavy load, the system did not reach 66°C.

Possible Root Cause Scenarios

After analysis, two potential scenarios were identified.

Scenario 1 — Sudden CPU Load Spike

A rapid increase in CPU utilization could cause a temporary thermal spike.

Possible causes:

  • burst processing

  • system task spikes

  • unexpected process load

Scenario 2 — External Heat Source

Another possibility is external airflow heat impact, such as:

  • hot air from adjacent device

  • rack airflow imbalance

However, the customer confirmed:

  • rack environment normal

  • fan status healthy

  • traffic load normal

Therefore the CPU spike scenario is considered more likely.

Identified Design Issue

A display inconsistency was discovered: the platform temperature output does not reflect the actual shutdown threshold.

This created confusion because:

  • Output shows Critical Threshold = 104°C

  • Actual shutdown occurs at 66°C

This has been reported to the RD team for correction.

Implemented Solution

To correct the mismatch and assist further debugging, we delivered two items:

  1. Fix to the thermal values

  2. Monitoring package

Package Name: systatmonit_1.0_amd64.deb

Installation Procedure

Upload the package to the device and install it.

Installation completes in 1–2 seconds and does not impact system functionality.

Monitoring Behavior

Once installed:

  • The thermal policy mismatch is fixed

  • System resource data is logged every 5 minutes

  • The data is included in the support dump

If the issue occurs again, the collected logs will provide detailed diagnostic information.
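The internals of the monitoring package are not described here, but the idea of periodic resource logging can be sketched in a few lines. The following is a hypothetical illustration, not the systatmonit implementation: the function names, the record format, and the sysfs path `/sys/class/thermal/thermal_zone0/temp` are all assumptions, and the sensor exposed at that path may not be the CPU sensor on this platform.

```python
import time
from pathlib import Path

INTERVAL_S = 300  # 5 minutes, matching the logging interval described above


def parse_loadavg(text):
    """Parse /proc/loadavg content into (1 min, 5 min, 15 min) load averages."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def read_temp_c(path="/sys/class/thermal/thermal_zone0/temp"):
    """Read a sysfs temperature in millidegrees C; return None if unavailable.

    Assumption: the default path is a generic Linux thermal zone, not a
    confirmed sensor location for this switch.
    """
    try:
        return int(Path(path).read_text().strip()) / 1000.0
    except (OSError, ValueError):
        return None


def format_record(ts, loads, temp_c):
    """Build one log line from a timestamp, load averages, and a temperature."""
    temp = "n/a" if temp_c is None else f"{temp_c:.1f}"
    return (f"{ts} load1={loads[0]:.2f} load5={loads[1]:.2f} "
            f"load15={loads[2]:.2f} temp_c={temp}")


def run(log_path, iterations, interval_s=INTERVAL_S):
    """Append one record per interval to log_path (Linux only)."""
    with open(log_path, "a") as log:
        for i in range(iterations):
            loads = parse_loadavg(Path("/proc/loadavg").read_text())
            ts = time.strftime("%Y-%m-%dT%H:%M:%S")
            log.write(format_record(ts, loads, read_temp_c()) + "\n")
            log.flush()  # keep records on disk if the device shuts down abruptly
            if i + 1 < iterations:
                time.sleep(interval_s)
```

Calling `run("/tmp/sysmon.log", iterations=3, interval_s=1)` on a Linux host writes three records one second apart. Logging load and temperature side by side is what would let a future event confirm or rule out the CPU load spike scenario.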

Future Improvement

The monitoring utility will be integrated natively in a future software release.

This will allow:

  • automatic system monitoring

  • faster root cause analysis

  • better reliability


TL;DR

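The device shut down because the CPU temperature reading hit 66°C, which is the real shutdown threshold, even though the platform output displays a critical threshold of 104°C. A sudden CPU load spike is considered the most likely cause of the temperature jump, although the lab stress test peaked at only 50°C. We fixed the display mismatch, deployed monitoring, and flagged the design issue for a permanent fix.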


Category            Details
Issue               Device became unreachable
Trigger             CPU temperature reached 66°C
Root Cause          Thermal shutdown policy threshold
Confusion           Platform output displayed different threshold
Action Taken        Thermal values fix and Monitoring tool deployed
Preventive Measure  Continuous system resource logging



