When a Network Switch Goes Dark for No Reason: Here's What's Actually Happening
Issue Reported by Customer
Device: AS9716-32D
Problem: Device became unreachable — SSH and console access failed.
Customer Action Taken
Power cycle performed
Device recovered and became reachable again
Customer Concern
Why did the device suddenly become unreachable?
Why was a critical temperature alert triggered at 66°C while the platform shows higher thresholds?
The logs told the story
During log analysis, we identified thermal warnings followed by a critical shutdown event.
Temperature Spike Logs
Feb 3 18:55:27 WARNING pmon#thermalctld: Temperature of CPU Package Temp changed too fast, from 32.0 to 66.0
Feb 3 18:55:27 WARNING pmon#thermalctld: Temperature of CPU Core 0 Temp changed too fast, from 32.0 to 66.0
Feb 3 18:55:46 CRIT - Monitor CPU Temp, temperature is 66.0. Temperature is over 66.0. Need shutdown DUT.
Immediately after this event:
Device stopped responding
SSH and console access unavailable
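To make the size of the jump explicit, the warnings can be parsed with a short script. The following Python sketch is illustrative only: the function name find_jumps and the regular expression are assumptions written against the three log lines quoted above, not part of any platform tooling.

```python
import re

# Matches the "changed too fast" warning exactly as it appears in the logs above.
FAST_CHANGE = re.compile(
    r"Temperature of (?P<sensor>.+?) changed too fast, "
    r"from (?P<old>\d+(?:\.\d+)?) to (?P<new>\d+(?:\.\d+)?)"
)


def find_jumps(lines):
    """Return (sensor, old, new, delta) for every 'changed too fast' warning."""
    jumps = []
    for line in lines:
        match = FAST_CHANGE.search(line)
        if match:
            old = float(match["old"])
            new = float(match["new"])
            jumps.append((match["sensor"], old, new, new - old))
    return jumps


# The three log lines from this incident.
SAMPLE = [
    "Feb 3 18:55:27 WARNING pmon#thermalctld: Temperature of CPU Package Temp changed too fast, from 32.0 to 66.0",
    "Feb 3 18:55:27 WARNING pmon#thermalctld: Temperature of CPU Core 0 Temp changed too fast, from 32.0 to 66.0",
    "Feb 3 18:55:46 CRIT - Monitor CPU Temp, temperature is 66.0. Temperature is over 66.0. Need shutdown DUT.",
]

for sensor, old, new, delta in find_jumps(SAMPLE):
    print(f"{sensor}: {old} -> {new} (jump of {delta} degrees C)")
```

Both warnings report the same 34-degree jump, and the shutdown message follows 19 seconds later.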
After the reboot: all clear
After the power cycle, system temperature returned to normal.
Platform Temperature Output
CPU Core 0 Temp: 31°C
CPU Core 1 Temp: 31°C
CPU Core 2 Temp: 31°C
CPU Core 3 Temp: 31°C
CPU Package Temp: 31°C
All sensors were within normal operating range.
Investigation Performed
We verified the following components:
Hardware Sensors
show environment
All temperature sensors were normal.
No abnormal readings were detected.
Key Observation
A discrepancy was noticed between the platform temperature thresholds and the thermal shutdown policy.
Platform Output
High Threshold: 82°C
Critical Threshold: 104°C
However, the actual thermal shutdown policy defines:
CPU Thermal Shutdown Threshold = 66°C
This explains why the device initiated shutdown at 66°C.
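The mismatch can be expressed as a simple check. This is a minimal Python sketch, not the platform's implementation: the names Thresholds, effective_shutdown_c, and mismatch_warning are hypothetical, and the only real inputs are the three values quoted above (82, 104, and 66).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    high_c: float      # "High Threshold" from the platform output
    critical_c: float  # "Critical Threshold" from the platform output


def effective_shutdown_c(displayed: Thresholds, policy_shutdown_c: float) -> float:
    """The device powers off at whichever limit is reached first."""
    return min(displayed.critical_c, policy_shutdown_c)


def mismatch_warning(displayed: Thresholds, policy_shutdown_c: float) -> Optional[str]:
    """Describe the gap when the shutdown policy is stricter than the displayed values."""
    if policy_shutdown_c >= displayed.critical_c:
        return None
    message = (
        f"shutdown policy ({policy_shutdown_c:g}°C) is below the displayed "
        f"critical threshold ({displayed.critical_c:g}°C)"
    )
    if policy_shutdown_c < displayed.high_c:
        message += f" and below the displayed high threshold ({displayed.high_c:g}°C)"
    return message


# Values taken from the platform output and the shutdown policy in this incident.
displayed = Thresholds(high_c=82.0, critical_c=104.0)
print(effective_shutdown_c(displayed, 66.0))
print(mismatch_warning(displayed, 66.0))
```

With these values the shutdown policy sits below even the displayed high threshold, so the device powers off before the displayed values would suggest any warning is due.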
Lab Reproduction Testing
The RD team reproduced the scenario on a lab device with identical hardware.
Stress Test Results
CPU utilization: 99%
Cores under full load
Maximum temperature observed: 50°C
Even under heavy load, the system did not reach 66°C.
Possible Root Cause Scenarios
After analysis, two potential scenarios were identified.
Scenario 1 — Sudden CPU Load Spike
A rapid increase in CPU utilization could cause a temporary thermal spike.
Possible causes:
burst processing
system task spikes
unexpected process load
Scenario 2 — External Heat Source
Another possibility is heat from an external source, such as:
hot air from an adjacent device
rack airflow imbalance
However, the customer confirmed:
rack environment normal
fan status healthy
traffic load normal
Therefore the CPU spike scenario is considered more likely.
Identified Design Issue
A display inconsistency was discovered: the platform temperature output does not reflect the actual shutdown threshold.
This created confusion because:
Output shows a critical threshold of 104°C
Actual shutdown occurs at 66°C
This has been reported to the RD team for correction.
Implemented Solution
To assist further debugging, we developed a system monitoring tool.
Fix to the Thermal Values
Monitoring Package
Package Name: systatmonit_1.0_amd64.deb
Installation Procedure
Upload package to the device and run:
Installation completes in 1–2 seconds and does not impact system functionality.
Monitoring Behavior
Once installed:
The thermal policy mismatch is fixed
System resource data is logged every 5 minutes
The data is included in the support dump
If the issue occurs again, the collected logs will provide detailed diagnostic information.
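The general idea of periodic resource logging can be sketched in a few lines of Python. This is not the implementation of systatmonit, whose internals are not described here. It is a hypothetical sampler that assumes a Linux system exposing temperatures under /sys/class/thermal (values in millidegrees Celsius), which may not match how this platform exposes its CPU sensors. All function names are illustrative.

```python
import glob
import json
import os
import time

INTERVAL_S = 5 * 60  # the 5-minute logging interval mentioned above


def read_thermal_zones(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Read Linux sysfs thermal zones; returns {} when none are present."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                # sysfs reports millidegrees Celsius
                temps[path.split("/")[-2]] = int(handle.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return temps


def take_sample(now=None, load=None, temps=None):
    """Build one record; the arguments allow injecting known values for testing."""
    return {
        "ts": time.time() if now is None else now,
        "load_1m": os.getloadavg()[0] if load is None else load,
        "temps_c": read_thermal_zones() if temps is None else temps,
    }


def run(log_path, samples, interval_s=INTERVAL_S):
    """Append one JSON line per sample, flushing so data survives a sudden shutdown."""
    with open(log_path, "a") as log:
        for index in range(samples):
            log.write(json.dumps(take_sample()) + "\n")
            log.flush()
            if index + 1 < samples:
                time.sleep(interval_s)


# Demonstration with injected values, so the output does not depend on the host.
print(json.dumps(take_sample(now=0, load=0.5, temps={"thermal_zone0": 31.0})))
```

Flushing after every write matters in a case like this one, where the device powers off seconds after the critical message. A 5-minute interval could still miss a spike that develops within one interval, so a shorter interval may be worth considering while an issue is being chased.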
Future Improvement
The monitoring utility will be integrated natively in a future software release.
This will allow:
automatic system monitoring
faster root cause analysis
better reliability
TL;DR
The device shut down because the CPU reading hit 66°C, which is the real shutdown threshold, even though the platform output shows a critical threshold of 104°C. A sudden CPU load spike is the most likely cause of the temperature jump, although the lab stress test peaked at only 50°C. We fixed the threshold mismatch, deployed monitoring, and reported the display issue to the RD team for a permanent fix.

