PCIe Fatal Surprise Link Down
Error:
00d2 20/07/16 13:12:49 CRT Bios Critical Interrupt #04 PCIe Fatal Surprise Link Down (00:02.2) 69 [a1 00 12]
00d3 20/07/16 13:12:49 MAJ Bios Critical Interrupt #05 PCIe Warn Receiver Error (00:02.2) 70 [a0 00 12]
Cause:
The event is raised by the PCI bridge chip.
It can be caused by a hardware failure or by an outdated BIOS version.
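When a SEL dump has already been saved to a file, the events of interest can be filtered out before deciding on next steps. The following is a minimal sketch, assuming one event per line in the same layout as the sample under Error; the function name `surprise_link_down` and the file path are hypothetical.

```shell
# Print "<date> <time> <pci-address>" for every
# "PCIe Fatal Surprise Link Down" event in a saved SEL dump.
# Assumption: each event is on its own line, with the record ID first,
# then date and time, and the PCI address in parentheses.
surprise_link_down() {
    grep 'PCIe Fatal Surprise Link Down' "$1" |
        sed -n 's/^[^ ]* \([^ ]* [^ ]*\) .*(\([0-9a-fA-F:.]*\)).*$/\1 \2/p'
}
```

Example: `surprise_link_down /tmp/sdw4_sel.log` lists the time and PCI address of each matching event, which helps show whether the same device reports the error repeatedly.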
Solution:
Check the BIOS version of the problematic node:
[root@sdw4 ~]# syscfg -i
System Configuration Utility Version 12.0 Build 9
Copyright (c) 2013 Intel Corporation
System BIOS and FW Versions:
BIOS Version.............. XXXXXXX.XXX.XX.XX.XXXX
BMC Version
Boot Code............... 01.17
Op Code................. 1.17.4151
1. If the node is on an older BIOS version, ask the customer to upgrade the node to the latest version.
2. If the node is already on the latest version and the error persists, collect the following logs:
a. If you can log in to the node:
1. System Event Log
2. BIOS version
3. dca_profiler log
b. If the node is unreachable, run the following command from mdw:
1. ipmiutil sel list -N sdw#-sp -U root -P <password> > /tmp/sdw#_sel.log
Note: The default password is calvin on V1 and sephiroth on V2. If the customer has changed the password, obtain it from them.
After collecting the logs, the recommended action is to replace the node.
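If more than one node is unreachable, the SEL collection command from step b can be wrapped in a small helper and run from mdw. This is only a sketch: the function name `collect_sel` and the variables `IPMI_PASSWORD`, `IPMIUTIL`, and `OUT_DIR` are hypothetical, and the `ipmiutil` arguments are copied unchanged from the command above.

```shell
# Save the SEL of one unreachable node through its service processor (<node>-sp).
# IPMI_PASSWORD, IPMIUTIL and OUT_DIR are hypothetical variables for this sketch:
#   IPMI_PASSWORD  password for the root user (required)
#   IPMIUTIL       command to run, defaults to ipmiutil
#   OUT_DIR        directory for the log file, defaults to /tmp
collect_sel() {
    node=$1
    # Abort with a message if no password has been supplied.
    : "${IPMI_PASSWORD:?set IPMI_PASSWORD to the service processor password}"
    "${IPMIUTIL:-ipmiutil}" sel list -N "${node}-sp" -U root -P "$IPMI_PASSWORD" \
        > "${OUT_DIR:-/tmp}/${node}_sel.log"
}
```

Example: `for n in sdw3 sdw4; do collect_sel "$n"; done` writes /tmp/sdw3_sel.log and /tmp/sdw4_sel.log. Reading the password from a variable keeps it out of the script itself.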
Special Notes:
Two other recommendations:
Manually upgrade the BIOS and BMC on the failed server using the instructions provided here:
https://emc--c.na5.visual.force.com/apex/KB_HowTo?id=kA0700000004W8j
If the node is unreachable, power it off and remove the power cables. Then reconnect the cables, power the node on, and perform the manual update.
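After a manual update, the BIOS version can be confirmed with the same `syscfg -i` command used in the Solution. The sketch below pulls only the version string out of that output; the function name `bios_version` is hypothetical, and it assumes the "BIOS Version" field appears on its own line followed by a run of dots, as in the sample output.

```shell
# Print the BIOS version field from `syscfg -i` output read on stdin.
# Assumption: the field looks like "BIOS Version......... <version>".
bios_version() {
    sed -n 's/^ *BIOS Version\.\.* *//p' | head -n 1
}
```

Example: `syscfg -i | bios_version` prints the version string alone, which can then be compared with the latest release for the platform.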