PCIe Fatal Surprise Link Down

Error:
00d2 20/07/16 13:12:49 CRT Bios Critical Interrupt #04 PCIe Fatal Surprise Link Down (00:02.2) 69 [a1 00 12]
00d3 20/07/16 13:12:49 MAJ Bios Critical Interrupt #05 PCIe Warn Receiver Error (00:02.2) 70 [a0 00 12]
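
When a System Event Log contains many entries, it can help to pull out only the PCIe events and the device address each one refers to. The following is a minimal sketch, assuming the log lines have the same layout as the two entries above; `pcie_events` is a hypothetical helper name, not part of any vendor tool.

```shell
# Hypothetical helper: read SEL text on stdin and print "<device> <event>"
# for every line that carries a PCIe event followed by "(bus:dev.fn)".
# Assumes the line layout shown in the sample entries above.
pcie_events() {
    sed -n 's/.*\(PCIe [^(]*[^( ]\) *(\([0-9A-Fa-f:.]*\)).*/\2 \1/p'
}
```

For example, `pcie_events < /tmp/sdw4_sel.log | sort | uniq -c` would count how often each event was logged per device, which shows whether the errors are concentrated on a single bridge.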

Cause:
The error is raised by the PCI bridge chip.
It can be caused by a hardware failure or by an outdated BIOS version.

Solution:
Check the BIOS version of the problematic node:

[root@sdw4 ~]# syscfg -i
 System Configuration Utility Version 12.0 Build 9
 Copyright (c) 2013 Intel Corporation
System BIOS and FW Versions:

BIOS Version.............. XXXXXXX.XXX.XX.XX.XXXX

BMC Version
 Boot Code............... 01.17
 Op Code................. 1.17.4151
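
To compare BIOS versions across several nodes, the version field can be extracted from the `syscfg -i` output. This is a rough sketch that assumes the "BIOS Version" line is formatted as in the output above (label, a run of dots, then the version); `bios_version` is a hypothetical helper name.

```shell
# Hypothetical helper: read `syscfg -i` output on stdin and print only the
# value of the "BIOS Version" line (label and dot leader removed).
bios_version() {
    sed -n 's/^[[:space:]]*BIOS Version\.\.*[[:space:]]*//p' | head -n 1
}
```

Usage on a node would be `syscfg -i | bios_version`.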

1. If the node is running an older BIOS version, ask the customer to upgrade it to the latest version.
2. If the node is already on the latest version and the error persists, collect the logs listed below.
   a. If you can log in to the node:
      1. System Event Log
      2. BIOS version
      3. dca_profiler log
   b. If the node is unreachable, run the following command from mdw:
      1. ipmiutil sel list -N sdw#-sp -U root -P <password> > /tmp/sdw#_sel.log

Note: The default password is calvin on V1 and sephiroth on V2. If the customer has changed the password, obtain it from them.
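
If more than one node is unreachable, the same ipmiutil command can be wrapped in a small loop run from mdw. This is only a sketch: `collect_sel` is a hypothetical helper, and `IPMIUTIL`, `IPMI_PASSWORD` and `SEL_OUTDIR` are environment variables assumed here for illustration; they are not ipmiutil features.

```shell
# Hypothetical helper: save the SEL of each named node to <outdir>/<node>_sel.log.
# IPMI_PASSWORD must hold the service processor password (assumed variable).
# IPMIUTIL and SEL_OUTDIR are optional overrides (assumed variables).
collect_sel() {
    outdir=${SEL_OUTDIR:-/tmp}
    for node in "$@"; do
        # Same invocation as in the step above, aimed at the node's -sp address.
        "${IPMIUTIL:-ipmiutil}" sel list -N "${node}-sp" -U root -P "$IPMI_PASSWORD" \
            > "${outdir}/${node}_sel.log" \
            || echo "failed to read SEL from ${node}" >&2
    done
}
```

For example, `IPMI_PASSWORD=<password> collect_sel sdw4 sdw5` would write /tmp/sdw4_sel.log and /tmp/sdw5_sel.log.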

After collecting the logs, the recommended action is to replace the node.

Special Notes:
Two other recommendations:
Manually upgrade the BIOS and BMC on the failed server using the instructions provided here:
https://emc--c.na5.visual.force.com/apex/KB_HowTo?id=kA0700000004W8j

If the node is unreachable, power it off and remove the power cables. Then reconnect the cables, power it on, and perform the manual update.
