Alert Stating VD Degraded | VD nonCritical | Array Disk Critical

Issue:
Virtual Disk # State: /dev/sd#: Degraded
Virtual Disk # Status: /dev/sd#: nonCritical
Array Disk # Status: Critical
Array Disk # Status: nonCritical

Cause:
The drive is beginning to fail, and a dial-home alert was sent before an actual drive failure. Several conditions can trigger this, so collect and review the logs as described below.

Solution:
Use 'ssh' to log in to the host named in the alert. For example, if the alert shows Component: sdw2 : sdw2, log in to sdw2 and create a directory for the logs:

mkdir -p /data1/LCollect/Disk_Logs
cd /data1/LCollect/Disk_Logs

Run the following commands to collect the logs:
1. omreport storage vdisk controller=0

2. omreport storage vdisk controller=0 -fmt tbl | awk -F"|" '{print $3"|"$2"|"$4"|"$6"|"$7"|"$11"|"$12"|"$13"|"$15}' | egrep -v '\-\-\-\-|\|\|'

3. omreport storage pdisk controller=0

4. omreport storage pdisk controller=0 -fmt tbl | awk -F"|" '{print $1"|"$2"|"$3"|"$4"|"$5"|"$9"|"$15"|"$18"|"$19}' | egrep -v '\-\-\-\-|\|\|'

5. omreport system esmlog

6. omreport system alertlog

7. omconfig storage controller action=exportlog controller=0
(the output file is written to the /var/log/ directory with a name of the form "lsi_MMDD.log", where MM is the month and DD is the day)

8. cp /var/log/lsi_0407.log /data1/LCollect/Disk_Logs
(lsi_0407.log is the file name for a log exported on April 7; use the name generated in step 7)

9. tar czvf Messages.tgz /var/log/messages*
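
The commands above print to the terminal, so their output has to end up in /data1/LCollect/Disk_Logs to be included in the bundle. A minimal sketch of one way to do that is shown here. The helper names run_and_save and lsi_log_path and the .txt file names are hypothetical, and the sketch assumes the controller log is exported on the same day it is copied.

```shell
#!/bin/sh
# Sketch only: save each collection command's output to its own file.
# OUT_DIR defaults to the directory created earlier; file names are illustrative.
OUT_DIR="${OUT_DIR:-/data1/LCollect/Disk_Logs}"

# run_and_save NAME COMMAND...: run COMMAND, store stdout and stderr in $OUT_DIR/NAME.txt
run_and_save() {
    name=$1
    shift
    "$@" > "$OUT_DIR/$name.txt" 2>&1
}

# lsi_log_path: print the expected controller log path for today (lsi_MMDD.log)
lsi_log_path() {
    echo "/var/log/lsi_$(date +%m%d).log"
}

# Only attempt the collection on a host that actually has OpenManage installed.
if command -v omreport >/dev/null 2>&1; then
    mkdir -p "$OUT_DIR"
    run_and_save vdisk    omreport storage vdisk controller=0
    run_and_save pdisk    omreport storage pdisk controller=0
    run_and_save esmlog   omreport system esmlog
    run_and_save alertlog omreport system alertlog
    omconfig storage controller action=exportlog controller=0
    cp "$(lsi_log_path)" "$OUT_DIR"
    tar czf "$OUT_DIR/Messages.tgz" /var/log/messages*
else
    echo "omreport not found; skipping collection" >&2
fi
```

Keeping one file per command makes it easier to find the pdisk and vdisk reports after the bundle is unpacked.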

How to package the log files for uploading:
1. cd (this changes to the root user's home directory, /root)
2. tar czvf Disk_Logs.tgz /data1/LCollect/* (tar strips the leading "/", so the archive paths start with data1/LCollect/)
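
Before copying the bundle, it can be worth confirming that it actually contains the collected files. The sketch below lists the archive with tar and checks for a matching path; bundle_has is a hypothetical helper name, not part of any tool mentioned here.

```shell
#!/bin/sh
# bundle_has ARCHIVE PATTERN: succeed if any member path in ARCHIVE matches PATTERN
bundle_has() {
    tar tzf "$1" | grep -q -- "$2"
}

# Example check, run from the directory that holds the bundle.
if [ -f Disk_Logs.tgz ]; then
    bundle_has Disk_Logs.tgz 'Disk_Logs/' ||
        echo "warning: no Disk_Logs/ entries found in Disk_Logs.tgz" >&2
fi
```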

Copy the tar bundle to mdw using 'scp', then log out of the node: scp Disk_Logs.tgz mdw:/tmp
Using WinSCP or any other scp/ftp client, connect to mdw and log in with root credentials. In the WinSCP GUI, change directory to /tmp and copy the Disk_Logs.tgz file to the local workstation. Attach the file to the Service Request.

How do I analyze the logs?

Untar the log files and change the directory:
tar zxvf Disk_Logs.tgz
cd data1/LCollect/Disk_Logs

Check the logs to find out which drive is faulty:
1. omreport storage pdisk controller=0 -fmt tbl | awk -F"|" '{print $1"|"$2"|"$3"|"$4"|"$5"|"$9"|"$15"|"$18"|"$19}' | egrep -v '\-\-\-\-|\|\|'
ID | Status | Name | State | Power Status| Failure Predicted| Capacity | Hot Spare| Vendor ID
0:0:8 | Ok | Physical Disk 0:0:8 | Online| Spun Up | No | 558.38 GB (599550590976 bytes)| No | DELL(tm)
0:0:9 | Critical | Physical Disk 0:0:9 | Online| Spun Up | Yes | 558.38 GB (599550590976 bytes)| No | DELL(tm)
0:0:10| Ok | Physical Disk 0:0:10| Online| Spun Up | No | 558.38 GB (599550590976 bytes)| No | DELL(tm)

2. omreport storage vdisk controller=0 -fmt tbl | awk -F"|" '{print $3"|"$2"|"$4"|"$6"|"$7"|"$11"|"$12"|"$13"|"$15}' | egrep -v '\-\-\-\-|\|\|'
Name | Status | State| Layout| Size | Read Policy | Write Policy | Cache Policy | Disk Cache Policy
Virtual Disk 0| Ok | Ready| RAID-5| 48.01 GB (51548651520 bytes) | Adaptive Read Ahead| Write Back | Not Applicable| Disabled
Virtual Disk 1| Ok | Ready| RAID-5| 2,743.86 GB (2946202337280 bytes)| Adaptive Read Ahead| Write Back | Not Applicable| Disabled
Virtual Disk 2|Degraded| Ready| RAID-5| 48.01 GB (51548651520 bytes) | Adaptive Read Ahead| Write Back | Not Applicable| Disabled
Virtual Disk 3|Degraded| Ready| RAID-5| 2,743.86 GB (2946202337280 bytes)| Adaptive Read Ahead| Write Back | Not Applicable| Disabled

In the output above, check for Write Policy / Cache Policy changes.

3. grep 'Unexpected sense.*Sense key: 3 Sense code: 11' messages* | awk '{print $1,$2,$(NF-4)}' | sort | uniq -c
2 messages.1:Apr 2 0:5:10
45 messages.4:Apr 6 0:2:9
2 messages.4:Apr 27 0:0:9
4 messages:Apr 24 0:0:10

4. grep 'Unexpected sense.*Sense: 3\/11\/00' lsi_0501.log | awk '{print $1,$7,$8,$(NF)}' | sort|uniq -c
8 03/23/16 PD 0a(e0x20/s10) 3/11/00
8 04/06/16 PD 09(e0x20/s9) 3/11/00
167 04/07/16 PD 09(e0x20/s9) 3/11/00 <-- 3/11 errors have occurred 167 times
7 04/27/16 PD 0a(e0x20/s10) 3/11/00
11 05/01/16 PD 0a(e0x20/s10) 3/11/00

5. grep Corrected lsi_0501.log | awk '{print $1,$5,$6,$7,$11,$(NF-2)}'| sort | uniq -c
3 03/23/16 110=Corrected medium error PD 0a(e0x20/s10)
158 04/06/16 110=Corrected medium error PD 09(e0x20/s9) <-- The corrected errors have gone up to 158, which is not a good sign.
3 04/27/16 110=Corrected medium error PD 0a(e0x20/s10)
9 05/01/16 110=Corrected medium error PD 0a(e0x20/s10)

6. In the pdisk output, look for a Status of Critical:
0:0:9 | Critical | Physical Disk 0:0:9 | Online| Spun Up | Yes | 558.38 GB (599550590976 bytes)| No | DELL(tm)

ID : 0:0:9
Status : Critical
Name : Physical Disk 0:0:9
State : Online
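
On a node with many disks, scanning the pdisk table by eye is error-prone. The following sketch filters the trimmed table (the column layout produced by the awk command in step 1) and prints the ID of any disk whose Status is not Ok or whose Failure Predicted column is Yes. flag_pdisks is a hypothetical helper name, and the sample rows are copied from the output above.

```shell
#!/bin/sh
# flag_pdisks: read the trimmed pdisk table on stdin and print the ID of each
# disk that is not Ok or that reports a predicted failure.
flag_pdisks() {
    awk -F'|' '
        {
            id = $1; status = $2; predicted = $6
            # strip the padding around each column value
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            gsub(/^[ \t]+|[ \t]+$/, "", predicted)
            if (id == "ID" || id == "") next      # skip header and blank lines
            if (status != "Ok" || predicted == "Yes") print id
        }'
}

flag_pdisks <<'EOF'
ID | Status | Name | State | Power Status| Failure Predicted| Capacity | Hot Spare| Vendor ID
0:0:8 | Ok | Physical Disk 0:0:8 | Online| Spun Up | No | 558.38 GB (599550590976 bytes)| No | DELL(tm)
0:0:9 | Critical | Physical Disk 0:0:9 | Online| Spun Up | Yes | 558.38 GB (599550590976 bytes)| No | DELL(tm)
0:0:10| Ok | Physical Disk 0:0:10| Online| Spun Up | No | 558.38 GB (599550590976 bytes)| No | DELL(tm)
EOF
# prints: 0:0:9
```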

After checking all of the above, we can go ahead and replace the bad drive.
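
The per-date counts from steps 4 and 5 can also be totalled per disk so that the worst offender stands out. This is a rough sketch that assumes the `uniq -c` layout shown in those steps, where the count is the first field and the disk identifier follows the word PD. total_by_disk is a hypothetical helper name.

```shell
#!/bin/sh
# total_by_disk: read `uniq -c` style lines on stdin and print the total count
# per physical disk, highest first. The disk is the field right after "PD".
total_by_disk() {
    awk '{
        for (i = 2; i < NF; i++)
            if ($i == "PD") total[$(i + 1)] += $1
    }
    END { for (pd in total) print total[pd], pd }' | sort -rn
}

# Sample lines copied from the step 4 output.
total_by_disk <<'EOF'
8 03/23/16 PD 0a(e0x20/s10) 3/11/00
8 04/06/16 PD 09(e0x20/s9) 3/11/00
167 04/07/16 PD 09(e0x20/s9) 3/11/00
7 04/27/16 PD 0a(e0x20/s10) 3/11/00
11 05/01/16 PD 0a(e0x20/s10) 3/11/00
EOF
# prints:
# 175 09(e0x20/s9)
# 26 0a(e0x20/s10)
```

Here slot 9 accounts for 175 of the 3/11 errors, which matches the disk that omreport marks as Critical.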
