Alert Stating VD Degraded | VD nonCritical | Array Disk Critical
Issue:
Virtual Disk # State: /dev/sd#: Degraded
Virtual Disk # Status: /dev/sd#: nonCritical
Array Disk # Status: Critical
Array Disk # Status: nonCritical
Cause:
The drive is beginning to fail, and a dial home was sent out before an actual drive failure occurred. This can happen for a number of reasons, such as an accumulation of media errors on the drive.
Solution:
SSH onto the affected host; the alert identifies it. For example, "Component: sdw2 : sdw2" means the affected host is sdw2.
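A minimal example, assuming the affected host is sdw2 and that you are already logged in as root on mdw with ssh access to the segment hosts:
ssh sdw2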
Create a working directory for the logs and change into it:
mkdir -p /data1/LCollect/Disk_Logs
cd /data1/LCollect/Disk_Logs
Run the following commands to collect the logs:
1. omreport storage vdisk controller=0
2. omreport storage vdisk controller=0 -fmt tbl | awk -F"|" '{print $3"|"$2"|"$4"|"$6"|"$7"|"$11"|"$12"|"$13"|"$15}' | egrep -v '\-\-\-\-|\|\|'
3. omreport storage pdisk controller=0
4. omreport storage pdisk controller=0 -fmt tbl | awk -F"|" '{print $1"|"$2"|"$3"|"$4"|"$5"|"$9"|"$15"|"$18"|"$19}' | egrep -v '\-\-\-\-|\|\|'
5. omreport system esmlog
6. omreport system alertlog
7. omconfig storage controller action=exportlog controller=0 (the output file will be in the /var/log/ directory with a file name of the form "lsi_MMDD.log", where MM is the month and DD is the day)
8. cp /var/log/lsi_0407.log /data1/LCollect/Disk_Logs
9. tar czvf Messages.tgz /var/log/messages*
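Note that the omreport output in steps 1-6 goes to the terminal only. If you want it included in the bundle, one option (not part of the original procedure; the file names below are just examples) is to redirect each command into the collection directory, and to list /var/log first to confirm the name of the exported controller log before copying it in step 8:
omreport storage vdisk controller=0 > vdisk.txt
omreport storage pdisk controller=0 > pdisk.txt
omreport system esmlog > esmlog.txt
omreport system alertlog > alertlog.txt
ls -ltr /var/log/lsi_*.log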
How to package the log files for uploading:
1. cd (this will change the directory to the home directory for the root user, which is /root)
2. tar czvf Disk_Logs.tgz /data1/LCollect/*
(tar strips the leading "/", so the archive extracts as data1/LCollect/... in the analysis steps below.)
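Optionally, before copying the bundle off the node, list its contents to confirm that the controller log and Messages.tgz made it into the archive:
tar tzvf Disk_Logs.tgz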
Copy the tar bundle to mdw using 'scp', then log out of the node:
scp Disk_Logs.tgz mdw:/tmp
Using WinSCP or any scp/ftp client, connect to mdw and log in with root credentials. In the WinSCP GUI, change directory to /tmp and copy the Disk_Logs.tgz file to the local workstation. Attach the file to the Service Request.
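If a command-line client is available on the workstation instead of WinSCP, the same copy can be done with scp (assuming root ssh access to mdw is permitted from the workstation):
scp root@mdw:/tmp/Disk_Logs.tgz .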
How do I analyze the logs?
Untar the log files and change the directory:
tar zxvf Disk_Logs.tgz
cd data1/LCollect/Disk_Logs
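The system log files referenced in step 3 below are inside Messages.tgz, so extract that archive as well before grepping. Depending on how tar stored the paths, the files may land under a var/log/ subdirectory rather than the current directory (an assumption about the archive layout):
tar zxvf Messages.tgz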
Check the logs to determine which drive is faulty:
1. omreport storage pdisk controller=0 -fmt tbl | awk -F"|" '{print $1"|"$2"|"$3"|"$4"|"$5"|"$9"|"$15"|"$18"|"$19}' | egrep -v '\-\-\-\-|\|\|'
ID | Status | Name | State | Power Status| Failure Predicted| Capacity | Hot Spare| Vendor ID
0:0:8 | Ok | Physical Disk 0:0:8 | Online| Spun Up | No | 558.38 GB (599550590976 bytes)| No | DELL(tm)
0:0:9 | Critical | Physical Disk 0:0:9 | Online| Spun Up | Yes | 558.38 GB (599550590976 bytes)| No | DELL(tm)
0:0:10| Ok | Physical Disk 0:0:10| Online| Spun Up | No | 558.38 GB (599550590976 bytes)| No | DELL(tm)
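When many drives are listed, a quick way to pull out only the suspect rows (a convenience filter, not part of the original procedure) is to append a grep for the Critical status to the same command:
omreport storage pdisk controller=0 -fmt tbl | awk -F"|" '{print $1"|"$2"|"$3"|"$4"|"$5"|"$9"|"$15"|"$18"|"$19}' | egrep -v '\-\-\-\-|\|\|' | grep -i critical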
2. omreport storage vdisk controller=0 -fmt tbl | awk -F"|" '{print $3"|"$2"|"$4"|"$6"|"$7"|"$11"|"$12"|"$13"|"$15}' | egrep -v '\-\-\-\-|\|\|'
Name | Status | State| Layout| Size | Read Policy | Write Policy | Cache Policy | Disk Cache Policy
Virtual Disk 0| Ok | Ready| RAID-5| 48.01 GB (51548651520 bytes) | Adaptive Read Ahead| Write Back | Not Applicable| Disabled
Virtual Disk 1| Ok | Ready| RAID-5| 2,743.86 GB (2946202337280 bytes)| Adaptive Read Ahead| Write Back | Not Applicable| Disabled
Virtual Disk 2| Degraded| Ready| RAID-5| 48.01 GB (51548651520 bytes) | Adaptive Read Ahead| Write Back | Not Applicable| Disabled
Virtual Disk 3| Degraded| Ready| RAID-5| 2,743.86 GB (2946202337280 bytes)| Adaptive Read Ahead| Write Back | Not Applicable| Disabled
In the output above, also check for Write Policy / Cache Policy changes.
3. grep 'Unexpected sense.*Sense key: 3 Sense code: 11' messages* | awk '{print $1,$2,$(NF-4)}' | sort | uniq -c
2 messages.1:Apr 2 0:5:10
45 messages.4:Apr 6 0:2:9
2 messages.4:Apr 27 0:0:9
4 messages:Apr 24 0:0:10
4. grep 'Unexpected sense.*Sense: 3\/11\/00' lsi_0501.log | awk '{print $1,$7,$8,$(NF)}' | sort|uniq -c
8 03/23/16 PD 0a(e0x20/s10) 3/11/00
8 04/06/16 PD 09(e0x20/s9) 3/11/00
167 04/07/16 PD 09(e0x20/s9) 3/11/00 <-- the 3/11 error has occurred 167 times on this drive
7 04/27/16 PD 0a(e0x20/s10) 3/11/00
11 05/01/16 PD 0a(e0x20/s10) 3/11/00
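Sense key 3 is a medium error and code 0x11 is an unrecovered read error. To total the 3/11 counts per drive across all dates in the controller log, the same grep can be grouped on the drive field alone (a convenience variant of the command above):
grep 'Unexpected sense.*Sense: 3\/11\/00' lsi_0501.log | awk '{print $8}' | sort | uniq -c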
5. grep Corrected lsi_0501.log | awk '{print $1,$5,$6,$7,$11,$(NF-2)}'| sort | uniq -c
3 03/23/16 110=Corrected medium error PD 0a(e0x20/s10)
158 04/06/16 110=Corrected medium error PD 09(e0x20/s9) <-- the corrected errors have climbed to 158, which is not a good sign.
3 04/27/16 110=Corrected medium error PD 0a(e0x20/s10)
9 05/01/16 110=Corrected medium error PD 0a(e0x20/s10)
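Similarly, to total the corrected medium errors per drive across the whole log (a convenience variant, assuming the drive identifier is the third field from the end, as in the command above):
grep Corrected lsi_0501.log | awk '{print $(NF-2)}' | sort | uniq -c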
6. From the pdisk output we need to look for the drive marked Critical:
0:0:9 | Critical | Physical Disk 0:0:9 | Online| Spun Up | Yes | 558.38 GB (599550590976 bytes)| No | DELL(tm)
ID : 0:0:9
Status : Critical
Name : Physical Disk 0:0:9
State : Online
After checking all of the above, we are good to replace the bad drive.
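After the replacement, the same omreport commands used earlier can be re-run to confirm recovery; based on the states shown above, the expectation is that the physical disk returns to Ok/Online once the rebuild completes and the virtual disks go back from Degraded to Ok:
omreport storage pdisk controller=0
omreport storage vdisk controller=0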