Chapter 24

Troubleshooting Solaris Volume Manager

This chapter describes how to troubleshoot problems related to Solaris Volume Manager. It provides both general troubleshooting guidelines and specific procedures for resolving some known problems. It is not intended to be all-inclusive, but rather to present common scenarios and recovery procedures.

Troubleshooting Solaris Volume Manager (Task Map)

The following task map identifies some procedures needed to troubleshoot Solaris Volume Manager.

| Task | Description | Instructions |
|------|-------------|--------------|
| Replace a failed disk | Replace a disk, then update state database replicas and logical volumes on the new disk. | How to Replace a Failed Disk |
| Recover from disk movement problems | Restore disks to original locations or contact product support. | Recovering from Disk Movement Problems |
| Recover from improper /etc/vfstab entries | Use the fsck command on the mirror, then edit the /etc/vfstab file so the system will boot correctly. | How to Recover From Improper /etc/vfstab Entries |
| Recover from a boot device failure | Boot from a different submirror. | How to Recover From a Boot Device Failure |
| Recover from insufficient state database replicas | Delete unavailable replicas by using the metadb command. | How to Recover From Insufficient State Database Replicas |
| Recover configuration data for a lost soft partition | Use the metarecover command to recover configuration data for soft partitions. | How to Recover Configuration Data for a Soft Partition |
| Recover a Solaris Volume Manager configuration from salvaged disks | Attach disks to a new system and have Solaris Volume Manager rebuild the configuration from the existing state database replicas. | How to Recover a Configuration |

Overview of Troubleshooting the System

Prerequisites for Troubleshooting the System

To troubleshoot storage management problems related to Solaris Volume Manager, you need to do the following:

  • Have root privilege

  • Have a current backup of all data

General Guidelines for Troubleshooting Solaris Volume Manager

You should have the following information on hand when you troubleshoot Solaris Volume Manager problems:

  • Output from the metadb command.

  • Output from the metastat command.

  • Output from the metastat -p command.

  • Backup copy of the /etc/vfstab file.

  • Backup copy of the /etc/lvm/mddb.cf file.

  • Disk partition information, from the prtvtoc command (SPARC® systems) or the fdisk command (x86-based systems).

  • Solaris version.

  • Solaris patches installed.

  • Solaris Volume Manager patches installed.


Tip - Any time you update your Solaris Volume Manager configuration, or make other storage or operating environment-related changes to your system, generate fresh copies of this configuration information. You could also generate this information automatically with a cron job.
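
One way to automate this is a small script run from cron. The following is a minimal sketch, not a supported tool. It assumes a POSIX-compatible shell (for example ksh or /usr/xpg4/bin/sh), and the function names, output file names, and snapshot directory are illustrative choices rather than anything Solaris Volume Manager defines. Disk partition information from prtvtoc needs a device name for each disk and is left out of the sketch.

```shell
# capture OUTFILE CMD [ARGS...]
# Save the output of a command, or a note if the command is missing or fails.
capture() {
    capture_out=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$capture_out" 2>&1 || echo "command failed: $*" >> "$capture_out"
    else
        echo "command not found: $1" > "$capture_out"
    fi
}

# svm_snapshot DIR
# Collect the configuration information listed above into DIR.
svm_snapshot() {
    snap_dir=$1
    mkdir -p "$snap_dir" || return 1
    capture "$snap_dir/metadb.out" metadb
    capture "$snap_dir/metastat.out" metastat
    capture "$snap_dir/metastat-p.out" metastat -p
    capture "$snap_dir/uname.out" uname -a
    # Keep copies of the files that describe the boot-time configuration.
    for f in /etc/vfstab /etc/lvm/mddb.cf; do
        if [ -f "$f" ]; then
            cp "$f" "$snap_dir/" || echo "could not copy $f" >&2
        fi
    done
    return 0
}
```

A script that ends with a call such as `svm_snapshot /var/tmp/svm-config/$(date +%Y%m%d)` could then be scheduled with a crontab entry such as `0 2 * * * /usr/local/bin/svm-snapshot.sh`. Both paths are hypothetical.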


General Troubleshooting Approach

Although there is no one procedure that will enable you to evaluate all problems with Solaris Volume Manager, the following process provides one general approach that might help.

  1. Gather information about the current configuration.

  2. Look at the current status indicators, including the output from the metastat and metadb commands. There should be information here that indicates which component is faulty.

  3. Check the hardware for obvious points of failure. (Is everything connected properly? Was there a recent electrical outage? Have you recently added or changed equipment?)
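
The status check in step 2 can be partly scripted. The sketch below filters saved command output rather than calling the commands itself. It assumes that uppercase letters in the metadb flags field mark replica errors (as the metadb -i legend describes), that the device name is the last column of each replica line, and that metastat reports a failed component with the words "Needs maintenance". The function names are illustrative.

```shell
# replica_summary: read metadb output on stdin.
# Prints one "bad DEVICE COUNT" line per slice whose replicas carry error
# flags, then a final "good TOTAL" line counting replicas without error flags.
replica_summary() {
    awk '
        $NF ~ /^\/dev\// {
            # Everything before the last three columns is the flags field.
            flags = ""
            for (i = 1; i <= NF - 3; i++) flags = flags $i
            if (flags ~ /[A-Z]/) bad[$NF]++
            else good++
        }
        END {
            for (dev in bad) print "bad", dev, bad[dev]
            print "good", good + 0
        }'
}

# needs_maintenance: read metastat output on stdin and print each line that
# reports the "Needs maintenance" state. Prints nothing if there are none.
needs_maintenance() {
    grep 'Needs maintenance' || true
}
```

Typical use would be `metadb | replica_summary` and `metastat | needs_maintenance`. A slice reported as bad, or any line printed by the second filter, points to the component to examine first.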

Replacing Disks

This section describes how to replace disks in a Solaris Volume Manager environment.


Caution - If you have soft partitions on a failed disk or on volumes built on a failed disk, you must put the new disk in the same physical location, with the same c*t*d* number as the disk it replaces.


How to Replace a Failed Disk

  1. Identify the failed disk to be replaced by examining the /var/adm/messages file and the metastat command output.

  2. Locate any state database replicas that might have been placed on the failed disk.

    Use the metadb command to find the replicas.

    The metadb command might report errors for the state database replicas located on the failed disk. In this example, c0t1d0 is the problem device.

    # metadb
       flags       first blk        block count
      a m     u        16               1034            /dev/dsk/c0t0d0s4
      a       u        1050             1034            /dev/dsk/c0t0d0s4
      a       u        2084             1034            /dev/dsk/c0t0d0s4
      W   pc luo       16               1034            /dev/dsk/c0t1d0s4
      W   pc luo       1050             1034            /dev/dsk/c0t1d0s4
      W   pc luo       2084             1034            /dev/dsk/c0t1d0s4

    The output shows three state database replicas on slice 4 of each local disk, c0t0d0 and c0t1d0. The W in the flags field of the c0t1d0s4 slice indicates that the device has write errors. The three replicas on the c0t0d0s4 slice are still good.

  3. Record the slice name where the state database replicas reside and the number of state database replicas, then delete the state database replicas.

    The number of state database replicas is obtained by counting the number of appearances of a slice in the metadb command output. In this example, the three state database replicas that exist on c0t1d0s4 are deleted.

    # metadb -d c0t1d0s4


    Caution - If, after deleting the bad state database replicas, you are left with three or fewer, add more state database replicas before continuing. This will help ensure that configuration information remains intact.


  4. Locate and delete any hot spares on the failed disk. Use the metastat command to find hot spares. In this example, hot spare pool hsp000 included c0t1d0s6, which is then deleted from the pool.

    # metahs -d hsp000 c0t1d0s6
    hsp000: Hotspare is deleted

  5. Physically replace the failed disk.

  6. Logically replace the failed disk using the devfsadm command, luxadm command, or other commands as appropriate for your hardware and environment.

  7. Update the Solaris Volume Manager state database with the device ID for the new disk using the metadevadm -u cntndn command.

    In this example, the new disk is c0t1d0.

    # metadevadm -u c0t1d0

  8. Repartition the new disk.

    Use the format command or the fmthard command to partition the disk with the same slice information as the failed disk. If you have the prtvtoc output from the failed disk, you can format the replacement disk with the fmthard -s /tmp/failed-disk-prtvtoc-output command, giving the raw device of the replacement disk as the final argument.

  9. If you deleted state database replicas, add the same number back to the appropriate slice.

    In this example, /dev/dsk/c0t1d0s4 is used.

    # metadb -a -c 3 c0t1d0s4

  10. If any slices on the disk are components of RAID 5 volumes or are components of RAID 0 volumes that are in turn submirrors of RAID 1 volumes, run the metareplace -e command for each slice.

    In this example, /dev/dsk/c0t1d0s4 and mirror d10 are used.

    # metareplace -e d10 c0t1d0s4

  11. If any soft partitions are built directly on slices on the replaced disk, run the metarecover -d -p command on each slice containing soft partitions to regenerate the extent headers on disk.

    In this example, the soft partition markings on /dev/dsk/c0t1d0s4 need to be regenerated. The slice is scanned and the markings are reapplied, based on the information in the state database replicas.

    # metarecover c0t1d0s4 -d -p 

  12. If any soft partitions on the disk are components of RAID 5 volumes or are components of RAID 0 volumes that are submirrors of RAID 1 volumes, run the metareplace -e command for each slice.

    In this example, /dev/dsk/c0t1d0s4 and mirror d10 are used.

    # metareplace -e d10 c0t1d0s4

  13. If any RAID 0 volumes have soft partitions built on them, run the metarecover command for each RAID 0 volume.

    In this example, RAID 0 volume d17 has soft partitions built on it.

    # metarecover d17 -m -p

  14. Replace hot spares that were deleted, and add them to the appropriate hot spare pool or pools.

    # metahs -a hsp000 c0t1d0s6
    hsp000: Hotspare is added

  15. If soft partitions or non-redundant volumes were affected by the failure, restore data from backups. If only redundant volumes were affected, then validate your data.

    Check the user/application data on all volumes. You might have to run an application-level consistency checker or use some other method to check the data.
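
The command sequence in this procedure can be collected into a dry-run helper that only prints the commands for review, so that nothing changes until an administrator runs them by hand. This is a sketch under assumptions: the function name and argument order are invented for illustration, it assumes one replica slice, one hot spare slice, and one submirror slice on the disk, and it leaves out the soft partition (metarecover) steps.

```shell
# print_replace_plan DISK REPLICA_SLICE REPLICA_COUNT HSP HS_SLICE MIRROR COMPONENT
# Prints, in order, commands taken from the procedure above. Nothing is executed.
print_replace_plan() {
    if [ "$#" -ne 7 ]; then
        echo "usage: print_replace_plan disk replica_slice count hsp hs_slice mirror component" >&2
        return 1
    fi
    disk=$1 rslice=$2 rcount=$3 hsp=$4 hslice=$5 mirror=$6 comp=$7
    cat <<EOF
# Before swapping the disk
metadb -d $rslice
metahs -d $hsp $hslice
# After the physical swap, once the system sees the new disk
metadevadm -u $disk
# After repartitioning the new disk
metadb -a -c $rcount $rslice
metareplace -e $mirror $comp
metahs -a $hsp $hslice
EOF
}
```

With the names used in this procedure, the call would be `print_replace_plan c0t1d0 c0t1d0s4 3 hsp000 c0t1d0s6 d10 c0t1d0s4`. Printing rather than executing keeps the physical steps (swapping and repartitioning the disk) under manual control.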
