Monday, February 06, 2017

Unmounting a volume with open file handles to deleted files without killing the process

Recently we've been turning on compression in our Oracle databases because it saves us a ton of disk space and actually improves the performance of the databases.  Among other things, the process involves writing the newly compressed tablespaces to new data files (the systems in question are not using ASM).  Because we're moving the DBs from where they are, we get the opportunity to do some housecleaning before moving them back, such as unmounting, checking and possibly resizing the existing filesystems.

Occasionally, after the data has been migrated away and the old data files deleted, Oracle still has open handles to the deleted files.

So the DBA tells me the filesystem is clear and I can have it, but then this happens:

[root@kwt-r3oql00 E1Q]# umount /dev/mapper/vg_kwt_r3oql20_s00-oracle_E1Q_sapdata7
umount: /oracle/E1Q/sapdata7: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))

And when we go to see who the culprit is:

[root@kwt-r3oql00 E1Q]# lsof /oracle/E1Q/sapdata7
COMMAND     PID   USER   FD   TYPE DEVICE    SIZE/OFF     NODE NAME
oracle_13 13214 oracle  520u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_13 13254 oracle  519u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_13 13258 oracle  520u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_13 13286 oracle  517u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_13 13294 oracle  520u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_13 13526 oracle  506u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_13 13996 oracle  512u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_14 14424 oracle  493u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_14 14488 oracle  515u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_14 14574 oracle  403u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_14 14712 oracle  506u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_23 23296 oracle  273u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)
oracle_24 24260 oracle  277u   REG 253,17 10485768192 17317890 /oracle/E1Q/sapdata7/sr3usr_1/sr3usr.data1 (deleted)


This is super annoying because the DB seems to have forgotten about those open handles and has no internal mechanism for gracefully releasing them.  If you can tolerate the outage, restarting the DB will release those handles, but if you can't, there is another solution.
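If you want to see this behaviour for yourself without touching a database, here is a minimal sketch that reproduces the "(deleted)" state in a plain bash shell.  It assumes Linux with /proc and bash 4.1 or newer; the names demo_file, demo_fd and demo_target are made up for the illustration.

```shell
#!/bin/bash
# Demo: a deleted file lives on for as long as a descriptor to it stays open.
# demo_file is a throwaway temp file, nothing to do with the Oracle data files.
demo_file=$(mktemp)

# Open the file on a descriptor number chosen by bash (needs bash 4.1+)
exec {demo_fd}>"$demo_file"

# Unlink it: the directory entry is gone, but the open descriptor keeps the inode alive
rm -f "$demo_file"

# The /proc symlink still shows the old path, now with " (deleted)" appended
demo_target=$(readlink "/proc/$BASHPID/fd/$demo_fd")
echo "fd $demo_fd -> $demo_target"

# Closing the descriptor is what finally lets the kernel release the file
exec {demo_fd}>&-
```

Until that last line runs, the filesystem holding the file counts as busy, which is exactly the state the Oracle processes above are stuck in.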

The GNU debugger (gdb) will allow you to attach to the process and close the file handles.  It's not without risk, but I've been using it successfully for years.  The main drawback is how tedious it is to sift through /proc looking for the open file handles and closing them one by one with the debugger interactively.
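To give an idea of what the manual approach looks like, here is a rough sketch, assuming Linux with /proc.  The helper list_deleted_fds is a name I made up for this illustration; it reads the /proc/PID/fd symlinks directly, and the gdb session in the comments uses a PID and descriptor from the lsof output above.

```shell
#!/bin/bash
# list_deleted_fds PID DIR
# Hypothetical helper: print the descriptor numbers in process PID that point
# at deleted files under DIR, read straight from the /proc/PID/fd symlinks.
list_deleted_fds() {
  local pid=$1 dir=${2%/} link target
  for link in /proc/"$pid"/fd/*; do
    # Skip entries we cannot read (no permission, or the fd closed meanwhile)
    target=$(readlink "$link" 2>/dev/null) || continue
    case $target in
      "$dir"/*" (deleted)") echo "${link##*/}" ;;
    esac
  done
}

# Manual use, one process and one descriptor at a time:
#   list_deleted_fds 13214 /oracle/E1Q/sapdata7
#   gdb -p 13214
#   (gdb) call (int)close(520)
#   (gdb) detach
#   (gdb) quit
```

Repeat that for every PID and every descriptor and you can see why it gets old quickly.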

Laziness being the father of invention, I wrote the following script to do this for you.  Simply provide it the path to check for open deleted file handles, and it will find the processes, attach the debugger and close the handles.


#!/bin/bash

FS=$1

if [[ -z $FS ]]; then
   echo "Please provide a filesystem path to check for open file descriptors to deleted files"
   exit 1
fi

echo "WARNING: This is super-dangerous.  Please don't use it in Prod without a "
echo "         really good reason and a change request/blackout"
read -p "Type 'C' and Enter to continue, anything else to abort: " CHECK

if [[ "$CHECK" != "C" ]]; then
   echo "Aborted by user.  No changes made."
   exit 0
fi

# Get a list of processes with open handles to deleted files under the given path
PIDLIST=$(lsof "$FS" | awk 'NR > 1 && /\(deleted\)$/ {print $2}' | sort -u)


for PID in $PIDLIST; do

  unset DESCLIST

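  # Get the file descriptors for that PID that refer to deleted files under $FS
  # (assumes $FS is the mount point, as in the lsof example above)
  DESCLIST=$(ls -l /proc/${PID}/fd | awk -v fs="${FS%/}/" '/\(deleted\)$/ && index($11, fs) == 1 {print $9}')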

  # Nothing to close for this process, so leave it alone
  if [[ -z $DESCLIST ]]; then continue; fi

  # Display the list
  echo "The Process $PID has open file descriptors for deleted files:"
  echo "${DESCLIST}" | sed 's/^/  /g'

  # Create a private temporary file for the gdb commands (a fixed name in
  # /tmp is predictable, which is risky for a script that runs as root)
  DEBUGSCRIPT=$(mktemp /tmp/gdbclose.XXXXXX) || exit 1

  # Write a close command for each deleted file descriptor; the (int) cast
  # lets gdb call close() even when it has no debug info for libc
  for DFD in $DESCLIST; do
     echo "p (int)close(${DFD})" >> $DEBUGSCRIPT
  done

  # Detach and close the debugger
  echo "detach" >> $DEBUGSCRIPT
  echo "quit" >> $DEBUGSCRIPT

  echo "Forcibly closing handles for deleted files on process $PID"

  # Run the debugger in batch mode to execute the script and close the handles
  /usr/bin/gdb --pid $PID --batch -x $DEBUGSCRIPT

  # Wait before deleting the script
  sleep 1

  # Clean up the script
  /bin/rm $DEBUGSCRIPT

done
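
Once the script has run, it's worth confirming that nothing is left before retrying the umount.  Here is a minimal sketch of such a check, again assuming Linux with /proc; remaining_deleted_handles is a hypothetical helper name, and it needs to run as root to see every process.  It forks one readlink per descriptor, so expect it to take a moment on a busy box.

```shell
#!/bin/bash
# remaining_deleted_handles DIR
# Hypothetical post-check: print "PID FD" for every descriptor on the system
# that still points at a deleted file under DIR.
remaining_deleted_handles() {
  local dir=${1%/} link target pid
  for link in /proc/[0-9]*/fd/*; do
    # Processes and descriptors can vanish mid-scan, so ignore read failures
    target=$(readlink "$link" 2>/dev/null) || continue
    [[ $target == "$dir"/*" (deleted)" ]] || continue
    pid=${link#/proc/}
    echo "${pid%%/*} ${link##*/}"
  done
}

# Example: only unmount once nothing is left
#   if [[ -z $(remaining_deleted_handles /oracle/E1Q/sapdata7) ]]; then
#     umount /oracle/E1Q/sapdata7
#   fi
```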