[nsd-users] Stale lock file (an nsdc problem)

Shane Kerr shane at ca.afilias.info
Wed Oct 15 10:06:29 UTC 2008


Hello,

We have a stale lock file that is preventing nsdc from running. From the
log file our cron job produces:

Wed Oct 15 05:40:01 UTC 2008
 5:40AM  up 16:27, 0 users, load averages: 6.33, 8.02, 7.31
ns0a    25383         1      8175
/opt/prod/nsd/sbin/nsdc: line 138: /opt/nshome/ns0a/var/nsd.db.lock: cannot overwrite existing file
database locked by PID: 78717
aborting...
ns0a    25383         1      8175

Wed Oct 15 08:40:00 UTC 2008
 8:40AM  up 19:27, 0 users, load averages: 9.36, 5.78, 5.18
ns0a    25383         1      8175
/opt/prod/nsd/sbin/nsdc: line 138: /opt/nshome/ns0a/var/nsd.db.lock: cannot overwrite existing file
database locked by PID: 78717
aborting...
ns0a    25383         1      8175

This lock file does exist, and does point to process 78717:

[root@app7 /opt/nshome/ns0a/var]# ls -l
total 1639596
-rw-r--r--  1 ns0a  ns0a  601899131 Oct 15 09:09 ixfr.db
-rw-r--r--  1 root  ns0a  426287130 Oct 14 06:01 nsd.db
-rw-r--r--  1 root  ns0a          0 Oct 14 08:49 nsd.db.78717
-r--r--r--  1 root  ns0a         30 Oct 14 08:49 nsd.db.lock
-rw-r--r--  1 ns0a  ns0a          6 Oct 14 20:31 nsd.pid
-rw-r--r--  1 root  ns0a       1079 Jul 23 02:50 org.afilias-nst.info.zone
-rw-r--r--  1 root  ns0a       1075 Jul 23 02:50 org.afilias-nst.org.zone
-rw-r--r--  1 root  ns0a  649832393 Oct 14 08:49 org.zone
-rw-r--r--  1 ns0a  ns0a       2414 Sep  5 00:21 xfrd.state
[root@app7 /opt/nshome/ns0a/var]# cat nsd.db.lock
database locked by PID: 78717

But the process is not running:

[root@app7 /opt/nshome/ns0a/var]# ps ax | grep 78717
78079  p0  S+     0:00.00 grep 78717
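
The same check can be scripted without grepping ps output. The helper
below is only a sketch: the name lock_is_stale is made up for this
message, and it assumes the lock file's first line ends with the PID
(true both for "database locked by PID: 78717" and for a bare PID). It
should be run as root or as the user that owns the locking process,
because kill -0 also fails when we lack permission to signal a live
process.

```shell
#!/bin/sh
# Hypothetical helper, not part of nsdc.
# Returns 0 if the PID recorded in the lock file is not running (stale),
#         1 if that process is still alive,
#         2 if no numeric PID could be read from the file.
lock_is_stale() {
        stale_lockfile="$1"

        # the PID is assumed to be the last field of the first line
        stale_pid=`awk 'NR == 1 { print $NF }' "$stale_lockfile" 2>/dev/null`

        # reject empty or non-numeric values
        case "$stale_pid" in
                ''|*[!0-9]*) return 2 ;;
        esac

        # signal 0 sends nothing; it only tests whether we could signal the PID
        if kill -0 "$stale_pid" 2>/dev/null; then
                return 1
        fi
        return 0
}
```

For example, "lock_is_stale /opt/nshome/ns0a/var/nsd.db.lock && echo stale"
would print "stale" in the situation described above.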


As with the signal() case reported a few months ago, nsdc.sh needs a bit
of love. The lock() function needs to be improved so that it handles
stale locks. Something like this would probably work (and is even
NFS-safe), but it requires that everything that writes to the lock file
use the bare PID, and not "database locked by PID: $$", as the contents.

lock() {
        # create a temporary file named after our PID, holding only the PID
        TEMPFILE="${dbfile}.$$"
        # use { ...; } rather than ( ... ) so that exit leaves the script,
        # not just a subshell
        echo $$ > "$TEMPFILE" || { echo "error creating temporary file, aborting..."; exit 1; }

        # try to lock using this file (ln fails if the lock already exists)
        if ln "$TEMPFILE" "${lockfile}" 2>/dev/null; then
                rm -f "$TEMPFILE"
                return
        fi

        # if that did not work, see if the locking process exists
        PID=`cat "${lockfile}"`
        if kill -0 "$PID" 2>/dev/null; then
                rm -f "$TEMPFILE"
                echo "database locked by PID: $PID"
                exit 1
        fi

        # if the locking process does not exist, consider the lock stale
        echo "removing stale lockfile"
        rm -f "${lockfile}"

        # lock the database
        if ! ln "$TEMPFILE" "${lockfile}" 2>/dev/null; then
                rm -f "$TEMPFILE"
                echo "unable to lock database"
                exit 1
        fi

        # the lock is now held through the hard link; drop the temporary name
        rm -f "$TEMPFILE"
}
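
Once the lock file holds a bare PID, releasing it can be made safer as
well. The following is a sketch of a possible counterpart, not what
nsdc.sh currently does: the unlock() shown here is hypothetical and
assumes ${lockfile} is set as in lock() above. It removes the lock only
when the recorded PID is our own, so it never deletes a lock that
another process has taken in the meantime.

```shell
#!/bin/sh
# Hypothetical counterpart to lock(); assumes ${lockfile} is set and
# contains a bare PID, as proposed above.
unlock() {
        # nothing to do if the lock file is missing or unreadable
        LOCK_OWNER=`cat "${lockfile}" 2>/dev/null` || return 0

        # only remove the lock if it records our own PID
        if [ "$LOCK_OWNER" = "$$" ]; then
                rm -f "${lockfile}"
        fi
}
```

A script could register this with "trap unlock EXIT" after a successful
lock(), which cleans up on normal exit. SIGKILL and crashes cannot be
trapped, so the stale lock handling in lock() is still needed.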


Bad things happen to good processes. :)

Cheers,

--
Shane



