For a large Cacti network-monitoring installation on a FreeBSD machine, we found that although the SNMP polling itself went relatively quickly, updating the RRD data files with the collected data sometimes took longer than the allotted five-minute polling interval. This meant that the overrunning job collided with the next one, delaying that one too.
If some unrelated process (say, a network backup) started competing for resources, it wasn't uncommon for this to turn into a cascading meltdown. I've checked the system in the morning and found dozens of polling jobs running at once, and the only solution was to kill everything, let the system settle down, accept the hours of lost data, and let things pick back up with a clean slate.
On the premise that missing a single polling period is better than swamping the machine, I created a tool, lockrun, which wraps a given command line (such as a cron job) with the protection of a lockfile.
If a new polling period comes around while the lockfile is still held by the previous job, lockrun exits with an error message, which is routed to the user via cron's normal email mechanism. This way, two of these jobs will never run at the same time.
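As a concrete illustration, a cron entry wrapped this way might look like the following. The paths and the exact lockrun flag syntax here are assumptions for illustration, not quoted from the tool's documentation.

```
# Every five minutes, run the Cacti poller under lockrun's protection.
# If the previous run still holds the lock, this run exits immediately
# and cron mails the error message to the user.
*/5 * * * *  /usr/local/bin/lockrun --lockfile=/var/run/poller.lockrun -- /usr/local/bin/php /usr/local/cacti/poller.php
```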
This is a bit more sophisticated than just touching a lockfile or storing a PID in a file: it uses actual file locking, and the lock is automatically released when the program exits for any reason, including a kill -9, a core dump, or even a system crash. There are no stale files to clean up at system boot time, either.
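A minimal sketch of the kernel-enforced locking idea, using the standard flock() call. This is not lockrun's actual source; the function name and lockfile path are illustrative. The key property is that the lock lives with the open file descriptor, so the kernel drops it the instant the holding process dies, no matter how it dies.

```c
#include <fcntl.h>      /* open, O_CREAT, O_RDWR */
#include <sys/file.h>   /* flock, LOCK_EX, LOCK_NB */
#include <unistd.h>     /* close */

/* Try to take an exclusive, non-blocking lock on `path`.
 * Returns the open file descriptor on success, or -1 if the lock
 * is already held elsewhere.  Because the lock belongs to the open
 * descriptor, the kernel releases it automatically when the holder
 * exits for any reason (kill -9 included), so a stale lockfile on
 * disk never blocks a future run. */
int acquire_lock(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return -1;

    /* LOCK_NB makes this fail immediately rather than queuing up
     * behind a still-running previous job. */
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A wrapper built on this would acquire the lock, fork/exec the wrapped command while holding the descriptor open, and simply exit with an error if the acquire fails; no unlock step is strictly required on any exit path.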
I'll note that this is not really a solution to the problem: the real problem was an underpowered machine (which we have since remedied), and it doesn't replace a proper queuing mechanism. Instead, this is a fail-safe to prevent system meltdown, and it's really served us well.
I've been using this for months on all cron jobs which have any chance of running long, and it's been completely bulletproof as far as I can tell.
I hope it's useful to others.