[Nevis-linux] New condor job restrictions

William Seligman seligman at nevis.columbia.edu
Thu Feb 10 15:24:26 EST 2011


About six weeks ago, I sent out the attached message about changes to the Nevis
condor batch system. On the day I was going to make them, there were major
problems with the Nevis network. I decided that I wouldn't make any changes to
condor until all the 1Gb/s network issues were settled.

I would like to implement item 2 in the attached message this Monday,
15-Feb-2011: a 1GB memory limit for running jobs. Putting in
this change means I have to restart condor on all the nodes, which I prefer to
do when no jobs are running. My impression is that our condor users like to
submit jobs to run over the weekend, then look at the results on Monday. If I'm
wrong or you can suggest a better time, please let me know.
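
To see in advance whether a job is likely to hit the 1GB limit, one rough check is to run it once interactively and look at its peak memory use. The following is only a sketch in Python, under stated assumptions: this message does not say how condor will measure a job's memory, so peak resident set size is used here as an approximation, and the function names and the ./my_job command are hypothetical.

```python
import resource
import subprocess

# 1GB expressed in kilobytes; on Linux, ru_maxrss is reported in kilobytes.
LIMIT_KB = 1024 * 1024


def peak_child_rss_kb(command):
    """Run `command` to completion and return the peak resident set size.

    RUSAGE_CHILDREN covers every child this process has waited for, so the
    value is the largest peak among them, not only the most recent child.
    """
    subprocess.run(command, check=False)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


def over_limit(peak_kb, limit_kb=LIMIT_KB):
    """Return True if the measured peak is above the limit."""
    return peak_kb > limit_kb


# Example use (hypothetical job command):
#   peak = peak_child_rss_kb(["./my_job", "input.dat"])
#   print(peak, "KB; over 1GB:", over_limit(peak))
```

A job that comes out close to the limit in a check like this is worth a closer look before it is submitted in bulk.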


I will implement item 1 by editing the /etc/exports files on all the workgroup
servers so they no longer export /home or /data with write permission to the
batch nodes (hermesXX, kennelXX, xeniaXX).

I will send out the exact date of this change after I have installed some
additional disk storage for the Neutrino group. That installation will take a
week or two.

In the meantime, you may want to anticipate this change: inspect your condor
jobs; make sure your scripts do not directly write to /a/data/... or
/a/home/...; use condor's file-transfer mechanism instead.
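
As a starting point for that inspection, here is a minimal sketch in Python. It does two things: it lists any /a/data/... or /a/home/... paths found in a job script, and it builds a basic submit description that turns on condor's file transfer. The function names and the file names in the usage note are hypothetical; should_transfer_files, when_to_transfer_output, and transfer_input_files are standard condor submit commands, but the right values for your jobs may differ, so check the wiki page below.

```python
import re

# Automounted paths named in this message: /a/data/... and /a/home/...
AUTOMOUNT_PATTERN = re.compile(r"/a/(?:data|home)/\S*")


def find_automount_paths(script_text):
    """Return every /a/data/... or /a/home/... path found in a script's text."""
    return AUTOMOUNT_PATTERN.findall(script_text)


def make_submit_description(executable, input_files, queue_count=1):
    """Build the text of a submit file that uses file transfer, not NFS."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # Copy files to and from the batch node instead of writing over NFS.
        "should_transfer_files = YES",
        # Output files in the job's working directory come back when it exits.
        "when_to_transfer_output = ON_EXIT",
    ]
    if input_files:
        lines.append("transfer_input_files = " + ", ".join(input_files))
    lines += [
        "output = job-$(Process).out",
        "error = job-$(Process).err",
        "log = job-$(Cluster).log",
        f"queue {queue_count}",
    ]
    return "\n".join(lines) + "\n"
```

For example, find_automount_paths(open("run.sh").read()) lists the paths that need attention in a script called run.sh, and make_submit_description("run.sh", ["config.dat"], 300) returns text that can be saved to a file and passed to condor_submit. The job script itself should then write its output to the current directory.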


For more information, see

<http://www.nevis.columbia.edu/twiki/bin/view/Nevis/Condor>


On 12/16/10 12:37 PM, William Seligman wrote:
> To the users of the condor batch system at Nevis:
> 
> There's been a problem with the condor batch system: Someone can crash a server,
> or the entire cluster, by submitting a condor job. It's become a "rite of
> passage": as each new person learns about condor, they submit a job without
> thinking about resource management, and crash their home server.
> 
> We've identified two ways in which a condor job can crash a system. I propose to
> make changes to the condor and cluster configuration next week, on Wed
> 22-Dec-2010, to help prevent this. The changes are:
> 
> 
> *1. The condor batch nodes will no longer be able to write to the disks of other
> systems via NFS.*
> 
> This has been the most common cause of server crashes. You submit a job that
> writes to a directory via automount; e.g., /a/data/tanya/seligman. You're fine
> if only one job does that; if 300 jobs are doing simultaneous sustained writes
> to the same directory via NFS, the NFS server (tanya in this example) may
> experience kernel faults and a system crash.
> 
> With this change, an attempt by a condor job to write to an automount directory
> (one whose path begins with /a) will get the error "permission denied." The
> job's submitter will have to figure out how to use condor's file-transfer
> mechanism instead. This increases the learning curve of using condor, but it's
> better than crashing a server.
> 
> (It's also possible to slow down a server with a large number of sustained
> reads; I don't have a solution to this yet.)
> 
> 
> *2. A job will be aborted if it uses more than 1GB of memory.*
> 
> A couple of weeks ago, the entire Nevis Linux cluster ground to a halt. The
> cause was a condor job that used 2GB of RAM. The systems on the cluster are
> limited to 1GB RAM/processing queue; as each system was asked to use twice as
> much memory as it had, it began to swap pages continuously and did nothing else.
> 
> At present, the only group with a simulation that uses more than 1GB RAM per job
> is ATLAS. Therefore, this restriction only affects ATLAS simulation jobs that
> are submitted to the "upstairs" cluster instead of the Tier3 cluster.
> 
> It's hard to predict the future needs of the other Nevis groups. If another
> group begins to use more than 1GB RAM/job, it's better to find out about it by
> having those jobs aborted than by a cluster shutdown. If that ever happens, we
> might think about upgrading the RAM on the cluster systems.
> 
> 
> If you have any comments about these changes, please let me know. If I hear
> nothing, I'll go ahead.

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
