[Nevis-linux] Linux cluster problems

William Seligman seligman at nevis.columbia.edu
Mon Jan 14 09:57:06 EST 2008


Several problems occurred with the Linux cluster over the weekend.  They appear 
to be unrelated; it is an unfortunate coincidence that they all occurred at 
once.

In decreasing order of severity:

- riverside, the Neutrino group server, is down with a severe hardware problem; 
nothing appears on the screen when I turn it on.  This issue has the highest 
priority.  If I have no insight in the next 15 minutes, I'm going to restore the 
riverside:/home directory on some other box so the Neutrino group can get their 
e-mail and do other work.  This box is also the condor master server, so condor 
is down too.

- There is a problem with one of the switches in the computer room.  This 
problem caused the mail server to crash on Friday.  I've partially fixed this 
problem (the mail server is working), but the batch nodes are receiving no 
network traffic.  This issue has secondary priority.

- karthur, the library server and an ATLAS/D0 workgroup server, also crashed on 
Friday.  I rebooted it remotely, and it seems to be OK now.  I'll investigate 
this issue when I get the chance.

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                | http://www.nevis.columbia.edu/~seligman/
Irvington NY 10533 USA    | XDI: http://public.xdi.org/=william.seligman

