
Version 15 (modified by mnoethe, 7 years ago)

--

Trouble Shooting - Computers

Information about the Computers

You can find information about the Computers on La Palma on the computing page.

KVM Software

It can be helpful to look directly at a machine's screen output. To do so, install the Raritan KVM Multiplatform Client and use it to connect to 161.72.93.135.

Restart a computer after a power cut

You can switch on the computers and the KVM switch from the power switches listed on the internal links page.

Startup Procedure

  1. bring up GATE (LDAP, DNS, DHCP, Gateway (Masquerading))
  2. bring up RAID
  3. bring up NEWDAQ (NFS Home, Raid)
  4. bring up NEWDATA
  5. bring up AUX
  6. make sure that the needed mountpoints are there
  7. make sure that the needed screen sessions are running (for details, see Trouble Shooting Software)
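The ordering above can be sketched as a small helper that waits for each machine to answer ping before moving on to the next one. This is only a sketch under assumptions: the function names, `MAX_TRIES` and `RETRY_DELAY` are hypothetical, the host names are passed in by the caller, and a ping reply only shows that the machine is on the network, not that its services (LDAP, NFS, ...) are ready.

```shell
# Wait until a host answers ping, retrying every RETRY_DELAY seconds.
# Gives up after MAX_TRIES failed attempts (defaults are assumptions).
wait_for_host() {
    tries=0
    until ping -c 1 -W 2 "$1" >/dev/null 2>&1; do
        tries=$((tries + 1))
        if [ "$tries" -ge "${MAX_TRIES:-60}" ]; then
            echo "$1 did not come up" >&2
            return 1
        fi
        sleep "${RETRY_DELAY:-5}"
    done
    echo "$1 is up"
}

# Walk through the hosts in the given order, stopping at the first failure.
wait_in_order() {
    for host in "$@"; do
        wait_for_host "$host" || return 1
    done
}
```

For example, `wait_in_order gate newdaq newdata aux` would follow the order of the list, assuming the machines resolve under these names.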

Check if all system services are running

On each server:

systemctl list-units --failed

Try to restart failed services

systemctl status <service>
systemctl restart <service>
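When several services have failed, the two commands above can be combined into a loop. The following is a minimal sketch, not part of the original procedure; the function names are hypothetical, and it needs to run with sufficient privileges to restart services.

```shell
# Print the names of all failed systemd units, one per line.
list_failed_units() {
    # --plain and --no-legend drop the status markers and the header/footer,
    # so the first column of each line is the unit name.
    systemctl list-units --failed --plain --no-legend | awk '{print $1}'
}

# Show the status of every failed unit, then try to restart it.
restart_failed_units() {
    for unit in $(list_failed_units); do
        echo "== $unit =="
        # status returns non-zero for failed units, which is expected here
        systemctl --no-pager status "$unit" || true
        systemctl restart "$unit" || echo "restart of $unit failed" >&2
    done
}
```

Calling `restart_failed_units` on a server prints the status of each failed unit before attempting the restart, so the error message is still visible afterwards.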

If networking is in the failed state and the error is RTNETLINK answers: File exists, you need to flush the IP address of the corresponding device. Make sure you have KVM access to the machine first, because you might have to flush the device you are using to connect to it.

ip addr flush <device>
systemctl restart networking

mountpoints:

  • newdata: /newdaq and /home from newdaq
  • gate: /users from newdaq (home of other machines)
  • aux: /home from newdaq

If a mountpoint is missing, run sudo mount -a on the corresponding machine.
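A quick way to check the mountpoints listed above is the `mountpoint` utility from util-linux. The helper below is a sketch; the function name `missing_mounts` is hypothetical, and the paths are passed in by the caller.

```shell
# Print every given path that is not currently a mountpoint.
# Returns non-zero if at least one path is missing.
missing_mounts() {
    mm_status=0
    for mm_path in "$@"; do
        if ! mountpoint -q "$mm_path"; then
            echo "$mm_path"
            mm_status=1
        fi
    done
    return "$mm_status"
}
```

On newdata, for example, `missing_mounts /newdaq /home || sudo mount -a` would only remount when one of the two paths is missing.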

Shutdown Procedure

  1. shut down aux, gui
  2. shut down daq
  3. shut down data
  4. shut down newdaq
  5. shut down gate

Restarting a hanging PC

Symptom

  • the PC can't be reached via ssh, or something similar
  • be aware that when all computers (except for gate) seem to hang, it is normally newdaq that hangs; the others only try to mount their home from the RAID of newdaq, so they hang too
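To see which machines still respond, a short SSH probe can help. This is a sketch under assumptions: the function names and `SSH_TIMEOUT` are hypothetical, and the host names are whatever the machines resolve to on the internal network. The outer `timeout` is there because a machine waiting on a hung home mount may accept the connection and then stall during login.

```shell
# Return 0 if the host completes an SSH login and runs `true` in time.
# BatchMode makes ssh fail instead of prompting for a password.
ssh_alive() {
    timeout "${SSH_TIMEOUT:-15}" \
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Report which of the given hosts respond.
report_hosts() {
    for host in "$@"; do
        if ssh_alive "$host"; then
            echo "$host: ok"
        else
            echo "$host: unreachable"
        fi
    done
}
```

Calling `report_hosts newdaq gate aux` (assumed host names) lists each machine with its state; if only gate answers, newdaq is the likely culprit, as described above.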

Solution

If it's not too late in the night, try to call an expert before you power cycle the computers.

When you have to restart more than one PC, be sure to follow the Shutdown and Startup procedures above.

You can switch on the computers from 10.0.100.234 (see http://fact-project.org/internal.html)

or you can power cycle the hanging computer from any other computer on the FACT internal network:

  • go to /usr/local/bin
  • execute one of the following scripts: aux_off, gui_off, gate_off, daq_off, data_off
  • wait a few minutes
  • execute one of the following scripts: aux_ON, gui_ON, gate_ON, daq_ON, data_ON
  • rebooting takes a few minutes for aux, gui and gate, and about 10 minutes each for daq and data

or power cycle the hanging computer manually from the FACT-container.
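The script-based power cycle described above can be wrapped in one function. This is a sketch: `power_cycle`, `POWER_BIN` and `POWER_WAIT` are hypothetical names, and the 180-second default pause is an assumption, since the text only says to wait a few minutes.

```shell
# Directory holding the power scripts (from the text) and the pause between
# switching off and on (assumed default of 180 seconds).
POWER_BIN="${POWER_BIN:-/usr/local/bin}"
POWER_WAIT="${POWER_WAIT:-180}"

# Power cycle one machine: aux, gui, gate, daq or data.
power_cycle() {
    case "$1" in
        aux|gui|gate|daq|data) ;;
        *) echo "unknown machine: $1" >&2; return 1 ;;
    esac
    # Stop here if switching off fails, so the machine is not switched on blindly.
    "$POWER_BIN/${1}_off" || return 1
    sleep "$POWER_WAIT"
    "$POWER_BIN/${1}_ON"
}
```

For example, `power_cycle aux` runs aux_off, waits, and then runs aux_ON.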

What to do if daq is dead?

If daq dies, the qla and a few other scripts have to be moved to newdaq. Some notes can be found here: DaqDead
