LBC Troubleshooting

Please feel free to update this page with any LBC troubleshooting notes or links as appropriate. Thanks!



Figure: LBC architecture (LBCArchitecture.png)

Log Files

The following are listed in the likely order of perusal when there are problems:

| File | Where | What |
| lbc.log | on the CMU in /lbccontrol/current/log | Main LBC log file - this is the file parsed by the GUI |
| error_log, access_log | on the CMU in /var/log/httpd | these are the web server logs - GUI problems can be found here |
| lbcia.log | on the tech Windows PCs in C:\lbcfpia\src | output from the IDL procedures for image analysis; this file is appended to with each run of the lbciaRun program; contains timestamps; this file is copied to the CMU into /images/tftp/XTech/; available from the links: Rlbcia.log and Blbcia.log |
| blueYYYY-MM-DD.Log, redYYYY-MM-DD.Log | on mountain machine in /home/lbceng/FPIAlogs | numbers, fits filenames and filter names |
| AOParam.txt | on the tech Windows PCs in D:\ao | the active optics corrections calculated by the IA processing; this file is copied to the CMU into /images/tftp/XTech/; available from the links: RAOParam.txt and BAOParam.txt; this data is also logged in the lbc.log as Active Optics Pnn |
| lbciaRun.txt | on the tech Windows PCs in D:\ (which is the LBCIA_HOME env var on the PC) | output from the program lbciaRun.exe; this file is appended to with each run of the lbciaRun program; does not currently contain timestamps |
| messages | on the CMU in /var/log | The CMU system log - nfs problems, panics, IIF messages, etc. |
| filters.log, power.log, housekeeping.log, ... | on the CMU, in any directory | these are output from the testxxxx programs |
| hkblue.dat | on the CMU in /lbccontrol/current/log | Blue housekeeping data |
| hkred.dat | on the CMU in /lbccontrol/current/log | Red housekeeping data |

LBC Power Control

The LBC software uses DataProbe devices to control power to all of the hardware, including the Windows PCs. The LBC GUI does not query the power status of the components. If the software is bounced (via lbckill/lbcstart ) and the power.conf file is deleted, the GUI will reflect nothing ON, even if some components are powered-up. The software only checks the power status when a command is issued -- TURN ON/OFF. If the component is already powered up, the software will see that when it tries to power it up. On the lbcstart command, the software looks at the power.conf file to determine what it should bring up automatically. If you delete the power.conf file, it will not bring up anything and you have to TURN ON via the GUI. If something failed during a TURN ON command, the power.conf file will reflect that component as disabled and the lbcstart will not bring it up automatically.

The power.py GUI does NOT refresh automatically. It is designed that way. It has a big Refresh button at the bottom. It IS very easy to turn something on/off that you did not mean to, which is why it is just a troubleshooting/engineering tool.

The LBC software is stopped with the lbckill command. Nothing is power-cycled.
The LBC software is started with the lbcstart command. It will use the power.conf file when it starts to restore the last logged power state. If this file is deleted, nothing is powered up automatically.
The GUI commands for TURN ON/OFF control power to the components, including the Windows PCs.

When troubleshooting a problem, you must go to the log display on the GUI. The errors are available in the GUI - you can ignore all of the Status/Note/Warning messages and just look at the Errors. If it is not a power problem (there is nothing noted about power in the log messages) you can likely use lbckill, delete the power.conf, and lbcstart.

On the other hand, if you see power problems in the log during startup, you can use power.py to see the power states and determine what is actually on/off. power.py is also useful when you want to manually power a single component, as we have done with the shutter failures in the past, or when a tracker CCD is having problems or a particular filter wheel has a hardware problem.

If you cannot ping one of the dataprobes, then the LBC software will fail trying to power things up. The dataprobe may have to be manually power-cycled, via the electronics box. For example, here are pictures of the red electronics cabinet and the labeling on the red dataprobe power cables.

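Figures: RedElectronicsBox-bottom-sm.JPG (bottom of the red electronics cabinet, where dataprobes are powered) and RedDataProbePowerCables-sm.JPG (red dataprobe power cables)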

CMU Disk Full - GUI not responding, all blank

The old CMU /home disk is only 20GB. We keep the whole LBC release there which is large because of all of the documentation. It typically stays about 70 to 80% full because we keep multiple releases there too. We have seen it fill up on unusually large core files (see IT 4943). This appears to manifest itself as a hang of the GUI.

On the CMU (as user root or lbccontrol), run df -h. If the /home partition is full, check for a core.xxx file. If you see a large one (it would be in the /home/lbccontrol directory on the old CMU, or in /lbccontrol/current on the new CMU), move it to /images, or delete it if you cannot move it.
Of course, the LBC software shouldn't be core'ing, so this should not be a common occurrence.
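
The check above can be sketched in Python. This is only an illustration, not part of the LBC release: the function names are hypothetical, the directories are the ones named on this page, and the percentage is computed from total and used space, so it may differ slightly from what df -h prints.

```python
import glob
import os
import shutil


def percent_used(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def find_core_files(directory, min_bytes=0):
    """Return (path, size) pairs for core.* files in `directory`, largest first."""
    found = []
    for path in glob.glob(os.path.join(directory, "core.*")):
        if os.path.isfile(path):
            size = os.path.getsize(path)
            if size >= min_bytes:
                found.append((path, size))
    # Sort by size so the file most likely to have filled the disk comes first.
    return sorted(found, key=lambda item: item[1], reverse=True)
```

For example, percent_used("/home") followed by find_core_files("/home/lbccontrol") on the old CMU, or find_core_files("/lbccontrol/current") on the new one, would show whether a core file is the culprit. The sketch only reports; moving or deleting the file is left to the operator.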

Look in the log file for these types of messages:
watchdog on StatusThread started a new instance 

Housekeeping timeouts on both sensors (vacuum and temperature) on one side can cause things to eventually lock up. The software keeps re-trying and re-timing out:
2014/09/05 18:29:20.511874 W B HKEEPING VACUUM   vacuum sensor timeout [src/housekeeping/housekeeping.c:1212]
2014/09/05 18:29:20.511952 W B HKEEPING TEMPERAT could not read temperature value [src/housekeeping/housekeeping.c:1379]
2014/09/05 18:29:20.511975 E B                   hinibit counter updated to 36
2014/09/05 18:29:20.511990 E B HKEEPING VACUUM   pressure: 0.00E+00 [mbar]
2014/09/05 18:29:20.512033 W B HKEEPING VACUUM   hinibit state raised due to pressure (0.00E+00) above 1.00E-03mbar threshold or fake reading [src/housekeeping/housekeeping.c:1404]
2014/09/05 18:29:20.512060 E B                   hinibit counter updated to 37
2014/09/05 18:29:20.512077 E B HKEEPING TEMPERAT temperature:  0.0 [Kelvin]
2014/09/05 18:29:20.512093 W B HKEEPING TEMPERAT hinibit state raised due to temperature ( 0.0) above 240.0K threshold or fake reading [src/housekeeping/housekeeping.c:1419]
2014/09/05 18:29:20.512181 W B HKEEPING          subsystem reset start
2014/09/05 18:29:26.411589 W B HKEEPING          close vacuum port
2014/09/05 18:29:26.411619 W B HKEEPING          close temperature port
2014/09/05 18:29:39.570732 W -                   watchdog on StatusThread started a new instance
2014/09/05 18:29:42.590730 W B HKEEPING          subsystem reset completed

We should take this out or fix it! But for now, it requires a restart ( lbckill/lbcstart ) or TURNOFF/TURNON.
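
To spot this pattern quickly in a long lbc.log, a small parser can count the watchdog restarts and the sensor timeouts per side. The sketch below is an assumption-laden illustration: the line layout (date, time, one-letter severity, side B/R/-, then the message) is inferred from the excerpts on this page, and the function names are hypothetical.

```python
import re
from collections import Counter

# Layout inferred from the log excerpts on this page:
# date, time, severity letter, side (B, R or -), then subsystem columns and message.
LINE_RE = re.compile(
    r"^\s*(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<level>[A-Z]) (?P<side>[BR-]) (?P<rest>.*)$"
)

WATCHDOG = "watchdog on StatusThread started a new instance"


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def summarize(lines):
    """Count watchdog restarts and housekeeping read failures per side."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # wrapped continuation lines, "..." and blanks are skipped
        rest = record["rest"]
        if WATCHDOG in rest:
            counts["watchdog_restarts"] += 1
        elif "sensor timeout" in rest or "could not read" in rest:
            counts["timeouts_" + record["side"]] += 1
    return dict(counts)
```

Calling summarize(open("/lbccontrol/current/log/lbc.log")) would return a dictionary such as {"timeouts_B": ..., "watchdog_restarts": ...}. Repeated watchdog restarts alongside timeouts on one side suggest the lock-up described above.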

Network Disconnects

LBC has its own subnet, but is still subject to failures if there are general network outages. If everything has power, an lbckill and lbcstart should fix hanging status threads, etc. that can occur when we take an overall network hit.

For instance, when they did some network maintenance on 11-Dec-2013, we saw the following errors in the log:

2013/12/11 16:45:46.714904 W R HKEEPING VACUUM   vacuum sensor timeout [src/housekeeping/housekeeping.c:1211]
2013/12/11 16:45:47.364932 W B HKEEPING VACUUM   vacuum sensor timeout [src/housekeeping/housekeeping.c:1211]
2013/12/11 16:45:56.715227 W R HKEEPING TEMPERAT temperature sensor timeout [src/housekeeping/housekeeping.c:1299]
2013/12/11 16:45:56.715367 E R                   hinibit counter updated to 1
2013/12/11 16:45:56.715391 E R HKEEPING VACUUM   pressure: 0.00E+00 [mbar]
2013/12/11 16:45:56.715424 W R HKEEPING VACUUM   hinibit state raised due to pressure (0.00E+00) above 1.00E-03mbar threshold or fake reading [src/housekeeping/housekeeping.c:1403]
2013/12/11 16:45:56.715454 E R                   hinibit counter updated to 2
2013/12/11 16:45:56.715475 E R HKEEPING TEMPERAT temperature:  0.0 [Kelvin]
2013/12/11 16:45:56.715499 W R HKEEPING TEMPERAT hinibit state raised due to temperature ( 0.0) above 250.0K threshold or fake reading [src/housekeeping/housekeeping.c:1418]
2013/12/11 16:45:56.715618 W R HKEEPING          subsystem reset start
...
2013/12/11 16:45:58.499663 W -                   watchdog on StatusThread started a new instance
2013/12/11 16:46:18.516619 W -                   watchdog on StatusThread started a new instance
2013/12/11 16:46:38.547300 W -                   watchdog on StatusThread started a new instance
2013/12/11 16:46:52.917037 E R HKEEPING          switching error rc:9 >unable to connect TCP socket< on "192.168.59.123:3,15" [src/power3/power3.c:292]
2013/12/11 16:46:52.917129 E R HKEEPING          power cycle error with automatic retry [src/housekeeping/housekeeping.c:1187]
...
2013/12/11 16:47:49.118856 E B HKEEPING          switching error rc:9 >unable to connect TCP socket< on "192.168.59.113:3,15" [src/power3/power3.c:292]
...
2013/12/11 16:49:58.732998 W -                   watchdog on StatusThread started a new instance
2013/12/11 16:50:18.743626 W -                   watchdog on StatusThread started a new instance
2013/12/11 16:50:38.754280 W -                   watchdog on StatusThread started a new instance
2013/12/11 16:50:58.764923 W -                   watchdog on StatusThread started a new instance

Housekeeping was not logging during this time. Everything came back ok with the lbckill / lbcstart.
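
When sorting out which devices were unreachable during an outage like this one, it can help to pull the addresses out of the "switching error" lines. The following is a minimal sketch based only on the message format seen in the excerpt above. The function name is hypothetical, and the meaning of the part after the colon (for example 3,15) is not stated on this page, so it is captured but not interpreted.

```python
import re
from collections import Counter

# Matches e.g.:
#   switching error rc:9 >unable to connect TCP socket< on "192.168.59.123:3,15"
SWITCH_ERR_RE = re.compile(
    r'switching error rc:(?P<rc>\d+) >(?P<reason>[^<]*)< '
    r'on "(?P<host>[\d.]+):(?P<suffix>[^"]*)"'
)


def switching_failures(lines):
    """Return {address: number of switching errors} for the given log lines."""
    hosts = Counter()
    for line in lines:
        match = SWITCH_ERR_RE.search(line)
        if match:
            hosts[match.group("host")] += 1
    return dict(hosts)
```

Each address returned is a candidate for a ping from the CMU before deciding between an lbckill/lbcstart and a manual power cycle.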

LBC Network Switches

There are network switches in each of the two LBC electronic boxes.
During troubleshooting, always ping a device you are having trouble communicating with (make sure it's powered up first!). For instance, in Oct-2014, we had a problem with the blue portserver. In the log:
  2014/10/14 22:23:28.259913 W B HKEEPING          not possible to connect to PortServer
  2014/10/14 22:23:36.324235 W B HKEEPING          not possible to connect to PortServer
  ...

On the CMU, the ping to this device (192.168.59.112) came back with "Destination unreachable". It turned out that the switch had disabled the port because it was in an error state. To see the status of the switches in the LBC electronics boxes, log in to the switch and run the show interface status command:
 > ssh lbc@swlbcr.network
******  LBTO personnel only -- Authorized access only ******

Password: 

swlbcr>show int status

Port      Name               Status       Vlan       Duplex  Speed Type
Gi0/1     rportserver.lbc    connected    159        a-full  a-100 10/100/1000BaseTX
Gi0/2     rdataprobe1.lbc    connected    159        a-half   a-10 10/100/1000BaseTX
Gi0/3     rdataprobe2.lbc    connected    159        a-half   a-10 10/100/1000BaseTX
Gi0/4                        disabled     1            auto   auto 10/100/1000BaseTX
Gi0/5                        disabled     1            auto   auto 10/100/1000BaseTX
Gi0/6                        disabled     1            auto   auto 10/100/1000BaseTX
Gi0/7     [LBC-diag]         notconnect   159          auto   auto 10/100/1000BaseTX
Gi0/8     [DHCP]             notconnect   149          auto   auto 10/100/1000BaseTX
Gi0/9                        notconnect   1            auto   auto Not Present
Gi0/10    Uplink to core     connected    trunk      a-full a-1000 1000BaseSX SFP
swlbcr>

On the blue side, the old switch has been installed, so the output is a little different:
 > ssh lbc@swlbcb.network
lbc@swlbcb.network's password: 

**  --->  Unauthorized Access is Strictly Forbidden  <--- **

swlbcb>show int status

Port      Name               Status       Vlan       Duplex  Speed Type
Fa0/1     bportserver.lbc    connected    159        a-half  a-100 10/100BaseTX
Fa0/2                        notconnect   159          auto   auto 10/100BaseTX
Fa0/3     bdataprobe1.lbc    connected    159        a-half   a-10 10/100BaseTX
Fa0/4                        notconnect   159          auto   auto 10/100BaseTX
Fa0/5     bdataprobe2.lbc    connected    159        a-half   a-10 10/100BaseTX
Fa0/6                        notconnect   159          auto   auto 10/100BaseTX
Fa0/7                        notconnect   159          auto   auto 10/100BaseTX
Fa0/8                        notconnect   159          auto   auto 10/100BaseTX
Fa0/9                        notconnect   159          auto   auto 10/100BaseTX
Fa0/10                       notconnect   159          auto   auto 10/100BaseTX
Fa0/11                       notconnect   159          auto   auto 10/100BaseTX
Fa0/12                       notconnect   159          auto   auto 10/100BaseTX
Fa0/13                       notconnect   159          auto   auto 10/100BaseTX
Fa0/14                       notconnect   159          auto   auto 10/100BaseTX
Fa0/15                       notconnect   159          auto   auto 10/100BaseTX
Fa0/16                       notconnect   159          auto   auto 10/100BaseTX
Fa0/17                       notconnect   159          auto   auto 10/100BaseTX
Fa0/18                       notconnect   159          auto   auto 10/100BaseTX
Fa0/19                       notconnect   159          auto   auto 10/100BaseTX
Fa0/20                       notconnect   159          auto   auto 10/100BaseTX
Fa0/21                       notconnect   159          auto   auto 10/100BaseTX
Fa0/22                       notconnect   159          auto   auto 10/100BaseTX
Fa0/23                       notconnect   159          auto   auto 10/100BaseTX
Fa0/24                       notconnect   159          auto   auto 10/100BaseTX
Gi0/1                        connected    trunk      a-full a-1000 1000BaseSX
Gi0/2                        notconnect   1            auto   auto unknown
swlbcb> 

If you see a port that should be up, but is disabled, contact the IT group and get it reset.

Using Minicom

minicom lets you connect directly from the CMU to the serial devices for troubleshooting. The important things to know before using minicom:
| Blue Port | Description | Red Port | Protocol |
| /dev/ttyB00 | Housekeeping Vacuum | /dev/ttyR00 | 9600 8N1 |
| /dev/ttyB01 | Housekeeping Thermometer | /dev/ttyR01 | 2400 8N1 |
| /dev/ttyB02 | Balluff | /dev/ttyR02 | |
| /dev/ttyB03 | Filter Wheel 1 (lower), closest to primary | /dev/ttyR03 | 9600 8N1 |
| /dev/ttyB04 | Filter Wheel 2 (upper), closest to dewar | /dev/ttyR04 | 9600 8N1 |
| /dev/ttyB05 | Rotator | /dev/ttyR05 | |
| /dev/ttyB06 | Rotator Backlash Recovery | /dev/ttyR06 | |
| /dev/ttyB07 | Goya | /dev/ttyR07 | 38400 8N1 |
| /dev/ttyB08 | Shutter | /dev/ttyR08 | 38400 8N1 |
| /dev/ttyB09 | Focuser (L2 motion) | /dev/ttyR09 | 9600 8N1 |
| /dev/ttyB10 | Mitutoyo (L2 readout) | /dev/ttyR10 | 9600 8N1 |

  • If you start minicom with the -s (settings) option ( minicom -s ), it will not fail even if it was left in a bad configuration
  • Once in minicom, ctrl/A o lets you set the serial port configuration
  • Always enter ctrl/A e in minicom to enable echoing of commands and answers to the screen
  • If you do not save the configuration, it will go back to the previous settings on exit
  • Sometimes minicom wants commands in CAPITAL LETTERS (only shutter? filter motors and encoders recognize lowercase). If it does not seem to recognize a command you are typing, try it in uppercase.
  • minicom does not recognize DEL or BS as such. So, if you mistype a command just hit <ENTER> and type it again
  • see IT 2532 for good notes on shutter troubleshooting with minicom
  • see also: LBC Software Testing notes for filters
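
As a convenience, the port table above can be turned into a small lookup that builds the minicom command line for a given side and device. This is a sketch under stated assumptions: the helper name and the short device keys are made up for illustration, and it assumes the minicom installed on the CMU accepts the -D (device) and -b (baud rate) options. The 8N1 framing is not set here and would still be checked with ctrl/A o.

```python
# Port number suffix and baud rate per device, copied from the table above.
# A baud rate of None means the table gives no protocol for that device.
SERIAL_PORTS = {
    "vacuum": ("00", 9600),
    "thermometer": ("01", 2400),
    "balluff": ("02", None),
    "filter wheel 1": ("03", 9600),
    "filter wheel 2": ("04", 9600),
    "rotator": ("05", None),
    "rotator backlash": ("06", None),
    "goya": ("07", 38400),
    "shutter": ("08", 38400),
    "focuser": ("09", 9600),
    "mitutoyo": ("10", 9600),
}


def minicom_command(side, device):
    """Build the minicom argument list for side 'B' or 'R' and a device key."""
    if side not in ("B", "R"):
        raise ValueError("side must be 'B' (blue) or 'R' (red)")
    number, baud = SERIAL_PORTS[device]
    command = ["minicom", "-D", "/dev/tty" + side + number]
    if baud is not None:
        command += ["-b", str(baud)]
    return command
```

For instance, " ".join(minicom_command("B", "shutter")) gives the command to open the blue shutter port at the baud rate listed in the table.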

Running IDL Standalone

The LBC instrument control software uses IDL to perform Zernike corrections. The same IDL software is used for the DOFPIA program.

See http://wiki.lbto.org/bin/view/Software/LBCSoftwareDescription#IDL_Code for descriptions of the main functions.

The IDL code used by the real-time system is also installed in /home/lbcobs/LBCFPIA/lbcfpia/src . You need to be logged in to a machine with access to the IDL programs and the FITS files. This example assumes you log in as lbceng to one of the mountain OBS machines. You have to have a tec config file local for the examples below.
setenv IDL_DIR '/lbt/astronomy/idl'
setenv IDL_PATH '+/lbt/astronomy/idl/lib:+/lbt/astronomy/stow/astron/astron:+/home/lbcobs/LBCFPIA/lbcfpia/src:+/home/lbcobs/LBCFPIA/lbcfpia/lib/mpfit'
cp /home/lbcobs/LBCFPIA/lbcfpia/src/lbcfpia_tec.cfg .
idl

Use a specific dated directory to find files, and use a time that you know will find some files that have dates greater than the input date.
In this example, the cur_time is 02:34:11 (023411), and it finds files 023249 through 024242:
    IDL> dotecia, 1, 'r', 100.00, datadir='/Repository/20141210/tec', today='20141210', cur_time=23411      

To give it a list of files (make sure you have the updated lbcfpia.pro for this option):
    IDL> dotecia, 1, 'r', 100.00, datadir='/Repository/20141210/tec',  fileList='lbcrtec.20141210.015831_2.fits lbcrtec.20141210.015818_2.fits' 

It uses datadir to get the files, so do not put the full path in fileList.

It writes to a local lbcia.log file.
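
Since fileList must contain bare file names, a small helper can strip the directory from full paths and assemble the IDL call. This is a hypothetical convenience, not part of lbcfpia. The leading arguments (1 and 100.00) are copied verbatim from the examples above without interpretation.

```python
import os


def idl_file_list(paths):
    """Return the space-separated bare file names expected by fileList."""
    return " ".join(os.path.basename(path) for path in paths)


def dotecia_call(side, datadir, paths):
    """Return the dotecia command line to paste at the IDL> prompt."""
    return ("dotecia, 1, '{0}', 100.00, datadir='{1}', fileList='{2}'"
            .format(side, datadir, idl_file_list(paths)))
```

Passing the two full paths under /Repository/20141210/tec from the example above reproduces the fileList call shown there.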

Shutdown for Power Outage

If a power outage is expected, turn OFF the LBC systems, stop the software, and power off the CMU:
  1. from the LBC user GUI, turn OFF all the systems, including housekeeping
  2. login to the CMU as telescope, lbccontrol or root
  3. lbckill in the CMU window
  4. poweroff in the CMU window

Computer Failure

Windows PCs

If we experience a Windows PC failure during the night, it would have to be a motherboard failure to require swapping in one of the spares -- see Procedure for Swapping in a Spare LBC Computer which applies to the Windows PCs. There is a configured spare in the cabinet on 3L. It does not have a CCD controller card, so the failed PC's card will have to be put into the spare machine.

The four computers are configured with RAID, so hard disk failures are not catastrophic - the system should fail over to the mirrored drive automatically.
Mountain IT personnel periodically check all the machines on the mountain for orange/yellow lights on (which could mean a failed drive). There is no other way to know we have failed over.

We have drives and power supplies in one of the LBC cabinets on 3L.

The Windows PCs are easy to replace - they are all identical in configuration except for their IP address. They all have the same 3rd party software requirements (see Software Installation Procedure on the lbccontrol site, and click on the Windows links). The installed software does not change. The LBC application software does change, but it is downloaded from the CMU on each reboot, so it does not have to be kept current on the spare.


CMU

The CMU was upgraded to newer hardware and OS in Summer-2014. At that time, we sent the identical spare to Rome for use by INAF for the development of the new BeagleBone CCD controllers. This machine is still in Rome, so we do not currently have a CMU spare configured and ready to go. However, since it now runs a current OS on current hardware, it can be rebuilt on almost any machine. To rebuild a new machine, we have captured a snapshot of the system (Sep-2015). This snapshot is available on Google Drive and is also copied to cmuSystem.tar in /home/ksummers/lbc in Tucson. This is a tar file containing compressed snapshots of the system filesystems of the CMU machine, suitable for restoration with restore . It does not include the /lbccontrol and /images partitions - these can be recreated from SVN.

/lbccontrol/version would be built from SVN according to: Software/LBCBuildAndInstall

/images files can be copied from /lbccontrol/version/src/windowspc/images

Using low-level SW to reset the controller

Sometimes a controller reset is needed. Turning "other systems" off and back on again will do this, but it is slow, taking 5-6 minutes while the Windows PCs boot up. A more direct way follows:
  • Login to the CMU as user root (you need to be root to make changes using the power.py GUI).
  • Issue an lbckill to stop the LBC SW daemon.
  • Run up the power.py GUI and click "Refresh". The power.py GUI is in /lbccontrol/current/ and for each computer or mechanism, there will be two buttons: on and off.
  • On the power.py GUI, the Red CCD controller should be ON (green background on the ON button). Click the OFF button to turn it off and verify by clicking "Refresh" (red background on the OFF button). Then click the ON button to turn the controller back on and verify by clicking "Refresh".
  • Finally, issue an lbcstart to restart the SW.
  • In the LBC user interface, you should see the message that the systems are restarting after a crash and after about a minute all systems should report "Ready".

Notes while Troubleshooting Specific Problems

BlueFilterWheel1Error (IT 7294)

LBCR electronics: vertical smearing and counts topping out well below 65k (IT 8643, IT 4114, IT 6504)

These issue tracker entries describe the problem where the peak count level tops out around 32000 (not exactly 2^15 = 32768, but close) and both stars and background are smeared vertically. This may affect one chip rather than all of them, and only a portion of that chip. The problem seems to arise spontaneously (there was some thought that it was related to a change of readout configuration, but it doesn't always follow that) and, once present, persists. To recover, it is necessary to reset the controller. This may be done by:
  • turning off and back on again "other systems". This will take ~5-6 minutes, the time to boot the Windows PCs.
  • using low-level software to reset the controller, following the steps listed under "Using low-level SW to reset the controller" above.

Topic revision: r39 - 28 Sep 2022, OlgaKuhn