Troubleshooting Procedures for MODS team and LBTO staff

These procedures are documented for use by the MODS team and LBTO staff. They were separated from the Troubleshooting page on the Partner Observing wiki, which is available to all observers.

NB: The isis commands can be issued from the lbto account on an obsN computer, or from the mods1data or mods2data machines. If you run them from modsNdata, you do not have to enter the MODS id on the command line (e.g., -m 1 or --mods1).

Power cycling the guider and WFS camera controllers

The ability to power cycle the guider and WFS camera controllers used to be on the MODS GUI, in the Utilities page. Since the beginning of 2013B, a new version of the MODS GUI has been in use, and it no longer has the buttons to turn off and on these controllers. These functions must now be done from the command line.

Logged in as lbto to an obsN workstation,
  • to determine the status of the MODS1 guider or WFS camera controllers, type
    • isisCmd --mods1 m1.ie util agc or similarly, isisCmd m1.ie util agc -m 1
    • isisCmd --mods1 m1.ie util wfs or similarly, isisCmd m1.ie util wfs -m 1
    • if it is on, it will respond as in the following example for the agc:
lbto@obs3:2 % isisCmd --mods1 m1.ie util agc
DONE: UTIL AGC=ON AGC_BRK=OK
    • when either of these is off, the breaker state cannot be determined and is reported as UNKNOWN.
lbto@obs3:6 % isisCmd -m 1 m1.ie util agc 
DONE: UTIL AGC=OFF AGC_BRK=UNKNOWN
  • to turn on the guider or WFS camera controller on MODS1, type
    • isisCmd -m 1 m1.ie util agc on or
    • isisCmd -m 1 m1.ie util wfs on
  • and to turn off the guider or WFS camera controller on MODS1, type
    • isisCmd -m 1 m1.ie util agc off or
    • isisCmd -m 1 m1.ie util wfs off
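
If these status checks are ever scripted, the replies shown above can be split into their KEY=VALUE parts. The following is only a sketch: the function name parse_isis_reply is made up for illustration, and it assumes replies look like the examples on this page (a status word, a colon, the command name, then space-separated KEY=VALUE pairs with no spaces inside the values).

```python
def parse_isis_reply(reply):
    """Split a reply such as 'DONE: UTIL AGC=ON AGC_BRK=OK'.

    Returns (status, command, fields). This helper is hypothetical and
    assumes the reply layout seen in the examples on this page.
    """
    head, _, rest = reply.strip().partition(":")
    tokens = rest.split()
    command = tokens[0] if tokens else ""
    fields = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:  # keep only KEY=VALUE tokens
            fields[key] = value
    return head, command, fields


status, command, fields = parse_isis_reply("DONE: UTIL AGC=OFF AGC_BRK=UNKNOWN")
print(status, command, fields)
```

A controller that is off can then be recognized by fields["AGC"] == "OFF", in which case the UNKNOWN breaker state is expected.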

Instrument Authorization fails with GCS error

At the beginning of the night or after reconfiguring to MODS, the instrument authorization may sometimes fail with a GCS error. In this case, the OSA and/or support astronomer will need to assist.

The diagnostic and troubleshooting steps below are a first line of attack, but the problem could be deeper (e.g. a failed power supply, or switched or poorly seated fibers). Some relevant issue tracks are ITs 8365, 8132, and 7517. These are just the MODS-AGW ITs, but similar problems have arisen for PEPSI and LUCI. For more details, the AGWCheatSheet is a good place to start.
  • VNC into the azcamserver to see if there are any error messages in the terminal windows entitled 5g, 5w, 6g and 6w. (5g and 5w are for MODS1 and 6g and 6w, for MODS2).
    • vncviewer azcamserver
  • If there are errors, record these and reboot the azcamserver from the Windows Start menu (select "Restart") at the lower left.
    • After this, the OSA will need to restart GCS.
    • You can VNC back into the azcamserver and watch the dialog in the window to see if the error reoccurs.
  • If there is still an error, then reboot mods1-cam or mods2-cam. These are in CRB rack 13. After that, reboot the azcamserver and restart GCS.
  • If there is still an error after rebooting both mods1-cam (or mods2-cam) and the azcamserver and restarting GCS, then there may be a more serious problem.

Power cycling the HEB

Here is the procedure for power-cycling an HEB. MODS1 red is used in this example. Replace 1 with 2 for MODS2, and red with blue (heb_r with heb_b, m1.rc with m1.bc) for the blue channel.

Simplest Solution (you are on-site and can get to the KVMoIP console):

Use the KVMoIP console to connect to mods1data:

isisCmd m1.ie util heb_r off

... count to 30 ...

isisCmd m1.ie util heb_r on

And then using the KVMoIP console, get to the M1.RC keyboard/monitor mode, quit from the IC program (type "quit"), then restart it (type "ic" at the DOS prompt).

If all goes well, this does the full restart and you are done.

If you are off-site and/or cannot reach the KVM:

1) Power cycle the HEB

Logged in as lbto@obsN:

isisCmd -m 1 m1.ie util heb_r off

... count to 30 ...

isisCmd -m 1 m1.ie util heb_r on

Then re-init the controller with the IC program still running

isisCmd -m 1 m1.rc seqinit

This should work the first time, but about half the time it does not, and you see the message: *** ERROR: SEQINIT Can't Initialize UARTs. Head Electronics UARTs are disabled. Try SEQINIT. EXPSTATUS=DONE. It may take 2-10 repeats before it "catches". We don't know why...
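
The repeat-until-it-catches behavior can be written as a small retry loop. This is a sketch only: seqinit_with_retry and its send argument are hypothetical names, and send stands in for whatever actually issues the seqinit command and returns the reply text. The limit of 10 tries is taken from the "2-10 repeats" quoted above.

```python
def seqinit_with_retry(send, max_tries=10):
    """Call send() until the reply starts with 'DONE:' or max_tries is reached.

    send is a caller-supplied (hypothetical) function that issues the
    seqinit command and returns the reply text.
    Returns (succeeded, attempts_made).
    """
    for attempt in range(1, max_tries + 1):
        reply = send()
        if reply.strip().startswith("DONE:"):
            return True, attempt
    return False, max_tries


# Demo with canned replies instead of a real instrument connection.
canned = iter([
    "*** ERROR: SEQINIT Can't Initialize UARTs.",
    "*** ERROR: SEQINIT Can't Initialize UARTs.",
    "DONE: SEQINIT UARTINIT Done.",
])
outcome = seqinit_with_retry(lambda: next(canned))
print(outcome)
```

If the loop gives up after max_tries, stopping and restarting the IC program from the KVM is the fallback.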

When you get SEQINIT to succeed, send these commands to make sure the HEB is set up properly:

isisCmd -m 1 m1.rc tedpower on

isisCmd -m 1 m1.rc igpower on

isisCmd -m 1 m1.rc ledpower off

then you should be able to type

isisCmd -m 1 m1.rc estatus

and see real status info.

These are more-or-less the steps (less seqinit) that the ccdInit.pro script executes, so you could as easily run that script with execMODS instead of doing the other bits after SEQINIT.

Similarly from the GUI console, you could issue the commands as

"red seqinit"

or from modsCmd if the GUI is running

modsCmd red seqinit

(This is mods2Cmd for MODS2.)

As you can see, if you can just stop/restart the IC program after HEB power cycling, life is usually better.

On a few rare occasions, stopping and restarting the IC program does not initialize the sequencer correctly. In that case, repeat the SEQINIT command until it takes, then issue the three power commands.

The reason for the three power commands is that the default modes on power-up are to have the internal thermo-electric cooler (the "TED") off. With the TED off, the inside of the box gets hotter than it would be otherwise (it is part of the internal box cooling system). The IGPOWER command turns on the vacuum ionization gauge. If this is off, you'll get bogus pressure readings on the dewar vacuum vessel. By default, LED power is off, but this just makes sure (the LEDs are diagnostic LED readouts on the HEB proper - you don't want them shining inside the instrument. This is a very remote possibility, so this is a sanity check).

Swapping in a spare IC

There should be hot spares of the 3 types of MODS computers in the MODS rack in CRB: (1) the mods instrument server (modsN machine); (2) the mods data server (modsNdata); and (3) a MODS Instrument Control Computer (IC).

Instructions for swapping in a spare IC, in this case for M1.BC, as documented in IT 7238, are below.

  1. Power down the MODS 1 computers
  2. From the back of the rack, disconnect all the cords, cables, etc. from the back of the Spare IC. From left to right, it probably goes: power cord, keyboard, mouse, serial cable, VGA cable, SCSI cable, fiber optic cable. Pay extra attention that the fiber optic cable is hanging clear of moving parts and that it does not get pulled, twisted, kinked, pinched or otherwise abused during the entire swap. Stop and double check if you aren’t sure the fibers are safe.
  3. From the front of the rack, pull the spare computer straight out on its rails until it clicks into place.
  4. Find the release toggle on the rail on each side. Push the tab on the left one up, and the tab on the right one down while pulling the computer straight out and free of the rails. The computer weighs 50 pounds, so be sure you are ready to support that weight when it comes free. Get assistance if you aren’t sure.
  5. Put the spare computer somewhere out of the way.
  6. Remove the MODS 1 Blue IC from the shipping box.
  7. Line up the rails on the IC with the rails sticking out of the rack and push the computer in.
  8. If the computer hangs up partway into the rack, pull it back out until the locking tabs click and then push back in.
  9. Reconnect the cords and cables to the MODS 1 Blue IC.
  10. Power up the MODS 1 computers and restart the software.
    • Through the Raritan, you should be able to connect to M1.BC to verify that the IC program is running, and to start it, if it is not.
    • If you see "No video signal", then check that the video cable from the spare IC has been plugged into the correct VGA connector.
  11. From the console of the mods data machine (mods1data in this case), restart the appropriate caliban (cb_blue in this case).
  12. Put the Spare IC in the shipping box, seal up the box, put it downstairs in the lobby area and let Bonnie or Kara and Jerry Mason (OSU) know that it is ready to ship back to OSU.
  13. Configure the IC just installed to serve as the CCD Control Computer for the specific channel, e.g. MODS1B, MODS1R, MODS2B or MODS2R.
    1. See these Instructions for configuring an IC which Jerry provided.

MODS Raritan (KVM)

The MODS Raritan (KVM) is in the MODS rack in CRB. The picture below shows the KVM console pulled out and ready to use (this picture is of the old, pre-August 2021, raritan). MODSrack.jpg

The Raritan can be used directly in CRB and, when off-site, it can be accessed through a web interface. Using a Google Chrome browser (it does not work with Firefox, and I have not tested it with other browsers), go to the URL 192.168.139.19. You need to be on the LBTO network to access the page. Log in as user mods using the standard "mods" credentials. This should bring you to the KVM port selection menu. Left-click on the name of the machine whose console you wish to view and interact with. This will bring up a "connect" button; click on it and a new window should pop up, which will show you the console screen after some display adjustments.

In Computer Room B (check whether this procedure is still valid for the new Raritan): Pull out and unfold the LCD monitor and keyboard and turn it on. Two taps on the Scroll Lock key will bring up the KVM port selection menu. Use the touch pad at the lower right of the keyboard to select the machine you want to monitor, left click, and then left click the "Connect" button that appears on the screen.

When you are done, please remember to disconnect from any port which you have used, as only one connection to a port is allowed.

Re-establishing the sequencer connection

After the HEB is power cycled, or after the fibers from the HE to the Instrument Control Computer have been swapped, the HE-to-sequencer connection for that channel will need to be re-established. To ascertain whether or not the sequencer connection is down, you can look for a red outline around the Housekeeping icon on the left side of the mods GUI and the word "OFF" in red letters to the right of the SEQ in question. If the GUI is not running, then you can check the status by issuing the command:

isisCmd --mods1 m1.bc status

which will return a long line with "+" signs in front of the words SEQ and HE (+SEQ, +HE) if the sequencer connection is good, but "-" signs (-SEQ -HE) if the connection is down.
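
A scripted version of this check only has to look for the signed SEQ and HE flags. The sketch below assumes the status line is a set of space-separated tokens; the full format of the line is not reproduced on this page, so the sample strings in the demo are made up, and sequencer_link_up is a hypothetical helper name.

```python
def sequencer_link_up(status_line):
    """Interpret the SEQ and HE flags in a status line.

    Returns True if both +SEQ and +HE are present, False if either
    -SEQ or -HE is present, and None if a flag cannot be found at all.
    Assumes space-separated tokens (an assumption, not a documented format).
    """
    tokens = status_line.split()
    result = True
    for name in ("SEQ", "HE"):
        if "-" + name in tokens:
            return False
        if "+" + name not in tokens:
            result = None
    return result


# Made-up sample lines; a real status line is much longer.
print(sequencer_link_up("DONE: STATUS +SEQ +HE"))
print(sequencer_link_up("DONE: STATUS -SEQ -HE"))
```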

There are two ways to re-establish the HE to sequencer connection:

Restarting the IC program:
  • Using the Raritan, connect to the channel whose sequencer is down. If it is MODS1 Blue, for example, connect to the console of M1.BC.
  • Most likely there will not be a prompt. Type "quit" and you should see a DOS prompt.
  • Type "ic" at the prompt. Wait about 1-2 minutes and you should see the DEWAR PRESSURE and TEMPERATURES reporting.

OR

Doing a "seqinit":
  • From a terminal logged in as lbto on an obsN computer, type:
    • isisCmd --mods1 m1.bc seqinit
    • if successful, it will return DONE: SEQINIT UARTINIT Done.
    • if not, try again. Sometimes it takes a few tries.
  • Replace the 1 with a 2 for MODS2 instead of MODS1 and the "b" with an "r" for the red instead of the blue channel.

AND

After re-establishing the sequencer connection, remember to reset the power states of the vacuum ionization gauge, the thermo-electric device and the LED. modsNWake does this, but it is best just to issue the following 3 commands after the seqinit, in case you will not be using MODS right away.

In summary, the seqinit and the 3 power commands:

lbto@obs4:2 % isisCmd --mods1 m1.bc seqinit
DONE: SEQINIT UARTINIT Done.
lbto@obs4:3 % isisCmd --mods1 m1.bc tedpower on
DONE: TEDPOWER TEDPower=ON
lbto@obs4:4 % isisCmd --mods1 m1.bc igpower on
DONE: IGPOWER IGPower=ON
lbto@obs4:5 % isisCmd --mods1 m1.bc ledpower off
DONE: LEDPOWER LEDPower=OFF
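
Since the same four commands apply to every channel with only the MODS number and channel letter changed, they can be generated rather than retyped. This is a sketch that only builds the command strings in the form shown above; recovery_commands is a hypothetical helper, and it does not run anything.

```python
def recovery_commands(mods, channel):
    """Return the seqinit and three power commands for one channel.

    mods is 1 or 2; channel is "b" (blue) or "r" (red). The command
    forms are copied from the summary on this page.
    """
    if mods not in (1, 2) or channel not in ("b", "r"):
        raise ValueError("mods must be 1 or 2 and channel 'b' or 'r'")
    prefix = f"isisCmd --mods{mods} m{mods}.{channel}c"
    steps = ["seqinit", "tedpower on", "igpower on", "ledpower off"]
    return [f"{prefix} {step}" for step in steps]


for line in recovery_commands(2, "r"):
    print(line)
```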

Both blue and red IEBs need to be powered on even if only one CCD channel is active

Both the Red and Blue instrument [mechanism] electronics boxes, in addition to controlling the mechanisms for their respective channels, also control a share of the mechanisms common to both channels above the focal plane that have hardware interlocks:
  • The Blue IEB controls the calibration tower and AGW stage system (x, y, focus, and filter wheel). The xy stage and calibration tower are interlocked.
  • The Red IEB controls the mask insert and select mechanisms (interlocked), and the hatch and the dichroic select (independent)

Cleaning mods1data and mods2data

MODS1 and MODS2 data are stored under /lhome/data on mods1data and mods2data, respectively. These can fill up several times per semester (/lhome has 131 GB available and a single 8Kx3K MODS image is 49 MB, which means that /lhome can hold just over 650 dual grating binocular MODS images).
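
The capacity estimate can be reproduced with a few lines of arithmetic. The sketch below reads the quoted sizes as gigabytes and megabytes (decimal) and assumes that one dual grating binocular exposure counts as four files (2 MODS x 2 channels); that file count is an assumption chosen because it matches the quoted figure.

```python
DISK_MB = 131 * 1000      # /lhome capacity, taking 131 GB as 131,000 MB
IMAGE_MB = 49             # one 8Kx3K MODS image
FILES_PER_EXPOSURE = 4    # assumption: 2 MODS x 2 channels

files = DISK_MB // IMAGE_MB
exposures = files // FILES_PER_EXPOSURE
print(files, exposures)
```

Under these assumptions the disk holds 2673 single images, or 668 four-file exposures, consistent with "just over 650".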

Before a run in which MODS will be heavily used, or about once a month, it is good to check the disk space and delete files which have been confirmed to be in the archive, using a script that compares file checksums. On a more frequent basis, it is good to organize the files so that a listing of /lhome/data will turn up only the most recent files: in the past, files were manually sorted into UT date subdirectories but now files are all moved into a subdirectory called images.

Current archive-checking scheme:

Since 2021, we have been using the Python script, modsimgchk, to compare checksums on files in modsNdata and in the archive. This script accesses the archive in Tucson and I am running it from shell.tucson.lbto.org. Prior to running the script, I move all of the images to check to a subdirectory of /lhome/data called images; however, the organization on modsNdata and the subdirectory naming could be different. A sample command line syntax is as follows:

modsimgchk '/lhome/data/images/' --mods 2 --rstdir '/lhome/data/Clean'

where the first argument is the directory on the modsNdata machine that contains the images to be checked and the last argument indicates the directory into which the images will be moved, sorted into the following subdirectories based on the checksum results:
  • CheckResult.PASS;
  • CheckResult.MISSING;
  • CheckResult.MD5MSMT; and
  • CheckResult.WARPASS.
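
For orientation, the kind of comparison behind the first three result names can be sketched as follows. This is not the modsimgchk code, whose internals are not described here: md5sum and classify are hypothetical helpers, the archive listing is assumed to be a dict of filename to MD5 digest, and the WARPASS case is not modelled because its meaning is not given on this page.

```python
import hashlib
import os
import tempfile


def md5sum(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(path, archive_md5):
    """Compare one local file against {filename: md5} from the archive."""
    name = os.path.basename(path)
    if name not in archive_md5:
        return "MISSING"
    return "PASS" if md5sum(path) == archive_md5[name] else "MD5MSMT"


# Demo with a throwaway file; the archive listings here are made up.
with tempfile.TemporaryDirectory() as demo_dir:
    image = os.path.join(demo_dir, "example.fits")
    with open(image, "wb") as handle:
        handle.write(b"not a real image")
    good = {"example.fits": md5sum(image)}
    results = [
        classify(image, good),
        classify(image, {"example.fits": "0" * 32}),
        classify(image, {}),
    ]
print(results)
```

Only files that land in the PASS group should be considered for deletion.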

Old archive-checking scheme (~June 2020 until )

Under /home/modsNdata/bin/ there is a script, modsdatacleaning.sh, to sort the files in /lhome/data into UTdate subdirectories, and then, for every image that is older than N (20 or 30 - configurable) days, compare its checksum on the data server (modsNdata) with the checksum in the archive. Because additional header information is added when the image arrives on /newdata, this script makes a temporary copy of the file before running a checksum on it. The script creates directories safetodelete/, checksumfail/ and questionable/ under /lhome/data, and it moves the images whose checksums on modsNdata and in the archive agree to safetodelete/, images whose checksums do not agree to checksumfail/ and images which appear on modsNdata but not in the archive to questionable/. A user can safely delete the entire safetodelete/ directory when running low on space. The images in the other two directories will need to be examined to determine the reason for the discrepant checksums or why they did not get ingested into the archive.

Mechanism timeout and red instead of grey background on MODS GUI

The text below describes the symptoms and troubleshooting steps to diagnose and resolve a mechanism timeout error.

Symptom:

A command to move a mechanism ends with a TIMEOUT error, and, on the GUI, the box that usually displays the mechanism position has a red background,

e.g. the "instconfig red grating" command on MODS1 ended with the error: WARNING: **FAULT** - RGRATING RGRATING=TIMEOUT cannot read from 192.168.139.101:8006

Possible areas of failure:
  • comtrol unit: There are 6 comtrol units in each MODS: 3 in the blue instrument electronics box (IEB) and 3 in the red IEB. 2 of the 3 are used to communicate with the MicroLYNX -7 programmable stepper motor controllers that operate all MODS mechanisms. The other one is used to acquire data from the IMCS Germanium quad cell sensor in each CCD Head Electronics Box. [MODS Network Infrastructure Document, 2014 April 14, Pogge, Mason & Gonzalez, I603o00680.pdf]
  • microLynx controller
  • N-tron ethernet switch (IT 7929) in the IEB distributes Ethernet to the comtrol units and the WAGO units in that IEB. When it was failing, we would see the IEB go offline.

  • If only a single mechanism has failed, it is likely the microLYNX controller.
  • If multiple mechanisms have failed and there has been a power outage that affected the computer room but not the instrument, or if modsNdata has crashed hard, then check the comtrol device first.
  • If multiple mechanisms have failed, then it could also be symptomatic of a faulty N-tron ethernet switch (IT 7929).

Diagnostic steps
  • Run pingMODS# where # is the number of the affected MODS and note whether any systems are OFFLINE
    • All systems ONLINE:
      1. Check the power status of the associated MicroLYNX controller. Use this table of MicroLYNX controller assignments to identify the MicroLYNX controller involved.
      2. Then power cycle it (whether the status is "ON" or "OFF"):
        • isisCmd --modsN mN.ie ieb [b|r] mlc [#] off
        • isisCmd --modsN mN.ie ieb [b|r] mlc [#] on
      3. And reset the affected mechanism:
        • isisCmd --modsN mN.ie rgrating reset (or bgrating reset, bfilter reset)
        • mechanism names appear in the 6th column of the MicroLYNX controller assignments table.
      4. If a power cycle and reset works, you may be able to continue for the short term; however, this is usually a sign of a failing microLYNX controller, and it should be replaced, or the cable routed to a spare in the IEB, within a day or two. Note that Rick Pogge needs to be consulted when a microLYNX controller is replaced or a spare is used, since he is the only one who can upload the mechanism-specific code. See Replacing a microLYNX controller below.
    • If pingMODS# shows a comtrol unit OFFLINE:
      1. Reset the associated comtrol device. Note that after resetting the comtrol device, the IE service has to be restarted. Rebooting the comtrol device is usually only indicated when modsNdata crashes hard or the power to the computer room has been lost, but the power to the instrument remains. This used to happen more frequently when the computer room UPS did not work well.
    • If pingMODS# shows 3 comtrol units and a WAGO OFFLINE
      1. This could indicate a problem with the Ethernet switch (IT 7929) or the 24V power supply.
      2. Try cycling power to the IEB, after which you'll have to restart the IE service and run the coldStart Script
        • isisCmd --modsN mN.ie ieb_[b|r] off
        • isisCmd --modsN mN.ie ieb_[b|r] on
        • modsN stop ie
        • modsN start ie
        • execMODS --modsN ~modseng/modsScripts/Support/modsNColdStart.pro

Replacing a microLYNX controller

  • This is a daytime task and Rick Pogge needs to be consulted, since he is the only one who can upload the mechanism-specific code to the spare controller that is swapped in.
  • The procedure is in the Vault "MicroLYNX exchange" 603s112.
  • Ask someone to go up to MODS and check the light next to the affected microLynx. Is it green or red?
    • The telescope needs to be at zenith and the direct Gregorian rotator at the rotator service position for the IEB involved. For MODS1, this is 345 deg for the blue IEB and 285 deg for the red IEB. For MODS2, 165 deg for the blue IEB and 105 deg for the red IEB.
    • Remove the cover of the IEB to reveal the array of microLYNX controllers and the lights next to them.
  • If the light is red, try to power cycle the controller:
    • isisCmd --modsN mN.ie ieb c mlc # off
    • isisCmd --modsN mN.ie ieb c mlc # on
  • If it still is red, then the MicroLYNX controller may need to be replaced or the cable rerouted to use one of the spare MicroLYNX controllers already in the IEB. Please consult the full procedure for this and coordinate the activity with Rick Pogge, since he will need to upload the mechanism-specific code to the spare controller and, if the cable swap has been made, also to edit the mechanism.ini file.

Dewar pressure(s) reading zero

If the pressure in one of the dewars is reading zero or N/A, the ionization gauge (IG) may have had a glitch or may be having some other problem. This may show up in the ISp's daily instrument monitoring, on the housekeeping page of the MODS GUI, or in the output of the command modsTemps 1 red or modsTemps 1 blue.

Recovery procedure, using mods1 blue for the example. Remember that from the lbto account on obsN, use the --modsN option, but don't use this option if logged in as mods on modsN or modsNdata.

  1. Send command: isisCmd --mods1 m1.bc igpower off
  2. Count to 5
  3. Send command: isisCmd --mods1 m1.bc igpower on
  4. Wait 30+ seconds (updates on a slow internal cycle).
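  5. Verify that the pressure reading has recovered (on the housekeeping page of the MODS GUI, or with modsTemps 1 blue).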
  6. For mods2, replace --mods1 with --mods2, and m1.bc with m2.bc
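
The off, wait, on, wait pattern in this procedure is the same one used for the TED and the HEB, so it can be captured once. The following is a sketch under stated assumptions: power_cycle and its send argument are hypothetical, and send(target, text) stands in for whatever delivers one command (for example by invoking isisCmd) and returns the reply. The sleep function is injectable so the logic can be tried without waiting.

```python
import time


def power_cycle(send, target, command, off_wait, settle_wait, sleep=time.sleep):
    """Send '<command> off', wait, send '<command> on', wait again.

    send(target, text) is a caller-supplied (hypothetical) function that
    delivers one command and returns the reply text.
    Returns the (off_reply, on_reply) pair.
    """
    off_reply = send(target, f"{command} off")
    sleep(off_wait)
    on_reply = send(target, f"{command} on")
    sleep(settle_wait)
    return off_reply, on_reply


# Demo with a fake sender and no real waiting.
sent = []


def fake_send(target, text):
    sent.append((target, text))
    return "DONE: " + text.upper()


replies = power_cycle(fake_send, "m1.bc", "igpower", 5, 30, sleep=lambda s: None)
print(sent, replies)
```

For the ionization gauge, the waits of 5 and 30 seconds follow steps 2 and 4 above.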

HEB temperatures slowly rising

The head electronics boxes (HEBs) have thermoelectric devices on their power supply boards to protect the system against operating at too high a temperature. These devices (TEDs) will cut off the HEB power when the temperature exceeds 45 C. The devices also aid cooling. If instrument cooling has been off and the boxes have reached 45 C, it will take about 1.5 hours for them to cool and reach equilibrium after cooling has been restored. If interested, the MODS1 Red HEB cooldown curve is among the attachments at the bottom of this page.

Upon restoring power to the HEB, the IC process should be restarted. Doing this ought to restore power to the TED as well as the vacuum ionization gauge (IG; see above), but one should check. To check the power state of the TED and IG, issue the following commands (these are for MODS 1 Blue channel, but replace the 1 with a 2 for MODS2 and the "b" with an "r" for Red):

To check the TED power status:
  • type isisCmd --mods1 m1.bc tedpower (logged in as user lbto on an obsN machine)
  • if the thermoelectric device is on, this will return:
"DONE: TEDPOWER TEDPower=ON"

To check the IG power status:
  • type isisCmd --mods1 m1.bc igpower

To power cycle the TED, type the following commands:
  • isisCmd --mods1 m1.bc tedpower off
  • isisCmd --mods1 m1.bc tedpower on

To power cycle the IG, type:
  • isisCmd --mods1 m1.bc igpower off
  • isisCmd --mods1 m1.bc igpower on

Resetting Comtrol Devices on Port Timeouts

Occasionally, we see problems with the serial port interfaces timing out. (This used to occur somewhat frequently when the UPS systems were not robust.) The Comtrol devices can be rebooted via firefox on mods1data or mods2data by opening the IP address of the device and clicking the "Reboot" button.

Rebooting the comtrol closes the TCP sockets and their associated serial ports, then does a warm restart (not a power cycle) of the port server proper. It is only indicated in very rare circumstances where the IE program halts because of an abrupt computer crash (e.g., someone pulls the plug on the computer rack) and there are “dangling” sockets. The reboot clears the ports so that, after the instrument host computer and the IE program are restarted, the IE program can connect to the TCP sockets on the comtrol unit in question.

Note that rebooting the comtrol is not necessary after recovering from a mods1data freeze-up, like those we have been seeing lately (IT 7285).

After rebooting the comtrol ports, it is necessary to restart the IE service (mods1 start ie or mods2 start ie)

For example, in Oct-2014, Olga saw the following complaints from the pokeMODS1 script:
mods1data:mods% pokeMODS1
Poking MODS1 Mechanisms...
Common focal plane:
*** ERROR: HATCH HATCH=TIMEOUT cannot write to 192.168.139.101:8001
DONE: CALIB CALIB=OUT
DONE: AGWX AGWXS=89.841
DONE: AGWY AGWYS=33
DONE: AGWFOC AGWFS=-2.000
DONE: AGWFILT AGWFILT=1 AGWFNAME='Clear'
*** ERROR: MSELECT MINSERT=TIMEOUT cannot write to 192.168.139.102:8012
*** ERROR: MINSERT MINSERT=TIMEOUT cannot write to 192.168.139.102:8012
*** ERROR: DICHROIC DICHROIC=TIMEOUT cannot write to 192.168.139.101:8002
Blue Channel:
DONE: BCOLTTFA BCOLTTFA=16853.3
DONE: BCOLTTFB BCOLTTFB=17997.3
DONE: BCOLTTFC BCOLTTFC=20052.3
DONE: BGRATING BGRATING=2 GRATNAME='G400L'
*** ERROR: BGRTILT1 BGRTILT1=TIMEOUT cannot write to 192.168.139.112:8010
DONE: BCAMFOC BCAMFOC=3329
DONE: BFILTER BFILTER=7 FILTNAME='ND1.5'
Red Channel:
DONE: RCOLTTFA RCOLTTFA=15846.5
*** ERROR: RCOLTTFB RCOLTTFB=TIMEOUT cannot write to 192.168.139.101:8004
*** ERROR: RCOLTTFC RCOLTTFC=TIMEOUT cannot write to 192.168.139.101:8005
*** ERROR: RGRATING RGRATING=TIMEOUT cannot write to 192.168.139.101:8006
*** ERROR: RGRTILT1 RGRTILT1=TIMEOUT cannot write to 192.168.139.101:8007
*** ERROR: RCAMFOC RCAMFOC=TIMEOUT cannot write to 192.168.139.102:8010
*** ERROR: RFILTER RFILTER=TIMEOUT cannot write to 192.168.139.102:8009
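
To see at a glance which comtrol units are affected, the TIMEOUT lines in output like the above can be tallied by IP address. This is a sketch: timeouts_by_host is a hypothetical helper, and the pattern is based only on the message wording seen on this page ("cannot write to" or "cannot read from" followed by IP:port).

```python
import re
from collections import Counter

# Matches e.g. "HATCH=TIMEOUT cannot write to 192.168.139.101:8001"
TIMEOUT_RE = re.compile(
    r"(\w+)=TIMEOUT cannot (?:read from|write to) (\d+(?:\.\d+){3}):(\d+)"
)


def timeouts_by_host(output):
    """Count TIMEOUT errors per IP address in pokeMODS-style output."""
    counts = Counter()
    for line in output.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


sample = """*** ERROR: HATCH HATCH=TIMEOUT cannot write to 192.168.139.101:8001
DONE: CALIB CALIB=OUT
*** ERROR: MSELECT MINSERT=TIMEOUT cannot write to 192.168.139.102:8012
*** ERROR: DICHROIC DICHROIC=TIMEOUT cannot write to 192.168.139.101:8002"""
counts = timeouts_by_host(sample)
print(counts)
```

Several timeouts on the same address point at that comtrol unit rather than at an individual mechanism.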

From any browser on a machine that is connected to the mountain network, open the IP address of each affected comtrol unit (see Comtrol Port assignments below) and click the Reboot button at the bottom of the page.

Comtrol Port assignments

Each of the two blue and two red IEB comtrol units has 8 ports. The port-to-microLYNX table is here: https://wiki.lbto.org/Instrumentation/MicroLYNX
IP address       Unit                      Ports
192.168.139.101  MODS1 Red IEB comtrol 1   8001-8008
192.168.139.102  MODS1 Red IEB comtrol 2   8009-8016
192.168.139.103  MODS1 Red IMCS comtrol    8017-8018
192.168.139.111  MODS1 Blue IEB comtrol 1  8001-8008
192.168.139.112  MODS1 Blue IEB comtrol 2  8009-8016
192.168.139.113  MODS1 Blue IMCS comtrol   8017-8018

192.168.139.201  MODS2 Red IEB comtrol 1   8001-8008
192.168.139.202  MODS2 Red IEB comtrol 2   8009-8016
192.168.139.203  MODS2 Red IMCS comtrol    8017-8018
192.168.139.211  MODS2 Blue IEB comtrol 1  8001-8008
192.168.139.212  MODS2 Blue IEB comtrol 2  8009-8016
192.168.139.213  MODS2 Blue IMCS comtrol   8017-8018
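
The assignments above can be turned into a lookup, so that an IP:port taken from a TIMEOUT message is translated into the unit to reset. This is only a sketch of the table on this page; COMTROL_UNITS and identify are hypothetical names.

```python
# IP address -> (unit name, TCP port range), copied from the table above.
COMTROL_UNITS = {
    "192.168.139.101": ("MODS1 Red IEB comtrol 1", range(8001, 8009)),
    "192.168.139.102": ("MODS1 Red IEB comtrol 2", range(8009, 8017)),
    "192.168.139.103": ("MODS1 Red IMCS comtrol", range(8017, 8019)),
    "192.168.139.111": ("MODS1 Blue IEB comtrol 1", range(8001, 8009)),
    "192.168.139.112": ("MODS1 Blue IEB comtrol 2", range(8009, 8017)),
    "192.168.139.113": ("MODS1 Blue IMCS comtrol", range(8017, 8019)),
    "192.168.139.201": ("MODS2 Red IEB comtrol 1", range(8001, 8009)),
    "192.168.139.202": ("MODS2 Red IEB comtrol 2", range(8009, 8017)),
    "192.168.139.203": ("MODS2 Red IMCS comtrol", range(8017, 8019)),
    "192.168.139.211": ("MODS2 Blue IEB comtrol 1", range(8001, 8009)),
    "192.168.139.212": ("MODS2 Blue IEB comtrol 2", range(8009, 8017)),
    "192.168.139.213": ("MODS2 Blue IMCS comtrol", range(8017, 8019)),
}


def identify(ip, port):
    """Return the unit name for ip:port, or None if it is not in the table."""
    unit = COMTROL_UNITS.get(ip)
    if unit is None:
        return None
    name, ports = unit
    return name if port in ports else None


print(identify("192.168.139.101", 8006))
```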

When fitsflush does not recover everything

As soon as the observer notices that the data transfer has stalled (index(last_file) < index(next_file)-1) they need to stop taking data. The first recourse is to run fitsflush:

  1. Pause the current exposure (the script itself cannot be paused)
  2. In the Command Window, type "blue fitsflush" or "red fitsflush", depending on which channel is stuck.
  3. Wait about a minute as the files are transferred. Sometimes all of the files transfer and the index number of the last file is still not that of the next file minus 1. This depends on the transfer disk on which it was stored. To confirm that all of the files have been transferred, check /newdata or the log that modsDisp prints to the screen as it displays each new file.
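
The stall condition quoted at the start of this section, index(last_file) < index(next_file)-1, is simple enough to state as code. This is a sketch with hypothetical function names; it takes the two index numbers as shown on the MODS dashboard and does not read them from anywhere itself.

```python
def transfer_stalled(last_index, next_index):
    """True when the last transferred index lags the next index by more than one."""
    return last_index < next_index - 1


def images_pending(last_index, next_index):
    """Number of images taken but not yet accounted for (never negative)."""
    return max(0, next_index - 1 - last_index)


print(transfer_stalled(41, 45), images_pending(41, 45))
```

As noted in step 3, the indices can still disagree after a complete transfer, so a nonzero count is a prompt to check /newdata rather than proof of missing files.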

If fitsflush does not transfer all of the files, then you may need to go a bit deeper, following these instructions which have been written for the case where the transfer has stalled on the blue channel of MODS2:

  1. Pause the script (stop taking new data!)
  2. Login to mods2data as user mods
  3. Stop the Caliban with stuck data (mods2 stop cb_blue)
  4. Start the Caliban agent with the stuck data (mods2 start cb_blue) – this opens the CB/Blue console on your desktop
  5. In the CB window, scrape 50 images from DISK1 and DISK2 by typing the following commands at the CB/Blue console prompt. This will take some time; when it runs out of images to transfer, it will complain in red text about no END card in the header.
    • recover 50 disk1
    • recover 50 disk2
  6. Quit out of cb_blue (type “quit”) then restart the Caliban
  7. Take a single test image (short bias) on the blue to confirm restart
  8. Resume observing

modsDisp not displaying latest images which are on /newdata

Note that, if the files are not on /newdata, and the difference between the index number for the next image and the last image, as displayed on the MODS dashboard, is greater than one, then this would be a problem of data transfer which "blue/red fitsflush" should resolve (see Red or Blue images not appearing in /newdata).

This discussion refers to the situation where the files are in /newdata but are not being displayed by modsDisp. modsDisp looks for the files MODSxx.new in /newdata which should contain the names of the latest MODSxx images.

Check the permissions on the *.new files in /newdata. This can be done from any machine or account that can read /newdata:

     ls -lag /newdata/*.new

the permissions should be "-rw-rw-rw-", e.g.:

lbto@obs4:17 % ls -lag /newdata/*.new
-rw-rw-rw-. 1 nobody 26 Oct 28 18:05 /newdata/MODS1B.new
-rw-rw-rw-. 1 nobody 26 Oct 28 18:04 /newdata/MODS1R.new
-rw-rw-rw-. 1 nobody 26 Sep  5 15:16 /newdata/MODS2B.new
-rw-rw-rw-. 1 nobody 26 Sep  5 15:15 /newdata/MODS2R.new
  • If all 4 files exist and have the correct permissions, nothing more needs to be done.
  • If all 4 files exist but one or more have "-rw-r--r--" permissions set, then the files are readonly to the MODSn data archiver, and cannot be overwritten for subsequent images.
    • Corrective:
      • Login to the archive machine as root (Stephen Hooper, and Riccardo Smareglia and Cristina Knapic in Trieste, can do this).
chmod 0666 /path/to/newdata/MODS*.new

This sets the permissions for all four files. It is harmless to reset permission on a file, so no need to be selective.

  • Never delete MODS*.new files with -rw-r--r-- permissions. While the MODSn data archiver will be able to write these files the first time, the default readonly permissions mean that all subsequent overwrites by the MODSn data archiver will be blocked.
  • If one or more of the MODSxx.new files are missing completely, then the following corrective is required:
Example: MODS1R.new and MODS1B.new are missing from /newdata

   Login to the archive machine as root
   cd /path/to/newdata
   touch MODS1R.new
   touch MODS1B.new
   chmod 0666 MODS1*.new

The "touch" commands create zero-length placeholder files, and chmod sets their permissions.

Without the "touch" steps, you have to wait until (in this example) MODS1 writes files of the missing type, and then have someone with root privileges on standby to change the file mode bits before the next images are written, which is not really practical.
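
The permission check described in this section can also be done from a script by any account that can read the directory. The sketch below is an illustration only: check_new_files is a hypothetical helper, the four file names are the ones listed above, and the demo runs against a temporary directory rather than /newdata.

```python
import os
import stat
import tempfile

EXPECTED = ("MODS1B.new", "MODS1R.new", "MODS2B.new", "MODS2R.new")


def check_new_files(directory):
    """Report each expected *.new file as 'ok', 'missing', or its octal mode."""
    report = {}
    for name in EXPECTED:
        path = os.path.join(directory, name)
        try:
            mode = stat.S_IMODE(os.stat(path).st_mode)
        except FileNotFoundError:
            report[name] = "missing"
            continue
        # 0o666 corresponds to -rw-rw-rw-
        report[name] = "ok" if mode == 0o666 else oct(mode)
    return report


# Demo: one good file, one read-only file, two missing.
with tempfile.TemporaryDirectory() as demo:
    for name, mode in (("MODS1B.new", 0o666), ("MODS1R.new", 0o644)):
        path = os.path.join(demo, name)
        open(path, "w").close()
        os.chmod(path, mode)
    demo_report = check_new_files(demo)
print(demo_report)
```

Anything other than "ok" needs the root-level correctives described above; the script only reports and changes nothing.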

MODS images not appearing in /newdata

The newdata disk is automounted to mods1data and mods2data. If newdata has been taken offline, it may not automatically get mounted again on these machines.

To check that the newdata disk is mounted to mods1data and mods2data, ssh to the machine as user mods (ssh mods@mods1data or ssh mods@mods2data), and then type: df -k. You should see something like the following output, where the last line shows that newdata is mounted as /archive/data. If this line is missing, then type cd /archive/data. This should cause the disk to be mounted.
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md1              15116744   9620684   4728160  68% /
/dev/md0                101018      9041     86761  10% /boot
none                    777844         0    777844   0% /dev/shm
/dev/md3             136599440 107925376  21735200  84% /lhome
192.168.39.20:/volume6/lbto_repository
                     18730543424 17146537920 1584005504  92% /nfs/192.168.39.20/repository
archive-in:/mnt/newdata
                     11521350568 161226288 11360124280   2% /archive/data
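
If this check is scripted, it is enough to look for /archive/data as the last field of any line of the df output; this also works when a long filesystem name wraps onto its own line, as in the listing above. The sketch below is illustrative: is_mounted is a hypothetical helper, and it is given the df text rather than running df itself.

```python
def is_mounted(df_output, mount_point="/archive/data"):
    """True if any line of df output ends with the given mount point."""
    for line in df_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == mount_point:
            return True
    return False


sample = """Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md3             136599440 107925376  21735200  84% /lhome
archive-in:/mnt/newdata
                     11521350568 161226288 11360124280   2% /archive/data"""
print(is_mounted(sample))
```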

Current Sensing Relay (CSR) failure

If the HEB cannot be turned on remotely and shows a breaker fault:

lbto@obs2:41 % isisCmd --mods1 m1.ie util heb_r on
DONE: UTIL HEB_R=ON HEB_R_BRK=FAULT

  • First try just to turn off and then back on the HEB.
lbto@obs2:43 % isisCmd --mods1 m1.ie util heb_b off
DONE: UTIL HEB_B=OFF HEB_B_BRK=UNKNOWN
lbto@obs2:44 % isisCmd --mods1 m1.ie util heb_b on
DONE: UTIL HEB_B=ON HEB_B_BRK=FAULT
  • If that doesn't work, check that the HE power toggle switches on the sides of each IUB are in the AUTO/REMOTE position. They need to be in this position for remote control. These are 3-position switches (REMOTE, ON, OFF) located on the IUBs next to the fiber connections for the HE. See the picture (on_off_remote_switch_m2r.png.jpg) of this switch for MODS2 red.
  • If they are, and the error persists, cycle the HE breakers in the IUBs (see breakers_m2r.jpg). A tripped breaker is hard to distinguish from one that is ON, so cycle them OFF and then back ON regardless of how they look.
  • If that does not help, then go to the HE power toggle switch on the side of the IUB and switch from AUTO to ON, just as a test to see if the HEs can be powered up through the circuit breaker. If they come on then, we need to troubleshoot the start logic (CSRs).
  • If the HE can be turned on with the switch in the ON position (bypassing remote operation), one can proceed in this way for the rest of the night, but the CSR should be replaced at the earliest opportunity. With the switch in the ON position, the HEB is susceptible to thermal power cycling ("thrashing") if there is a prolonged loss (> ~30 min) of glycol cooling: when the HEB warms to ~45 C, the thermal protection thermostat opens, shutting off AC power to the HEB. In "remote" mode, the hold-in/drop-out circuit ensures that the HEB stays powered off after its internal temperature drops back below ~45 C. With that circuit bypassed and the switch ON, the HEB powers back on once it has cooled and, unless instrument cooling has been restored, warms up again, powers off, and so on. We can operate safely in this manual override mode provided there is no prolonged loss of glycol cooling at the telescope. If cooling is lost, we advise manually powering off the HEB and not restarting it until cooling is restored. This must be done at the utility box on the instrument, with the telescope at zenith and the LDG rotator at 345 deg for MODS1 or the RDG rotator at 165 deg for MODS2 (the utility box access orientation), and with both the elevation and the LDG or RDG rotator locked out for safety.
  • The procedure for replacing a failed CSR is 603s104b.
  • ITs on this topic: IT 4351, 4671, 5973, 6866, 7200, 8448, 8782.
  • on/off/remote switch in the MODS2 IUB:
    on_off_remote_switch_m2r.png.jpg
  • HEB breakers in the MODS2 IUB:
    breakers_m2r.jpg
  • red/green LEDs:
    • red LED signifies either:
      • a fault state (HEB cannot be switched remotely and shows a breaker fault); or
      • the HEB is on (switched remotely or remote-switch bypassed)
      • CSR1_bad_HEB_faultstate.png
    • green LED signifies:
      • the CSR is working; and
      • the HEB is off
      • green LED flashes on momentarily during the hold-in period when switching the relay on
      • CSR1_replaced_HEB_off.png
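
When checking the HEB from a script rather than by eye, the breaker state can be pulled out of a reply line such as "DONE: UTIL HEB_R=ON HEB_R_BRK=FAULT". The bash sketch below does only that string handling. The function name breaker_state is hypothetical, and it assumes the reply keeps the KEY=VALUE layout shown in the transcripts above.

```bash
# Sketch only: print the value of the first *_BRK=... field in an isisCmd
# reply line (e.g. OK, FAULT or UNKNOWN). Returns 1 if no such field exists.
# breaker_state is a hypothetical helper name.
breaker_state() {
    local reply="$1" word
    for word in $reply; do              # split the reply on whitespace
        case "$word" in
            *_BRK=*)
                echo "${word#*=}"       # keep the text after the '='
                return 0
                ;;
        esac
    done
    return 1
}
```

For example, breaker_state "$(isisCmd --mods1 m1.ie util heb_r)" prints FAULT for the fault case shown at the top of this section.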

Readout anomalies

2013 September (IT #4813 and IT #4814):

The sieve mask images (sieveSnap) that are taken each afternoon should be checked for any readout anomalies. The first images after summer shutdown (20130902) showed obvious horizontal smearing of the spots within +/- 300 pixels from the center of the image, X=1544 for the 3K x 3K region of interest (ROI) (see attached image, mods1r.2013_0706.0003_v_0901.0006.jpg, which compares pre- and post-summer shutdown images, zoomed about the region where the horizontal smearing starts). Upon further investigation, this smearing occurred only for the 3K x 3K and 4K x 3K regions of interest; the sieve spots on the 8K x 3K and 1K x 1K ROI images were round and sharp (see the attached image, mods1r_readout.png).

This problem required assistance from the MODS team, specifically Rick Pogge and Jerry Mason.

Troubleshooting proceeded as follows:
  • stopped all mods services and restarted these, following the procedures for Starting Up the Instrument Control Software.
  • obtained sieve mask images using different regions of interest, and noted for which ROIs the horizontal smearing appeared.
  • power cycled the red HEB (see below)
  • rebooted the red IC (quit and start the IC program on the M1.RC computer).
  • swapped the fiber connectors at the DOS computers in CRB to see whether the problem arises from the host computer (i.e. the sequencer) or the HEB.
    • in this case, found that it arises from the host computer.
  • swapped the sequencer card from the M1.RC computer currently in use to the spare IC.
    • this did not help
  • installed a new sequencer card into the original M1.RC computer.
    • this did help
  • On 11 September 2013, a new sequencer card was installed in the original M1.RC computer. The spare IC remains in the rack in CRB, ready to be used if and when necessary.
  • On 23 September 2013, there was a problem with quadrant 1 showing only bias levels, with no response to light. James Riedl reseated boards in the red HEB. Since that time, no further problems have been seen with quadrant 1, but we saw a different manifestation of a problem with the red sequencer: this time, the sieve spots were not horizontally smeared; instead, there were horizontal bands covering all columns and a range of rows about the central one.
  • The two MODS2 "flight" sequencer boards were shipped from OSU, and, on 30 September 2013, one of these was installed in M1.RC. About 20 images were taken with both channels and the images look good.

Sensor Bit Patterns

The following is provided only to help interpret the bit patterns. It is not a troubleshooting procedure, just a guide to reading what the sensors are telling us.

MSELECT

Read from left to right: the first bit is the in-position bit and the remaining bits encode the cassette position, with positions 1-12 being represented by bit patterns that sum to the position-minus-one and positions 13-24 being represented by bit patterns that sum to the position.

Examples:

Cassette Position   BITS
3                   100010
17                  110001
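
As a worked illustration of the rule above, the bash sketch below decodes a 6-bit MSELECT pattern. The function name decode_mselect is hypothetical. It assumes, from the description and the two examples only, that the five position bits are a plain binary number, that values 0-11 map to positions 1-12 and values 13-24 map to positions 13-24; any other value is rejected.

```bash
# Sketch only: decode an MSELECT bit pattern such as 100010.
# Assumption (from the text above): bit 1 is the in-position bit and the
# remaining five bits are a binary value; 0-11 -> positions 1-12 (value + 1),
# 13-24 -> positions 13-24 (value itself). decode_mselect is hypothetical.
decode_mselect() {
    local bits="$1"
    case "$bits" in
        [01][01][01][01][01][01]) ;;
        *) echo "invalid bit pattern: $bits" >&2; return 1 ;;
    esac
    local in_position="${bits:0:1}"
    local value=$(( 2#${bits:1} ))      # last five bits, read as binary
    local position
    if [ "$value" -le 11 ]; then
        position=$(( value + 1 ))
    elif [ "$value" -ge 13 ] && [ "$value" -le 24 ]; then
        position=$value
    else
        echo "no cassette position for value $value" >&2
        return 1
    fi
    echo "in_position=$in_position position=$position"
}
```

With the two examples from the table, decode_mselect 100010 prints "in_position=1 position=3" and decode_mselect 110001 prints "in_position=1 position=17".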

MINSERT

1010 mask and grabber are in STOW and by the cassette
0101 mask and grabber are in the FPU and near the FPU
0010 mask position is EMPTY and the grabber is by the cassette
Any other pattern indicates a problem, as for example when the grabber detached from the mask in MODS2 (IT 5834):

isisCmd --mods1 m1.ie minsert rdbits
DONE: MINSERT SLITMASK=5 BITS=1001 b24-b21

A good example: When the imaging mask (mask 3) is in the focal plane, then:
lbto@obs4:2 % isisCmd --mods2 m2.ie mselect rdbits
DONE: MSELECT SLITMASK=3 BITS=100010 b36-b31
lbto@obs4:3 % isisCmd --mods2 m2.ie minsert rdbits
DONE: MINSERT SLITMASK=3 BITS=0101 b24-b21
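
The three valid MINSERT patterns listed above can be kept in a small lookup, sketched here in bash. The function name decode_minsert is hypothetical. Anything outside the three listed patterns is reported as unexpected and returns a non-zero status.

```bash
# Sketch only: translate a MINSERT bit pattern into the meaning listed in
# this section. decode_minsert is a hypothetical helper name.
decode_minsert() {
    case "$1" in
        1010) echo "mask and grabber are in STOW and by the cassette" ;;
        0101) echo "mask and grabber are in the FPU and near the FPU" ;;
        0010) echo "mask position is EMPTY and the grabber is by the cassette" ;;
        *)    echo "unexpected pattern $1"; return 1 ;;
    esac
}
```

For the readings shown above, decode_minsert 0101 reports the mask in the FPU, while decode_minsert 1001 (the IT 5834 case) is flagged as unexpected.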

GRATING

bit-pattern   decoding          grating position   grating name
p10xx
1000          0 + 1             1                  flat
1010          2^0 + 1           2                  grating
1100          2^1 + 1           3                  prism
1110          2^1 + 2^0 + 1     4                  empty
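
The table can be reproduced with a few lines of bash. The function name decode_grating is hypothetical. The sketch assumes, from the four rows listed and nothing else, that the first bit is 1, the last bit is 0, and the middle two bits are a binary value to which 1 is added to get the grating position.

```bash
# Sketch only: decode a 4-bit GRATING pattern using the table above.
# Assumption: pattern is 1 b1 b0 0 and position = (2*b1 + b0) + 1.
# decode_grating is a hypothetical helper name.
decode_grating() {
    local bits="$1"
    case "$bits" in
        1[01][01]0) ;;
        *) echo "unexpected pattern: $bits" >&2; return 1 ;;
    esac
    local position=$(( 2#${bits:1:2} + 1 ))   # middle two bits, plus one
    local names=(flat grating prism empty)
    echo "position=$position name=${names[position-1]}"
}
```

For example, decode_grating 1100 prints "position=3 name=prism".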

Files have 3-, not 4-, digit index numbers

After rebooting M1.RC and M2.BC, we regularly see MODS filenames with 3-digit rather than the usual 4-digit index numbers. The files are OK; however, for uniformity we would like to restore the usual 4-digit index numbers.

One workaround to force a 4-digit index number is to issue the isis command (the example below will force the 4-digit index number for MODS1 Red channel files):

isisCmd --mods1 m1.rc use mount 3

It will return the status and then hang; you'll need to type Ctrl-C to return to the prompt.
STATUS: USE Path=UNIX:\CB\/lhome/data-

^CisisCmd interrupted by Ctrl+C

  • “use mount 0” selects the first IC floppy drive,
  • “use mount 1” selects the second IC floppy drive,
  • “use mount 2” selects the IC hard drive, and
  • “use mount 3” selects whatever path Caliban is set up to present to the IC as its path.
These are low-level diagnostic commands that observers are not meant to see, but “use mount 3” does force the exp counter to relax the DOS 3-digit extension rule, so we’re using it for now. Click "Update" at the lower left of the dashboard on the GUI to verify that the change has been made.

Because “use mount” is a low-level command, it predates the ICIMACS v.2x rule that states that a completed task must return a DONE: keyword. As a result isisCmd, which is v.2x compliant, never gets the DONE: keyword it is expecting and waits forever, until you interrupt it with Ctrl-C.
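
If the command needs to be issued from a script, where nobody is present to type Ctrl-C, one option is to bound the wait. The bash sketch below uses the GNU coreutils timeout utility, which is assumed to be installed. The wrapper name run_with_timeout is hypothetical. It also assumes that ending isisCmd with timeout's default TERM signal is as harmless as interrupting it with Ctrl-C, which has not been verified here.

```bash
# Sketch only: run a command but give up after a number of seconds.
# Relies on GNU coreutils `timeout`, which exits with status 124 when the
# time limit is reached. run_with_timeout is a hypothetical helper name.
run_with_timeout() {
    local seconds="$1"; shift
    local rc=0
    timeout "$seconds" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "command stopped after ${seconds}s without finishing" >&2
    fi
    return "$rc"
}
```

For the example above this would be: run_with_timeout 10 isisCmd --mods1 m1.rc use mount 3. A return status of 124 is then the expected outcome, not an error.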

mods1data freezes

Every few months we experience this problem where mods1data freezes (IT 7285).

The symptoms are that the GUI freezes up. An IT alert email is sent out with the subject:

PROBLEM - no SSH response mods1data.mods.lbto.org

The only recourse is to reboot mods1data. After that, you'll need to restart the services that run on mods1data and you should also restart the MODS GUI in order to reset any state flags left over when mods1data froze.

Go onto the Raritan if possible and connect to the mods1data console. (If for some reason you cannot get to the Raritan, you can issue these from an ssh session as user mods on mods1data; however, when you start the calibans, each will run in a terminal window on your screen, and exiting from these terminal windows will take down the caliban processes.) Then issue the commands:

Guide images not being displayed (this is under construction)

If guide images suddenly stop being displayed:
  1. Related IT 8356
  2. Open a VNC session to the azcam computer: vncviewer azcamserver (lbtguider)
    1. Look for errors in the terminal windows. The 5g and 5w windows run the MODS1 guide and WFS cameras and the 6g and 6w windows, the MODS2 guide and WFS cameras.
    2. If there are errors you may need to either restart the azcam program (click the "star" icon) or reboot the azcam computer.
  3. It may also be necessary to restart MODS{1,2}-cam
    1. If MODS{1,2}-cam needs to be restarted, then this will have to be followed by a reboot of the azcamserver and a restart of GCS.
  4. Ask the OSA to restart GCSL and GCSR

Script hangs on "CLEARSTARS"

  • Check the status of the MODS tcs service
    • In a terminal window, run mods1 status or mods2 status
    • All of the services should be "Running". Note whether the lbttcs service is stopped or running.
    • If it is stopped, then restart it:
      • ssh mods@mods1data or mods@mods2data
      • mods1 start tcs or mods2 start tcs

Ne lamp not working

Sometimes the Ne lamp fails to turn on. While we have seen this problem with both MODS, it seems to be happening much more frequently in MODS2 at this time (Jan 2022). Since the Ne lamp is often used in combination with Hg[Ar], the failure of the Neon lamp to come on may escape notice. The figure below compares raw MODS2 dual grating spectra taken with both Ne and Hg (top) and with Hg only (below). The many closely-spaced Ne lines in the right (bluer) half of the spectrum are missing when the Ne lamp failed to come on.

NeLamp_not_working.png

IMCSLOCK fails

The IMCSLOCK can fail for several reasons. Follow the flowchart to recover:

When the IMCSLOCK times out, the script pauses with the query: abort, retry or ignore?
  • Enter "r" to retry. Most of the time, in particular at low elevations, an IMCSLOCK timeout occurs because the loop hasn't closed within the timeout period, and a "retry" will give it the extra time needed.
  • If that doesn't work, then
    • Is the laser on? Type irlaser in the GUI command window. You should see that the laser is ON, ENABLED and the power out (POUT ~= 1.0 mW).
    • If the laser is ON but DISABLED: did you wake MODS?
    • If the laser is ON, ENABLED and POUT = 0.0, then try to set the power:
      • Type irlaser power 1.0 in the command line window of the GUI.
      • If that doesn't work, then a full power cycle of the lamp and laser box is indicated. In the command line window of the GUI, type:
        • irlaser disable
        • irlaser off
        • In a terminal window, type isisCmd --mods# m#.ie util llb off
        • Wait 10 seconds...
        • Then type isisCmd --mods# m#.ie util llb on
        • Back in the GUI command line window:
        • irlaser on
        • irlaser enable
        • irlaser power 1.0
  • If a full power cycle of the lamp and laser box hasn't resolved the problem (give it 2-3 tries...), then there may be something else.
  • Log in to mods#data as user mods:
    • Type imcsTools start, which will launch 4 windows: a table of counts per quadrant and an IR laser "bulls-eye" for each channel, blue and red. The bias level counts are ~200, and when the shutter is open and the laser is detected on the quad cell, counts go up to 1000 - 16838 per quadrant. The counts in the red channel are usually much lower than in the blue, so don't be alarmed by that. If the shutter is open and the laser is on and enabled with a power output of about 1 mW, yet counts remain around 200, then something may be blocking the light (we have had this problem with moths blocking the light). You can try increasing the power output to 2 mW to confirm. If an obstruction is suspected, then an instrument specialist may need to open up the bottom hatch and investigate. In the best case, a moth will fly out and the problem is resolved. (Try to include photos or a better description here).
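
The terminal-window part of the lamp and laser box power cycle described above (llb off, wait 10 seconds, llb on) can be sketched in bash as follows. The function name cycle_llb and the ISISCMD override variable are hypothetical; ISISCMD exists only so the sequence can be dry-run with echo. The irlaser disable/off commands before it, and the irlaser on/enable/power commands after it, still have to be typed in the GUI command line window.

```bash
# Sketch only: power cycle the lamp/laser box (LLB) of MODS1 or MODS2.
# cycle_llb and the ISISCMD variable are hypothetical names; by default the
# real isisCmd is called with the arguments shown in the procedure above.
cycle_llb() {
    local n="$1"                        # MODS number: 1 or 2
    local wait_s="${2:-10}"             # pause between off and on, in seconds
    local isis="${ISISCMD:-isisCmd}"
    case "$n" in
        1|2) ;;
        *) echo "usage: cycle_llb 1|2 [wait_seconds]" >&2; return 2 ;;
    esac
    "$isis" "--mods$n" "m$n.ie" util llb off || return 1
    sleep "$wait_s"
    "$isis" "--mods$n" "m$n.ie" util llb on
}
```

A dry run that only prints the two commands for MODS2: ISISCMD=echo cycle_llb 2 0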

MODS log files

Machine                      Filename                                  Contents
mods1 (instrument control)   /home/mods/Logs/agw.log
mods1                        /home/mods/Logs/mmc.log
mods1                        /home/mods/Logs/Env/mods1.20140106.log    Environmental sensor data - temps and pressures
mods1data                    /Logs/Caliban/cb_blue.log                 details for file transfers
                             /Logs/Caliban/cb_red.log
mods1data                    /Logs/Data/20140106.log                   Project, channel filename, object info.
mods1data                    /Logs/ISIS/isis.20140106.log              IS runtime log

Attachment                              Size        Date                  Who        Comment
CSR1_bad_HEB_faultstate.png             289 K       01 Feb 2023 - 23:01   OlgaKuhn   CSR (MODS1 Blue) when the HEB was showing a FAULT and could not be remotely switched on.
CSR1_replaced_HEB_off.png               241 K       01 Feb 2023 - 23:02   OlgaKuhn   CSR1 (MODS1 Blue) working. HEB switched off.
HEB_cooling_and_TED.png                 369 K       01 Oct 2019 - 14:38   OlgaKuhn   HEB cooldown curve, from the MODS1 Comm Report
MODS IC Data Taking Configuration.pdf   106 K       15 Dec 2022 - 16:47   OlgaKuhn   Instructions for configuring a MODS CCD Control Computer to work with a specific channel (MODS1R, MODS2B, ...)
MODSrack.jpg                            72 K        02 Oct 2019 - 21:39   OlgaKuhn   MODS Raritan in CRB
NeLamp_not_working.png                  2 MB        31 Jan 2022 - 23:19   OlgaKuhn
breakers_m2r.jpg                        5 MB        17 Apr 2018 - 09:06   OlgaKuhn   Breakers in the MODS2 IUB
gwc.sh                                  438 bytes   04 Nov 2019 - 19:51   OlgaKuhn   Shell script to report number of modsN files of a given UTdate in a listing downloaded from the archive
on_off_remote_switch_m2r.png.jpg        6 MB        17 Apr 2018 - 09:08   OlgaKuhn   HEB on/off/remote switch in the MODS2 IUB
Topic revision: r53 - 02 Feb 2023, OlgaKuhn