Activating the desired version of the TCS Build (make it the "current" build)

See Swapping the IRTC for swapping the IRC/IRS subsystem

Before you begin:
  • Downtown, the cluster computer is named tcs-test
  • On the mountain, the cluster computers are named tcs1 and tcs2
  • Become user "telescope" on the TO station (robs1 downtown, obs1 mountain) before performing these operations.
  • multixterm is useful for executing a command on multiple computers at the same time. There is a helpful script (multixterm.csh) in /home/telescope that executes:
  multixterm -xc "ssh tcs@%n" tcs1 tcs2

Using Automated Tools

Use the TCSGUI to manage the TCS; this is the preferred method. If the TCSGUI doesn't work, or if the TCS does not behave as expected after it is started, use the Manual Tools section to stop and restart the TCS.

1) Shut down the TCS and verify a completely clean shutdown

1.a) If working on the downtown system, tell the developers what you are doing

Tell them NOT to stop or to start the TCS on the downtown test system until they hear back from you.
1.b) If working on the mountain, CALL THE MOUNTAIN.

Tell the operator what you are doing. Tell them NOT to stop or to start the TCS until they hear back from you.
1.c) Use the TCSGUI to stop all running GUIs except your version of the TCSGUI by clicking the "Stop all GUIs" button.

Note the messages in the xterm windows that will open for each machine in the cluster. GUIs not owned by user tcs cannot be stopped. Use the Linux command "killall GUI_name" as root on the appropriate computer to stop all remaining GUIs. Check for GUIs on all machines in the cluster using the command "netconfig ps" in multixterm.
1.d) Use the TCSGUI to stop all running subsystems by clicking the "Stop all subsystems" button in the "Subsystems" box.

Wait for the command to complete (the display box above the "Stop all subsystems" button will change from "Busy" to "Idle"). Check for any subsystems still running. If any subsystems are still running, kill the subsystem using the TCSGUI. Click the "Subsystem status" button and verify that no subsystems are running on any cluster computers. All subsystem boxes should be showing a red background with text of "Stopped".
1.e) Use the TCSGUI to stop the network servers on all cluster machines by clicking the "Stop all" button in the Network Servers box.

A window will pop up for each computer as its servers (syslogserver, networkserver, rpcserver, gshmserver) are stopped. Double check that no servers are running by using the Status box "Status" button with a computer name in the edit box to its right. If any network server will not stop, use "killall -9 server_name" on the appropriate computer. The TCSGUI background will turn blue.
1.f) Close the TCSGUI.

2) Switch to the desired build

Use "lbtswitcher" to display all the available builds. (If the switcher does not show the build you want, its configuration file may need to be updated. Check with a member of the software group in this case.) Select the one you want, and click the "Switch" button to effect the change. The switcher will open an xterm window to one of the tcs servers, change the build on the cluster, and then open an xterm window to tcs2 and switch the build there. Close the switcher.

3) Start the TCS network servers (syslogserver, networkserver, rpcserver, gshmserver)

3.a) Start the TCSGUI.

It will have a blue background but it will display the name of the newly selected build.
3.b) Start the network servers on all computers.

Use the TCSGUI to start the network servers on all computers in the cluster by clicking the "Start all" button in the Network Servers box. This operation will clear the contents of shared memory on each computer. The first server to be started is the syslogserver on tcs2. No other computer should run syslogserver. A window will pop up as the servers are started on each computer. The window should show the three servers. The TCSGUI background should be gray. If it is yellow, the TCSGUI is out of sync with the version of the TCS servers that were just started. This is a problem; contact someone in the software group.

4) Start subsystems

Use the TCSGUI to start the four subsystems LSS, DDS, IIF, and ENV. The other subsystems should be started by the operator as needs dictate. If desired, all the subsystems may be started in sequence by clicking the "Start all subsystems" button.

5) Notify the appropriate users

  • If working downtown, send an email to tcsdiscussion and tell the developers the new build is ready
  • If working on the mountain, send an email to telescopework and instruments, and call the operator to tell them the new build is ready

Using Manual Tools

Always use the Automated Tools instructions to manage the TCS unless you experience problems doing so.

1) Shut down the TCS and verify a completely clean shutdown

1.a) If working downtown, tell the developers what you are doing.

Tell them NOT to stop or to start the TCS on the downtown system until they hear back from you.
1.b) If working on the mountain, CALL THE MOUNTAIN to get authorization to stop TCS and to continue the installation process.

Tell the operator what you are doing. Tell them NOT to stop or to start the TCS until they hear back from you.

As telescope on an obs computer, you can use multixterm (or the multixterm.csh script) to stop and start TCS on all the computers at once.

Note: if multixterm doesn't work, use separate ssh sessions to each of the mountain computers and type the multixterm commands in each of those windows when a multixterm command is mentioned.
1.c) Check to see what is running using multixterm
> netconfig ps
1.d) Stop the TCS, clear TCS state, and clear shared memory.

Stop all GUIs on all servers and all observer stations as shown from the above "netconfig ps" command with "killall GUI_Name" commands. This includes DDViewer, but does not include subsystems or networkserver, rpcserver, gshmserver, or syslogserver. You need those for the next step.

In the TO station window stop all subsystems
> netconfig stop all

Repeat step 1.c to check for still running subsystems. Note: If some subsystems do not stop, do "netconfig kill subsystem_name" in just one window for each one that did not stop.

In the multixterm entry window stop the network servers, clear shared memory, and clear TCS state
> netconfig stop; sleep 1; gshmconfig -z; netconfig -z;  rm /var/tmp/*.conf 

The 'gshmconfig -z' will fail with an error message if any processes are still using shared memory. These processes must be stopped. To find the processes enter the command

> /usr/sbin/lsof | grep 00036

The first column shows the process name that is attached to the segment, and the second column shows the PID. Stop it using "kill -9 pid" on the computer where it is running. Note that only the owner of the memory segment can see this. root can see all, so you may need to run the "lsof" command as root.

After stopping processes attached to the shared memory segment, repeat the "gshmconfig -z" command and verify it does not fail.
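
As a sketch of how the lsof output can be reduced to a list of processes to stop, the function below reads lsof-style lines and prints each distinct PID once, together with its process name. It relies only on the first two columns (process name and PID) as described above. The function name attached_pids is hypothetical, and the sample lines are invented for illustration; they are not real lsof output from the cluster.

```shell
# Print "PID NAME" once for every process found in lsof-style input on stdin.
# Only column 1 (process name) and column 2 (PID) are used.
attached_pids() {
    awk '!seen[$2]++ { print $2, $1 }'
}

# Invented sample lines standing in for "/usr/sbin/lsof | grep 00036".
sample='gshmserver 3332 tcs placeholder 00036
gshmserver 3332 tcs placeholder 00036
SomeGUI 4101 tcs placeholder 00036'

printf '%s\n' "$sample" | attached_pids
```

On a cluster computer the same function could be fed directly, for example "/usr/sbin/lsof | grep 00036 | attached_pids", and each PID printed would then be stopped with "kill -9 pid".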

In the tcs1 or tcs2 window...
> cd /lbt/tcs/
> ln -sTf new_build_dir current
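
The -T option matters here: without it, "ln -sf" on an existing "current" link that points to a directory would create the new link inside the old build directory instead of replacing "current". The following sketch shows the switch and a readlink check in a scratch directory. The function name switch_build and the build names build_A and build_B are made up for illustration; on the cluster the directory is /lbt/tcs/ and the build name is the real one.

```shell
# Switch the "current" symlink in a builds directory and print its new target.
# The body runs in a subshell so the caller's working directory is unchanged.
switch_build() (
    builds_dir=$1
    build=$2
    cd "$builds_dir" || exit 1
    [ -d "$build" ] || { echo "no such build directory: $build" >&2; exit 1; }
    # -T treats "current" as the link itself, never as a directory to descend into.
    ln -sTf "$build" current && readlink current
)

# Try it in a scratch directory with two made-up build names.
demo=$(mktemp -d)
mkdir "$demo/build_A" "$demo/build_B"
switch_build "$demo" build_A
switch_build "$demo" build_B
```

Running "readlink current" in /lbt/tcs/ after the real switch is a quick way to confirm that the link points at the intended build.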

2) Start the TCS servers (syslogserver, networkserver, rpcserver, gshmserver)

To save some typing, type the following in the multixterm window, but do not hit "enter" at this time.

> netconfig start 

In each computer's window, after waiting 5 seconds, hit "enter" to actually perform the "netconfig start".

If the network servers are all started at the same time, there will be conflicts in server numbers that will prevent the TCS from working.
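
The staggered start can also be pictured as a loop that starts one computer at a time with a pause in between. The sketch below is only an illustration of that timing: the helper names run and staggered_start and the DRY_RUN variable are hypothetical, and it assumes that "netconfig start" can be run through a non-interactive ssh session, which this procedure does not confirm. By default it only prints the commands it would run.

```shell
# Print a command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Start the network servers on each host in turn, pausing between hosts.
# $1 = delay in seconds, remaining arguments = host names.
staggered_start() {
    delay=$1; shift
    first=yes
    for host in "$@"; do
        if [ "$first" = no ]; then run sleep "$delay"; fi
        first=no
        run ssh "tcs@$host" netconfig start
    done
}

# Dry run for the mountain cluster with the 5 second wait described above.
staggered_start 5 tcs1 tcs2
```

Setting DRY_RUN=0 would execute the commands instead of printing them. The manual procedure above remains the reference method.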

In the multixterm window, enter

> netconfig ps 

and you should see something like

 3325 pts/1    00:00:00 syslogserver
 3326 pts/1    00:00:00 networkserver
 3330 pts/1    00:00:23 rpcserver
 3332 pts/1    00:00:00 gshmserver

for each computer. Note that syslogserver will only start on tcs-test and tcs2. If there are multiple entries for any process, kill all the processes with "killall -9 server_name". Retry the "netconfig start" and "netconfig ps".
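
Duplicate entries can be easy to miss by eye. As a small sketch, the function below counts the process names in the fourth column of a listing like the one above and prints any name that appears more than once. The function name duplicate_servers is hypothetical, and it assumes the "netconfig ps" output has the four-column layout shown above. The sample reuses that listing with one extra rpcserver line added to illustrate a duplicate.

```shell
# Print the name of every process that appears more than once in
# "netconfig ps"-style output (PID, TTY, TIME, NAME columns) read from stdin.
duplicate_servers() {
    awk 'NF >= 4 { count[$4]++ }
         END { for (name in count) if (count[name] > 1) print name }' | sort
}

# Listing from above plus one invented extra rpcserver entry.
sample=' 3325 pts/1    00:00:00 syslogserver
 3326 pts/1    00:00:00 networkserver
 3330 pts/1    00:00:23 rpcserver
 3332 pts/1    00:00:00 gshmserver
 3391 pts/1    00:00:00 rpcserver'

printf '%s\n' "$sample" | duplicate_servers
```

An empty result means no process name is repeated. On a cluster computer the function could be used as "netconfig ps | duplicate_servers".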

Verify shared memory looks good.

In the multixterm entry window...

> gshmconfig -l

Make sure computers are listed in the same order on all computers.

Verify the shared memory is operating correctly

> gshmmonitor -oa

This writes a list every 10 seconds showing the blocks sent in that 10 seconds. The list should be empty when no subsystems are running. Use Ctrl-C to exit from the "gshmmonitor -oa" command. If the list shows all 1024 blocks (0 - 1023), the gshmserver process on system ID 0 must be stopped and restarted. To find system 0, use the output from the "gshmconfig -l" command, and on the computer listed for ID 0 do

> gshmconfig stop
> gshmconfig start

Then check the "gshmmonitor -oa" command again to see if the block list is empty.

Verify the rpcservers started correctly

> rpcconfig -l

This should generate something like

ADDRESS         #FUNCTIONS      #ALIASES
----------      ----------      ---------
192.168.3.16        0               0

If an error message is printed, the network servers must be stopped and restarted on that computer.

3) Start the TCS subsystems and GUIs.

3.a) If working downtown

Start the TCS subsystems and GUIs to ensure all is well. When finished, log out.

Send an email notification to "tcsdiscussion" regarding the new build on the downtown test cluster.
3.b) If working on mountain

Start the LSS, DDS, IIF, and ENV subsystems. Notify the operator that the telescope is theirs and that they should start the TCS subsystems and TCS GUIs.

While these processes are happening, watch the machine windows to ensure no errors or unexpected behavior occur.

Send an email notification to "telescopework" and "instruments" regarding the new build on the mountain.
Topic revision: r24 - 22 Aug 2019, PetrKubanek