TCS Reflective Memory Troubleshooting

The LBT Reflective Memory system provides the mechanism for storing state and status for all subsystems. It consists of definition files, and the DataDictionary, ReflectiveMemory and SetValue parts of the Common Software.
All computers have local access to the same data. The local memory is implemented as a shared memory segment, and a network daemon keeps all the segments updated. Only the owning subsystem may write its reflective memory, but anyone can read it. (But see DDAccess below.)

Troubleshooting

A problem with the reflective memory is usually manifested as a GUI turning red because values are stale, or a GUI freezing, or in rare instances a single value not updating when it should.

As of 2013, the only problems we see with this system are:
  • timeouts related to deadlocks that occur when the token packet is received out of order. The system recovers from these errors automatically by restarting.
  • permissions problems, which can occur if servers are started by different users
  • machines that "vanish" from the shared memory ring and cause the gshmserver to get stuck

The tools used for diagnosing problems with the reflective memory are gshmconfig and gshmmonitor.

gshmconfig

The gshmconfig program provides configuration and control of the reflective memory system.
This program should not be used by the mountain staff; they should always use the TCSGUI.
gshmconfig may be used by experts under certain conditions.
Usage: gshmconfig command subsystem
Configure the gshm server for global memory

Command         Description
-------         -----------
help, -h        Print usage information
list, -l        List all current parameters
start           Start the server
stop            Stop the server
broad, -b       Get the broadcast address for gshm to use
key, -k         Get the key for the shared memory
size, -s        Get the size of the shared memory
reset, -r       Reset the mutexes in the memory
inspect, -i     Inspect the mutexes in the memory
remove, -m      Remove the shared memory segement
zero, -z        Zero the shared memory segement
ZERO, -Z        Zero the shared memory segement even if in use
param, -p       Set GshmServer parameters (in any order; unspecified will not be changed)
                  -v verbose_flag -t timeout (ms) -l late_data_threshold (ms) -m max_lock_tries -d token delay (us)

To determine which machines have been configured, are on-line running the TCS, and are connected to the shared memory network,
log in to the TO station (obs1) as telescope and run the program:

 ssh telescope@obs1
 [telescope@obs1 ~]$ gshmconfig -l

 KEY:    54
 SIZE:   1024
 ID       Host
 --      -------
 0       tcs1
 1       tcs2
 2       tcs3
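
The check described above can be automated. The following is a minimal sketch, assuming the listing has the layout shown above (name/value lines such as KEY and SIZE, then an ID/Host table); the function names and the expected-host list are illustrative and are not part of the TCS software.

```python
def parse_gshm_list(text):
    """Parse `gshmconfig -l` style output into (params, hosts).

    params maps names such as KEY and SIZE to ints; hosts maps ID -> hostname.
    The layout is assumed from the sample listing in this page.
    """
    params = {}
    hosts = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("--"):
            continue  # skip blank lines and the dashed rule
        if ":" in line:
            name, _, value = line.partition(":")
            params[name.strip()] = int(value)
            continue
        fields = line.split()
        # Host rows are "<numeric id> <hostname>"; the header row is skipped.
        if len(fields) == 2 and fields[0].isdigit():
            hosts[int(fields[0])] = fields[1]
    return params, hosts


def missing_hosts(hosts, expected):
    """Return the expected hostnames that do not appear in the parsed list."""
    return sorted(set(expected) - set(hosts.values()))


SAMPLE = """\
KEY:    54
SIZE:   1024
ID       Host
--      -------
0       tcs1
1       tcs2
2       tcs3
"""

params, hosts = parse_gshm_list(SAMPLE)
# The expected list here is hypothetical; use the machines you know are up.
print(missing_hosts(hosts, ["tcs1", "tcs2", "tcs3", "tcs4"]))
```

An empty result means every expected machine appears in the listing.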

The gshmconfig program is also how you start/stop the TCS reflective memory system. (See Recovery below)

gshmconfig can be useful in debugging a "stuck" server. If the gshmserver has to be killed manually (netconfig stop doesn't work) and you see the machine being taken out of the shared memory ring, it may be waiting on a mutex. You can inspect the shared memory mutexes with the -i option of gshmconfig. If they are all set, clear them with the -r option and try the stop/start again.

gshmmonitor

The gshmmonitor executable is built as part of the Reflective Memory component of Common Software. gshmmonitor outputs information whenever a server is added or dropped from the reflective memory network. It also tracks drop, hello, and restart requests over the network. This tool can be run in the background to keep track of servers coming into and going out of the network.

Usage:
  gshmmonitor supports one flag argument. Use just one '-' and group all the letters together.
  't' shows when a token is received.
  'd' shows when data is received.
  'a' performs timing analysis on data packets.
  'c' shows reflective memory cycle time.
  'h' prints this message.
  'm' mail to user in tcs.conf when reflective memory is down.
  'o' prints output to screen.
  's' prints output to Syslog. 

This program is invoked on the command line as
$ gshmmonitor -o
where the "-o" argument allows diagnostic messages to be written to both standard output and the SYSLOG. Upon invocation, you should see

Fri Jun 28 18:02:05.819 2013 Recognized computer obs1 with id 0 on network
Fri Jun 28 18:02:05.820 2013 Recognized computer obs2 with id 1 on network
Fri Jun 28 18:02:05.820 2013 Recognized computer obs3 with id 2 on network
Fri Jun 28 18:02:05.820 2013 Recognized computer obs4 with id 3 on network
Fri Jun 28 18:02:05.821 2013 Recognized computer obs5 with id 4 on network
Fri Jun 28 18:02:05.822 2013 Recognized computer tcs1 with id 5 on network
Fri Jun 28 18:02:05.822 2013 Recognized computer tcs2 with id 6 on network
Fri Jun 28 18:02:05.822 2013 Recognized computer tcs3 with id 7 on network
Fri Jun 28 18:02:05.823 2013 Recognized computer tcs4 with id 8 on network
Fri Jun 28 18:02:05.823 2013 Recognized computer tcs5 with id 9 on network

The information written to SYSLOG includes a timestamp and the diagnostic messages. In practice, keep this program running, and become concerned when you see
Request to drop id 2 from id 3
Computer obs3 droppped from network!!
and you DO NOT see a corresponding
Recognized computer obs3 with id 2 on network.
Another bad thing to see is
Bad Network State.  Request to reset all gshm servers from id 3.
and then you DO NOT see the servers coming back on-line.
A further warning sign is
Computer obs1 vanished from network
which means the computer took itself out of the ring.
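
The watch-for-unmatched-drops rule above can be sketched as a small log scanner. This is an illustration only: it assumes the message wording shown in this section, and the function and pattern names are hypothetical.

```python
import re

RECOGNIZED = re.compile(r"Recognized computer (\S+) with id (\d+) on network")
# "drop+ed" tolerates spelling variants of "dropped" in the message text.
GONE = re.compile(r"Computer (\S+) (?:drop+ed|vanished) from network")
RESET = re.compile(r"Bad Network State")


def scan_monitor_log(lines):
    """Return (missing, reset_requested) after reading log lines in order.

    missing lists hosts that dropped or vanished and were not recognized
    again afterwards; reset_requested is True if a reset-all request was seen.
    """
    missing = []
    reset_requested = False
    for line in lines:
        recognized = RECOGNIZED.search(line)
        if recognized:
            host = recognized.group(1)
            if host in missing:
                missing.remove(host)  # host rejoined the network
            continue
        gone = GONE.search(line)
        if gone:
            host = gone.group(1)
            if host not in missing:
                missing.append(host)
            continue
        if RESET.search(line):
            reset_requested = True
    return missing, reset_requested


LOG = [
    "Recognized computer obs1 with id 0 on network",
    "Recognized computer obs3 with id 2 on network",
    "Request to drop id 2 from id 3",
    "Computer obs3 droppped from network!!",
    "Computer obs1 vanished from network",
    "Recognized computer obs1 with id 0 on network",
]
missing, reset_requested = scan_monitor_log(LOG)
print(missing, reset_requested)
```

Here obs1 rejoins after vanishing, so only obs3 remains in the missing list.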

The gshmmonitor can generate copious diagnostic messages in the SYSLOG. The diagnostic messages are controlled by a boolean variable, gshmdebug, in the tcs.conf file.
# GSHMMONITOR
emailcontact            string          cbiddick@lbto.org
gshmdebug               bool            false 

The emailcontact contains the email address of the person to be notified by the monitor program when a problem is detected (if the "-m" argument to the gshmmonitor is used).
Note: Please do NOT use the -m argument at this time.

DDAccess

DDAccess can write data to any reflective memory variable. It can also read a group of reflective memory variables repeatedly, and list reflective memory names.

DDAccess Version 1.1

Usage: DDAccess -w <name> <value>
Usage: DDAccess -r [-u <n>] <name1> <name2> <name3> ... <name10>
Usage: DDAccess -l [<sub>]
Usage: DDAccess -p
Usage: DDAccess -f <name>
Usage: DDAccess -h

Command                 Description
-------                 -----------
-w <name> <value>       Write reflective memory 'name' with 'value'.
                        'name' may be either public or private.
-r [-u <n>] <....>      Read reflective memory repeatedly. Without -u read once.
                        Optional 'n' (min 100) is the number of milliseconds between reads.
                        'namex' may be either public or private. Exit loop with ^C
-l [<sub>]              List private reflective memory names. Optional 'sub' will limit list to that subsystem.
-p                      List public names.
-f <name>               Check if 'name' exists. May be either public or private.
-h                      Print usage information.
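
The read form can be wrapped in a small helper when it is scripted. The sketch below only builds the argument list and enforces the limits stated in the usage text (a minimum interval of 100 ms, and at most ten names, which is an assumption read from the `<name1> ... <name10>` synopsis). The helper name and the variable names in the example are hypothetical.

```python
def ddaccess_read_args(names, interval_ms=None):
    """Build the argument list for a DDAccess read.

    With interval_ms=None the variables are read once; otherwise -u is added
    and DDAccess repeats the read until interrupted with ^C.
    """
    names = list(names)
    # Upper bound of 10 is assumed from the usage synopsis.
    if not 1 <= len(names) <= 10:
        raise ValueError("expected between 1 and 10 variable names")
    args = ["DDAccess", "-r"]
    if interval_ms is not None:
        if interval_ms < 100:
            raise ValueError("interval must be at least 100 ms")
        args += ["-u", str(interval_ms)]
    return args + names


# Hypothetical variable names, for illustration only.
print(ddaccess_read_args(["example.var1", "example.var2"], interval_ms=500))
```

The resulting list can be passed to `subprocess.run` on a machine where DDAccess is installed.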

ipcs / ipcrm

ipcs is a Linux command to see interprocess communication information. It shows the shared memory segments, the shared semaphores, and shared queues (which we do not use). Use ipcrm to remove items.

If you see shared memory segments owned by a user other than tcs, this could be a problem. The permissions are likely 777, so removal shouldn't be necessary, but to clean up you can stop all the GUIs on the machine and remove the segments. On restart, they will be created with the right user id.
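
The ownership check can be sketched as a filter over `ipcs -m` output. This assumes the column layout printed by the util-linux ipcs (key, shmid, owner, perms, bytes, nattch, status); the sample text and the function name are made up for illustration.

```python
def foreign_segments(ipcs_output, expected_owner="tcs"):
    """Return (shmid, owner) pairs for segments not owned by expected_owner."""
    found = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key; header and banner lines do not.
        if len(fields) >= 6 and fields[0].startswith("0x"):
            shmid, owner = fields[1], fields[2]
            if owner != expected_owner:
                found.append((int(shmid), owner))
    return found


# Illustrative text only; keys, ids and owners are made up.
SAMPLE = """\
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000036 98304      tcs        777        1024       4
0x00000037 98305      telescope  777        1024       0
"""

for shmid, owner in foreign_segments(SAMPLE):
    # ipcrm -m <shmid> removes a segment; stop the GUIs on the machine first.
    print(f"segment {shmid} is owned by {owner}; candidate for: ipcrm -m {shmid}")
```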

Example troubleshooting exercise

This is out of date as of October 2016.

0) You notice that the MCSGUI is not updating on the TO Station. On the TO Station, type the following in a terminal window to determine which servers the TO Station sees:
$ gshmconfig -l
You should get a list something like

KEY:    54
SIZE:   1024
ID       Host
--      -------
0       tcs1
1       tcs2

In the ideal case, tcs1 and tcs2 should each see all of the working servers and workstations, as well as themselves. If the TO Station does NOT list all of the working servers and workstations known to be on the network, then there is a problem. In this case, tcs2 is missing from the list.

The advice is first to use gshmconfig -l on all of the servers and workstations (or at least on the machine missing from the above list) to determine what they see, and then decide on the best course of action.

For our example, if the TO Station does not see a particular server, it may be because that server has been dropped off the shared memory network. It is now necessary to determine which subsystems are running on this particular server.

1) Invoke
$ netconfig -l
in a terminal window to obtain the subsystems and the associated servers.
In this example, we can see that MCS is running on tcs2:
subsystem  server         id  cmd ip           shm ip
---------  ------         --  ------           ------
PMCR       tcs1            5  192.168.58.16    192.168.58.16
OSS        tcs1            5  192.168.58.16    192.168.58.16
AOSL       tcs1            5  192.168.58.16    192.168.58.16
PSFR       tcs2            6  192.168.58.17    192.168.58.17
PCS        tcs2            6  192.168.58.17    192.168.58.17
MCS        tcs2            6  192.168.58.17    192.168.58.17
IIF        tcs2            6  192.168.58.17    192.168.58.17
PMCL       tcs3            7  192.168.58.18    192.168.58.18
LSS        tcs3            7  192.168.58.18    192.168.58.18
ENV        tcs3            7  192.168.58.18    192.168.58.18
PSFL       tcs4            8  192.168.58.19    192.168.58.19
GCSR       tcs4            8  192.168.58.19    192.168.58.19
GCSL       tcs4            8  192.168.58.19    192.168.58.19
ECS        tcs5            9  192.168.58.20    192.168.58.20 

2) From a terminal window on the TO Station, log onto the server missing from the list generated in Step 0 (tcs2) via ssh
$ ssh tcs2
3) On the server type
$ gshmconfig -l

This will show you the other machines/servers this machine sees. You will get a list something like this:

KEY:    54
SIZE:   1024
ID       Host
--      -------
0       tcs1

In the ideal case, the current machine should see all the other working servers (tcs1 and tcs2). If this machine does NOT list the other server (tcs2), then you will need to stop and restart the shared memory server on AT LEAST this machine.
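
The reasoning in this exercise (find the servers missing from a machine's view, then look up which subsystems run on them) can be sketched as follows. It assumes the `netconfig -l` column layout shown in step 1; the function names and the set of seen hosts are illustrative only.

```python
def parse_netconfig(text):
    """Parse `netconfig -l` style rows into (subsystem, server, id) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five columns with a numeric id in the third.
        if len(fields) == 5 and fields[2].isdigit():
            rows.append((fields[0], fields[1], int(fields[2])))
    return rows


def affected_subsystems(rows, seen_hosts, expected_servers):
    """Map each expected server missing from seen_hosts to its subsystems."""
    missing = [s for s in expected_servers if s not in seen_hosts]
    return {s: [sub for sub, server, _ in rows if server == s] for s in missing}


NETCONFIG = """\
subsystem  server         id  cmd ip           shm ip
---------  ------         --  ------           ------
PMCR       tcs1            5  192.168.58.16    192.168.58.16
PCS        tcs2            6  192.168.58.17    192.168.58.17
MCS        tcs2            6  192.168.58.17    192.168.58.17
PMCL       tcs3            7  192.168.58.18    192.168.58.18
"""

rows = parse_netconfig(NETCONFIG)
# Hosts that one machine reported in its `gshmconfig -l` listing (example values).
seen = {"tcs1", "tcs3"}
print(affected_subsystems(rows, seen, ["tcs1", "tcs2", "tcs3"]))
```

In this example tcs2 is absent from the view, so the subsystems hosted there (PCS and MCS in the abbreviated table) are the ones whose GUIs would be expected to go stale.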

Recovery

Stop and restart the shared memory server on the machine with the problem. You can do this with the TCSGUI by specifying the machine name in the "Network Servers" box at the top-left and doing a Stop and then a Start.

To stop and restart the network servers manually on the machine with the problem, see Using Manual Tools.

History

Some problems we have seen in the past that no longer occur:
  • machines drop from the shared memory network
  • heavy use of resources (i.e., a low percentage of idle time or a large "load average") causes problems
  • more than one server is using the same ID (seen in the gshmconfig -l output)
    Note that this is still possible, but will not happen if the system is started using the TCSGUI. The problem can occur if you issue netconfig start simultaneously on several servers.

Identification and Fix for the Problem of Machines Dropping Off the Distributed Shared Memory Network Sept-2006
Topic revision: r10 - 22 Aug 2019, PetrKubanek