TCS Reflective Memory Troubleshooting
The LBT Reflective Memory system provides the mechanism for storing state and status for all subsystems. It consists of definition files, and the DataDictionary
parts of the Common Software.
All computers have local access to the same data. The local memory is implemented as a shared memory segment, and a network daemon keeps all the segments updated. Only the owning subsystem may write its reflective memory, but anyone can read it. (But see
A problem with the reflective memory is usually manifested as a GUI turning red because values are stale, or a GUI freezing, or in rare instances a single value not updating when it should.
As of 2013, the only problems we currently see with this system are:
- timeouts related to deadlocks occurring when the token packet is received out of order. The system automatically recovers from these errors by restarting.
- sometimes permissions problems can occur if servers are started by different users
- machines can sometimes "vanish" from the shared memory ring and cause the
gshmserver to get stuck
The tools used for diagnosing problems with the reflective memory are
program provides configuration and control of the reflective memory system.
This program should not be used by the mountain staff; they should always use the TCSGUI.
gshmconfig may be used by experts under certain conditions.
Usage: gshmconfig command subsystem
Configure the gshm server for global memory
help, -h Print usage information
list, -l List all current parameters
start Start the server
stop Stop the server
broad, -b Get the broadcast address for gshm to use
key, -k Get the key for the shared memory
size, -s Get the size of the shared memory
reset, -r Reset the mutexes in the memory
inspect, -i Inspect the mutexes in the memory
remove, -m Remove the shared memory segement
zero, -z Zero the shared memory segement
ZERO, -Z Zero the shared memory segement even if in use
param, -p Set GshmServer parameters (in any order; unspecified will not be changed)
-v verbose_flag -t timeout (ms) -l late_data_threshold (ms) -m max_lock_tries -d token delay (us)
To determine which machines have been configured, are on-line running the TCS, and are connected to the shared memory network,
log in to the TO station (
and run the program:
[telescope@obs1 ~]gshmconfig -l
program is also how you start/stop the TCS
reflective memory system. (See Recovery
can be useful in debugging a "stuck" server. If the
has to be killed manually (
doesn't work) and you see the machine being taken out of the shared memory ring, it's possible it is waiting on a mutex. You can see the shared memory mutexes with the
. If you see them all set, clear them and try again to stop/start.
executable is built as part of the Reflective Memory component of
outputs information whenever a
server is added or dropped from the reflective memory network. It also tracks
drop, hello, and restart requests over the network. This tool can be run in the background to keep track of servers coming into and going out of the network.
gshmmonitor supports one flag argument. Use just one '-' and group all the letters together.
't' shows when a token is received.
'd' shows when data is received.
'a' performs timing analysis on data packets.
'c' shows reflective memory cycle time.
'h' prints this message.
'm' mail to user in tcs.conf when reflective memory is down.
'o' prints output to screen.
's' prints output to Syslog.
This program is invoked on the command line as
where the "-o" argument allows diagnostic messages to be written to
standard output and the SYSLOG.
Upon invocation, you should see
Fri Jun 28 18:02:05.819 2013 Recognized computer obs1 with id 0 on network
Fri Jun 28 18:02:05.820 2013 Recognized computer obs2 with id 1 on network
Fri Jun 28 18:02:05.820 2013 Recognized computer obs3 with id 2 on network
Fri Jun 28 18:02:05.820 2013 Recognized computer obs4 with id 3 on network
Fri Jun 28 18:02:05.821 2013 Recognized computer obs5 with id 4 on network
Fri Jun 28 18:02:05.822 2013 Recognized computer tcs1 with id 5 on network
Fri Jun 28 18:02:05.822 2013 Recognized computer tcs2 with id 6 on network
Fri Jun 28 18:02:05.822 2013 Recognized computer tcs3 with id 7 on network
Fri Jun 28 18:02:05.823 2013 Recognized computer tcs4 with id 8 on network
Fri Jun 28 18:02:05.823 2013 Recognized computer tcs5 with id 9 on network
The information written to SYSLOG includes a timestamp and the
diagnostic messages. Basically, you would just keep this program running, and
become concerned when you see
Request to drop id 2 from id 3
Computer obs3 droppped from network!!
and you DO NOT see a corresponding
Recognized computer obs3 with id 2 on network.
Another bad thing to see is
Bad Network State. Request to reset all gshm servers from id 3.
and then you DO NOT see the servers coming back on-line.
Computer obs1 vanished from network
which means the computer took itself out of the ring.
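The check described above (a drop or vanish message with no later "Recognized" message for the same computer) can be sketched as a small log filter. This is only an illustration: the function name and the sample log are made up, and the message shapes are assumed to match the monitor lines quoted in this section.

```python
import re

# Message shapes copied from the monitor output quoted above (an assumption
# about the real format); "drop+ed" tolerates the "droppped" spelling.
GONE = re.compile(r"Computer (\S+) (?:drop+ed|vanished) from network")
BACK = re.compile(r"Recognized computer (\S+) with id \d+ on network")


def unrecovered(lines):
    """Return computers that left the ring and were not recognized again."""
    missing = []
    for line in lines:
        gone = GONE.search(line)
        back = BACK.search(line)
        if gone and gone.group(1) not in missing:
            missing.append(gone.group(1))
        elif back and back.group(1) in missing:
            missing.remove(back.group(1))
    return missing


# Hypothetical sample assembled from the messages shown in this section.
log = [
    "Fri Jun 28 18:02:05.820 2013 Recognized computer obs3 with id 2 on network",
    "Request to drop id 2 from id 3",
    "Computer obs3 droppped from network!!",
    "Computer obs1 vanished from network",
    "Recognized computer obs3 with id 2 on network.",
]
print(unrecovered(log))  # ['obs1']
```

Here obs3 drops and comes back, so only obs1 is reported as still missing.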
copious diagnostic messages in the SYSLOG. The diagnostic messages are controlled by a boolean variable,
emailcontact string email@example.com
gshmdebug bool false
contains the email address of the person to be notified by the monitor program when a problem is detected (if the "-m" argument to the
is used). Note: Please do NOT use the
-m argument at this time.
is used to write any arbitrary reflective memory variable with data. It can also read a group of reflective memory variables repeatedly, as well as list reflective memory.
DDAccess Version 1.1
Usage: DDAccess -w <name> <value>
Usage: DDAccess -r [-u <n>] <name1> <name2> <name3> ... <name10>
Usage: DDAccess -l [<sub>]
Usage: DDAccess -p
Usage: DDAccess -f <name>
Usage: DDAccess -h
-w <name> <value> Write reflective memory 'name' with 'value'.
'name' may be either public or private.
-r [-u <n>] <....> Read reflective memory repeatedly. Without -u read once.
Optional 'n' (min 100) is the number of milliseconds between reads.
'namex' may be either public or private. Exit loop with ^C
-l [<sub>] List private reflective memory names. Optional 'sub' will limit list to that subsystem.
-p List public names.
-f <name> Check if 'name' exists. May be either public or private.
-h Print usage information.
ipcs / ipcrm
ipcs is a Linux command to see interprocess communication information. It shows the shared memory segments, the shared semaphores, and message queues (which we do not use). Use ipcrm to remove items.
If you see shared memory sections owned by a user other than
, this could be a problem. The permissions are likely
so it shouldn't be necessary, but to clean up, you can stop all the GUIs on the machine and remove the sections. On restart, they will be created with the right user id.
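As a rough aid for the ownership check above, the sketch below scans text in the layout printed by `ipcs -m` on Linux (key, shmid, owner, perms, bytes, nattch) and reports segments whose owner differs from the expected one. The function name, the sample listing, and the owner names are hypothetical; substitute the account that normally runs the TCS.

```python
def foreign_segments(ipcs_text, expected_owner):
    """Return (shmid, owner) for shared memory segments not owned by expected_owner."""
    found = []
    for line in ipcs_text.splitlines():
        fields = line.split()
        # Segment rows start with a hexadecimal key such as 0x00001234;
        # banner and column-header lines do not.
        if len(fields) >= 6 and fields[0].startswith("0x"):
            shmid, owner = fields[1], fields[2]
            if owner != expected_owner:
                found.append((shmid, owner))
    return found


# Made-up listing in the `ipcs -m` layout; the values are illustrative only.
sample = """\
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00001234 32768      telescope  666        1048576    4
0x00005678 65537      someuser   666        1048576    1
"""
print(foreign_segments(sample, "telescope"))  # [('65537', 'someuser')]
```

The reported shmid values are the ones that would be passed to ipcrm after the GUIs on the machine have been stopped.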
Example troubleshooting exercise
This is out of date as of October 2016.
0) You notice that the MCSGUI is not updating on the TO Station. On the TO
Station type the following in a terminal window to
determine which servers the TO station sees
You should get a list something like
In the ideal case, tcs1 and tcs2 should each see all of the working servers and workstations, as well as themselves. If the TO Station does NOT see
all of the working servers and workstations known to be on the network, then
there is a problem. In this case, tcs2 is missing from the list.
The advice is first
on all of the
workstations (or at least on the machine missing from the above list) to
determine what they see
, and then decide on the best course
For our example, if the TO Station does not see
a particular server, it may
be because that server has been dropped off the shared memory network. It is
now necessary to determine which subsystems are running on this particular
in a terminal window to obtain the subsystems and the associated servers.
In this example, we can see that MCS
is running on tcs2:
subsystem server id cmd ip shm ip
--------- ------ -- ------ ------
PMCR tcs1 5 192.168.58.16 192.168.58.16
OSS tcs1 5 192.168.58.16 192.168.58.16
AOSL tcs1 5 192.168.58.16 192.168.58.16
PSFR tcs2 6 192.168.58.17 192.168.58.17
PCS tcs2 6 192.168.58.17 192.168.58.17
MCS tcs2 6 192.168.58.17 192.168.58.17
IIF tcs2 6 192.168.58.17 192.168.58.17
PMCL tcs3 7 192.168.58.18 192.168.58.18
LSS tcs3 7 192.168.58.18 192.168.58.18
ENV tcs3 7 192.168.58.18 192.168.58.18
PSFL tcs4 8 192.168.58.19 192.168.58.19
GCSR tcs4 8 192.168.58.19 192.168.58.19
GCSL tcs4 8 192.168.58.19 192.168.58.19
ECS tcs5 9 192.168.58.20 192.168.58.20
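To answer "which subsystems are affected if this server drops out", the table above can be grouped by server. The following is a minimal sketch that assumes the column layout shown (subsystem, server, id, cmd ip, shm ip); the function name is hypothetical and the sample is a subset of the rows above.

```python
def subsystems_by_server(table_text):
    """Map server name -> list of subsystems, from rows like 'MCS tcs2 6 <ip> <ip>'."""
    result = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows have five fields with a numeric id in the third column;
        # this skips the header line and the dashed separator.
        if len(fields) == 5 and fields[2].isdigit():
            result.setdefault(fields[1], []).append(fields[0])
    return result


sample = """\
subsystem server id cmd ip shm ip
--------- ------ -- ------ ------
OSS tcs1 5 192.168.58.16 192.168.58.16
PSFR tcs2 6 192.168.58.17 192.168.58.17
PCS tcs2 6 192.168.58.17 192.168.58.17
MCS tcs2 6 192.168.58.17 192.168.58.17
IIF tcs2 6 192.168.58.17 192.168.58.17
"""
print(subsystems_by_server(sample)["tcs2"])  # ['PSFR', 'PCS', 'MCS', 'IIF']
```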
2) From a terminal window on the TO Station, log onto the server missing from the
list generated in Step 0 (tcs2) via ssh
3) On the server type
This will show you the other machines/servers this
. You will get a list something like this:
In the ideal case, the current machine should see all the other working servers (tcs1 and tcs2). If this machine does NOT list the other server (tcs2), then you will need to stop and restart the shared memory server on AT LEAST
Stop and restart the shared memory server on the machine with the problem. You can do this with the TCSGUI
by specifying the machine name in the "Network Servers" box at the top-left and doing a
To stop and restart the network servers manually on the machine with the problem, see Using Manual Tools
Some problems we have seen in the past that no longer occur:
Identification and Fix for the Problem of Machines Dropping Off the Distributed Shared Memory Network Sept-2006
- machines drop from the shared memory network
- heavy use of resources (i.e., a low percentage of idle time or a large "load average") causes problems
- more than one server is using the same ID (seen in the
gshmconfig -l output)
Note that this is still possible, but will not happen if the system is started using the TCSGUI. The problem can occur if you issue
netconfig start simultaneously on several servers.
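The duplicate-ID condition can be checked mechanically once server names and IDs have been collected. Because the layout of the listing is not reproduced here, this sketch takes (server, id) pairs as input rather than parsing any particular output; the function name and the sample pairs are hypothetical.

```python
def shared_ids(pairs):
    """Return {id: sorted server names} for ids claimed by more than one server."""
    servers_by_id = {}
    for server, ident in pairs:
        servers_by_id.setdefault(ident, set()).add(server)
    return {i: sorted(s) for i, s in servers_by_id.items() if len(s) > 1}


# Illustrative input: tcs3 has wrongly come up with the id used by tcs2.
pairs = [("tcs1", 5), ("tcs2", 6), ("tcs3", 6), ("tcs2", 6)]
print(shared_ids(pairs))  # {6: ['tcs2', 'tcs3']}
```

An empty result means every ID is used by exactly one server.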