TCS Reflective Memory Troubleshooting
The LBT Reflective Memory system provides the mechanism for storing state and status for all subsystems. It consists of definition files and the DataDictionary, ReflectiveMemory, and SetValue parts of the Common Software.
All computers have local access to the same data. The local memory is implemented as a shared memory segment, and a network daemon keeps all the segments updated. Only the owning subsystem may write its reflective memory, but anyone can read it. (But see DDAccess below.)
Troubleshooting
A problem with the reflective memory is usually manifested as a GUI turning red because values are stale, or a GUI freezing, or in rare instances a single value not updating when it should.
As of 2013, the only problems we see with this system are:
- timeouts related to deadlocks that occur when the token packet is received out of order. The system recovers from these errors automatically by restarting.
- permissions problems, which can occur if servers are started by different users
- machines occasionally "vanish" from the shared memory ring, which can leave the gshmserver stuck
The tools used for diagnosing problems with the reflective memory are gshmconfig and gshmmonitor.
gshmconfig
The gshmconfig program provides configuration and control of the reflective memory system.
This program should not be used by the mountain staff; they should always use the TCSGUI. gshmconfig may be used by experts under certain conditions.
Usage: gshmconfig command subsystem
Configure the gshm server for global memory
Command      Description
-------      -----------
help, -h     Print usage information
list, -l     List all current parameters
start        Start the server
stop         Stop the server
broad, -b    Get the broadcast address for gshm to use
key, -k      Get the key for the shared memory
size, -s     Get the size of the shared memory
reset, -r    Reset the mutexes in the memory
inspect, -i  Inspect the mutexes in the memory
remove, -m   Remove the shared memory segement
zero, -z     Zero the shared memory segement
ZERO, -Z     Zero the shared memory segement even if in use
param, -p    Set GshmServer parameters (in any order; unspecified will not be changed)
             -v verbose_flag -t timeout (ms) -l late_data_threshold (ms) -m max_lock_tries -d token delay (us)
To determine which machines have been configured, are on-line running the TCS, and are connected to the shared memory network, log in to the TO station (obs1) as telescope and run the program:
ssh telescope@obs1
[telescope@obs1 ~]$ gshmconfig -l
KEY: 54
SIZE: 1024
ID Host
-- -------
0 tcs1
1 tcs2
2 tcs3
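The listing above is easy to process in a script. The following is a minimal sketch in Python, assuming the output always has the `KEY:`/`SIZE:` lines followed by `ID Host` rows as shown; the function name `parse_gshmconfig_list` is hypothetical and not part of the TCS software.

```python
import re


def parse_gshmconfig_list(text):
    """Parse text shaped like `gshmconfig -l` output into (key, size, {id: host})."""
    key = size = None
    hosts = {}
    for raw in text.splitlines():
        line = raw.strip()
        # "KEY: 54" and "SIZE: 1024" header lines
        m = re.match(r"^(KEY|SIZE):\s*(\d+)$", line)
        if m:
            if m.group(1) == "KEY":
                key = int(m.group(2))
            else:
                size = int(m.group(2))
            continue
        # "0 tcs1" style rows; the "ID Host" and "-- -------" lines do not match
        m = re.match(r"^(\d+)\s+(\S+)$", line)
        if m:
            hosts[int(m.group(1))] = m.group(2)
    return key, size, hosts


sample = """KEY: 54
SIZE: 1024
ID Host
-- -------
0 tcs1
1 tcs2
2 tcs3
"""

key, size, hosts = parse_gshmconfig_list(sample)
print(key, size, sorted(hosts.values()))
```

In practice the text would come from running the command (for example via `subprocess.run(["gshmconfig", "-l"], capture_output=True, text=True).stdout`) rather than from a pasted sample.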
The gshmconfig program is also how you start/stop the TCS reflective memory system. (See Recovery below.)
gshmconfig can be useful in debugging a "stuck" server. If the gshmserver has to be killed manually (netconfig stop doesn't work) and you see the machine being taken out of the shared memory ring, it may be waiting on a mutex. You can inspect the shared memory mutexes with the -i option of gshmconfig. If they are all set, clear them with the -r option and try the stop/start again.
gshmmonitor
The gshmmonitor executable is built as part of the Reflective Memory component of Common Software. gshmmonitor outputs information whenever a server is added to or dropped from the reflective memory network. It also tracks drop, hello, and restart requests over the network. This tool can be run in the background to keep track of servers coming into and going out of the network.
Usage:
gshmmonitor supports one flag argument. Use just one '-' and group all the letters together.
't' shows when a token is received.
'd' shows when data is received.
'a' performs timing analysis on data packets.
'c' shows reflective memory cycle time.
'h' prints this message.
'm' mail to user in tcs.conf when reflective memory is down.
'o' prints output to screen.
's' prints output to Syslog.
This program is invoked on the command line as
$ gshmmonitor -o
where the "-o" argument allows diagnostic messages to be written to both standard output and the SYSLOG.
Upon invocation, you should see
Fri Jun 28 18:02:05.819 2013 Recognized computer obs1 with id 0 on network
Fri Jun 28 18:02:05.820 2013 Recognized computer obs2 with id 1 on network
Fri Jun 28 18:02:05.820 2013 Recognized computer obs3 with id 2 on network
Fri Jun 28 18:02:05.820 2013 Recognized computer obs4 with id 3 on network
Fri Jun 28 18:02:05.821 2013 Recognized computer obs5 with id 4 on network
Fri Jun 28 18:02:05.822 2013 Recognized computer tcs1 with id 5 on network
Fri Jun 28 18:02:05.822 2013 Recognized computer tcs2 with id 6 on network
Fri Jun 28 18:02:05.822 2013 Recognized computer tcs3 with id 7 on network
Fri Jun 28 18:02:05.823 2013 Recognized computer tcs4 with id 8 on network
Fri Jun 28 18:02:05.823 2013 Recognized computer tcs5 with id 9 on network
The information written to SYSLOG includes a timestamp and the diagnostic messages. In normal use, keep this program running and become concerned when you see
Request to drop id 2 from id 3
Computer obs3 droppped from network!!
and you DO NOT see a corresponding
Recognized computer obs3 with id 2 on network.
Another bad sign is
Bad Network State. Request to reset all gshm servers from id 3.
when you then DO NOT see the servers coming back on-line.
A third is
Computer obs1 vanished from network
which means the computer took itself out of the ring.
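The check described above (a "dropped" or "vanished" message with no later "Recognized" message for the same computer) can be automated. This is a minimal sketch in Python, assuming the message wording matches the examples quoted in this section; the function name `hosts_still_missing` is hypothetical, and the "Bad Network State" reset request is not handled here.

```python
import re

# Patterns follow the message wording quoted in this section (an assumption).
# "drop+ed" tolerates spelling variants of "dropped" in the message text.
RECOGNIZED = re.compile(r"Recognized computer (\S+) with id \d+ on network")
LOST = re.compile(r"Computer (\S+) (?:drop+ed|vanished) from network")


def hosts_still_missing(lines):
    """Return hosts that dropped or vanished and were not recognized again later."""
    missing = []
    for line in lines:
        m = LOST.search(line)
        if m:
            if m.group(1) not in missing:
                missing.append(m.group(1))
            continue
        m = RECOGNIZED.search(line)
        if m and m.group(1) in missing:
            # The host came back onto the network
            missing.remove(m.group(1))
    return missing


log = [
    "Fri Jun 28 18:02:05.820 2013 Recognized computer obs3 with id 2 on network",
    "Request to drop id 2 from id 3",
    "Computer obs3 droppped from network!!",
    "Computer obs1 vanished from network",
    "Recognized computer obs3 with id 2 on network",
]

print(hosts_still_missing(log))
```

Here obs3 drops and is recognized again, so only obs1 remains in the result. The input lines could be read from the saved output of `gshmmonitor -o` or from the syslog.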
The gshmmonitor can generate copious diagnostic messages in the SYSLOG. The diagnostic messages are controlled by a boolean variable, gshmdebug, in the tcs.conf file.
# GSHMMONITOR
emailcontact string cbiddick@lbto.org
gshmdebug bool false
The emailcontact variable contains the email address of the person to be notified by the monitor program when a problem is detected (if the "-m" argument to gshmmonitor is used).
Note: Please do NOT use the -m argument at this time.
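To check these settings from a script, the fragment above can be read with a few lines of Python. This is only a sketch: it assumes each entry is a whitespace-separated `name type value` line and that lines starting with `#` are comments, as in the fragment shown. Only the `string` and `bool` types appear there, so any other type is returned as raw text. The function name `parse_conf` is hypothetical.

```python
def parse_conf(text):
    """Parse 'name type value' lines into a dict (format assumed from the fragment)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # not a complete entry
        name, kind, value = parts
        if kind == "bool":
            settings[name] = value.strip().lower() == "true"
        else:
            # "string" and any type not shown in the fragment: keep the raw text
            settings[name] = value.strip()
    return settings


sample_conf = """# GSHMMONITOR
emailcontact string cbiddick@lbto.org
gshmdebug bool false
"""

settings = parse_conf(sample_conf)
print(settings["gshmdebug"], settings["emailcontact"])
```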
DDAccess
DDAccess can write data to any reflective memory variable. It can also read a group of reflective memory variables repeatedly, and list reflective memory names.
DDAccess Version 1.1
Usage: DDAccess -w <name> <value>
Usage: DDAccess -r [-u <n>] <name1> <name2> <name3> ... <name10>
Usage: DDAccess -l [<sub>]
Usage: DDAccess -p
Usage: DDAccess -f <name>
Usage: DDAccess -h
Command             Description
-------             -----------
-w <name> <value>   Write reflective memory 'name' with 'value'.
                    'name' may be either public or private.
-r [-u <n>] <....>  Read reflective memory repeatedly. Without -u read once.
                    Optional 'n' (min 100) is the number of milliseconds between reads.
                    'namex' may be either public or private. Exit loop with ^C
-l [<sub>]          List private reflective memory names. Optional 'sub' will limit list to that subsystem.
-p                  List public names.
-f <name>           Check if 'name' exists. May be either public or private.
-h                  Print usage information.
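When DDAccess reads are driven from a script, it helps to validate the arguments before calling the program. The sketch below builds the argument list for the `-r` form and enforces the 100 ms minimum from the usage text. The limit of ten names is inferred from the `<name1> ... <name10>` usage line and is an assumption, as is the helper name `ddaccess_read_args`. The variable names in the example are placeholders, not real reflective memory names.

```python
def ddaccess_read_args(names, interval_ms=None):
    """Build the argv for `DDAccess -r [-u <n>] <name> ...` without running it."""
    names = list(names)
    # Assumed limit, inferred from "<name1> ... <name10>" in the usage text
    if not 1 <= len(names) <= 10:
        raise ValueError("expected between 1 and 10 names")
    args = ["DDAccess", "-r"]
    if interval_ms is not None:
        # The usage text gives 100 as the minimum number of milliseconds
        if interval_ms < 100:
            raise ValueError("interval must be at least 100 ms")
        args += ["-u", str(interval_ms)]
    return args + names


# Placeholder names for illustration only
print(ddaccess_read_args(["example.name1", "example.name2"], interval_ms=500))
```

The resulting list can be passed to `subprocess.run`. With `-u` the program loops until interrupted with ^C, so a caller would need to terminate it explicitly.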
ipcs / ipcrm
ipcs is a Linux command that shows interprocess communication information: the shared memory segments, the shared semaphores, and the message queues (which we do not use). Use ipcrm to remove items.
If you see shared memory segments owned by a user other than tcs, this could be a problem. The permissions are likely 777, so removing them shouldn't be necessary, but to clean up, you can stop all the GUIs on the machine and remove the segments. On restart, they will be created with the right user id.
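The ownership check can be scripted against the output of `ipcs -m`. The following sketch assumes the usual Linux column layout (key, shmid, owner, perms, bytes, nattch, status). The sample text is illustrative, not captured from a TCS machine, and the function name `foreign_segments` is hypothetical.

```python
def foreign_segments(ipcs_output, expected_owner="tcs"):
    """List shared memory segments in `ipcs -m` output not owned by expected_owner."""
    found = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows start with a hex key such as 0x00000036
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        key, shmid, owner, perms = fields[:4]
        if owner != expected_owner:
            found.append({
                "key": int(key, 16),
                "shmid": int(shmid),
                "owner": owner,
                "perms": perms,
            })
    return found


# Illustrative sample only
sample_ipcs = """
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000036 98304      tcs        777        1024       4
0x00000037 131073     telescope  777        1024       1
"""

for seg in foreign_segments(sample_ipcs):
    print(seg["shmid"], seg["owner"], seg["perms"])
```

Other software on the machine may also create shared memory segments under other user names, so the result should be reviewed by hand before anything is removed with `ipcrm -m <shmid>`.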
Example troubleshooting exercise
This is out of date as of October 2016.
0) You notice that the MCSGUI is not updating on the TO Station. On the TO Station, type the following in a terminal window to determine which servers the TO Station sees:
$ gshmconfig -l
You should get a list something like
KEY: 54
SIZE: 1024
ID Host
-- -------
0 tcs1
1 tcs2
In the ideal case, tcs1 and tcs2 should each see all of the working servers and workstations, as well as themselves. If the TO Station does NOT list all of the working servers and workstations known to be on the network, then there is a problem. In this case, tcs2 is missing from the list.
The advice is first to use gshmconfig -l on all of the servers and workstations (or at least on the machine missing from the above list) to determine what they see, and then decide on the best course of action.
For our example, if the TO Station does not see a particular server, it may be because that server has been dropped off the shared memory network. It is now necessary to determine which subsystems are running on this particular server.
1) Invoke
$ netconfig -l
in a terminal window to obtain the subsystems and the associated servers. In this example, we can see that MCS is running on tcs2:
subsystem  server  id  cmd ip         shm ip
---------  ------  --  ------         ------
PMCR       tcs1    5   192.168.58.16  192.168.58.16
OSS        tcs1    5   192.168.58.16  192.168.58.16
AOSL       tcs1    5   192.168.58.16  192.168.58.16
PSFR       tcs2    6   192.168.58.17  192.168.58.17
PCS        tcs2    6   192.168.58.17  192.168.58.17
MCS        tcs2    6   192.168.58.17  192.168.58.17
IIF        tcs2    6   192.168.58.17  192.168.58.17
PMCL       tcs3    7   192.168.58.18  192.168.58.18
LSS        tcs3    7   192.168.58.18  192.168.58.18
ENV        tcs3    7   192.168.58.18  192.168.58.18
PSFL       tcs4    8   192.168.58.19  192.168.58.19
GCSR       tcs4    8   192.168.58.19  192.168.58.19
GCSL       tcs4    8   192.168.58.19  192.168.58.19
ECS        tcs5    9   192.168.58.20  192.168.58.20
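Finding the subsystems hosted on a given server can also be done in a script. This is a minimal sketch in Python, assuming every data row of the `netconfig -l` listing has five whitespace-separated fields with a numeric id in the third column, as in the table above; the function name `subsystems_by_server` is hypothetical.

```python
from collections import defaultdict


def subsystems_by_server(netconfig_output):
    """Map each server to the subsystems listed for it in `netconfig -l` style text."""
    by_server = defaultdict(list)
    for line in netconfig_output.splitlines():
        fields = line.split()
        # Data rows: subsystem, server, numeric id, cmd ip, shm ip.
        # The header and dashed separator lines fail this test and are skipped.
        if len(fields) == 5 and fields[2].isdigit():
            subsystem, server = fields[0], fields[1]
            by_server[server].append(subsystem)
    return dict(by_server)


# A subset of the rows shown above
sample_netconfig = """subsystem server id cmd ip shm ip
--------- ------ -- ------ ------
PMCR tcs1 5 192.168.58.16 192.168.58.16
PSFR tcs2 6 192.168.58.17 192.168.58.17
PCS tcs2 6 192.168.58.17 192.168.58.17
MCS tcs2 6 192.168.58.17 192.168.58.17
IIF tcs2 6 192.168.58.17 192.168.58.17
ECS tcs5 9 192.168.58.20 192.168.58.20
"""

print(subsystems_by_server(sample_netconfig).get("tcs2", []))
```

For the example in this exercise, looking up tcs2 gives the subsystems that are affected while that server is off the shared memory network.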
2) From a terminal window on the TO Station, log onto the server missing from the list generated in Step 0 (tcs2) via ssh:
$ ssh tcs2
3) On the server, type
$ gshmconfig -l
This will show you the other machines/servers this machine sees. You will get a list something like this:
KEY: 54
SIZE: 1024
ID Host
-- -------
0 tcs1
In the ideal case, the current machine should see all the other working servers (tcs1 and tcs2). If this machine does NOT list the other server (tcs2), then you will need to stop and restart the shared memory server on AT LEAST this machine.
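Comparing what each machine sees against the hosts known to be working is a simple set difference. The sketch below assumes the host lists have already been collected from `gshmconfig -l` on each machine; the function name `missing_hosts` and the data layout are hypothetical.

```python
def missing_hosts(expected, views):
    """Report, per machine, which expected hosts are absent from its host list.

    expected: hosts known to be working.
    views: {machine: hosts that machine lists in `gshmconfig -l`}.
    """
    report = {}
    for machine, seen in views.items():
        missing = sorted(set(expected) - set(seen))
        if missing:
            report[machine] = missing
    return report


# Host lists taken from the two listings in this exercise
views = {
    "obs1": ["tcs1", "tcs2"],  # Step 0, on the TO Station
    "tcs2": ["tcs1"],          # Step 3, on tcs2
}

print(missing_hosts(["tcs1", "tcs2"], views))
```

Any machine that appears in the report is a candidate for the stop and restart described under Recovery.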
Recovery
Stop and restart the shared memory server on the machine with the problem. You can do this with the TCSGUI by specifying the machine name in the "Network Servers" box at the top-left and doing a Stop and then a Start.
To stop and restart the network servers manually on the machine with the problem, see Using Manual Tools.
History
Some problems we saw in the past that no longer occur:
- machines drop from the shared memory network
- heavy use of resources (e.g., a low percentage of idle time or a large "load average") causes problems
- more than one server is using the same ID (seen in the gshmconfig -l output)
Note that this is still possible, but will not happen if the system is started using the TCSGUI. The problem can occur if you issue netconfig start simultaneously on several servers.
Identification and Fix for the Problem of Machines Dropping Off the Distributed Shared Memory Network Sept-2006