Identification and Fix for the Problem of Machines Dropping Off the Distributed Shared Memory Network
The purpose of this document is to aid in identifying a shared memory problem,
as well as to provide guidance on how to address it. As a baseline, the user
should know which machines on the mountain are on and running the TCS before
troubleshooting begins.
See Shared Memory Troubleshooting
for more details on current troubleshooting and how to use the gshmmonitor program.
The following text conventions are used in this document.
Descriptive information is in this font.
Words you should type are in a fixed font.
Computer responses are in a bold, fixed font.
Michele De La Peña, 19 September 2006
The TO Stations are the machines where the GUIs are run by default.
Tucson TO Station: rm580h-1
Servers: rm580h-1 is currently running the whole TCS
Mountain TO Station: obs1
Mountain Workstations: obs2, obs3, obs4, obs5, obs6
Servers: tcs1, tcs2
The assumption in these directions is that you are running the TCS as user telescope.
Under the assumption that the TCS is working properly, you can determine
which machines have been configured, are on-line running the TCS,
and are connected to the shared memory network by running the gshmconfig
program on the Mountain TO Station (obs1). The TCS
should be working properly after a new install of the software system, once
the Configuration Manager has restarted all of the usable systems.
The command will show all of the machines/servers the current
machine can see, including itself. One should get a list
of all the connected machines and servers.
Full Discussion of the Shared Memory Problem
At this time there is a problem with shared memory which manifests itself as a GUI or GUIs freezing.
If a GUI is completely frozen, it is most likely because the TO Station no longer sees
the shared memory for the subsystem associated with the GUI. For example,
the ECSGUI appears to be frozen (i.e., it is no longer updating) on the Mountain TO
Station, obs1, because the server (e.g., tcs5) upon which the ECS
Subsystem is running has been dropped from the shared memory network.
Alternatively, the Mountain TO Station, obs1, has been dropped from the
shared memory on the server upon which the corresponding subsystem is running.
Please note that often when this starts to happen, the entire shared memory
network can deteriorate, which may require stopping and starting the
shared memory on all the servers, as well as on the Mountain TO Station and
any other workstations being used.
Identifying the Problem
While a frozen GUI is indicative of the shared memory problem, there are
better ways to confirm that you are indeed seeing this particular problem.
The two methods used to identify definitively that the shared memory
problem is occurring are:
A) use of the gshmmonitor program, and B) manual use of gshmconfig
in conjunction with a GUI freezing. It is instructive to use both of these
methods, and they are described in detail in the next section.
It has been observed that heavy use of resources (i.e., a low percentage of idle time
or a large "load average" as reported by top) on the Mountain TO Station can trigger
the shared memory problem. For example, the Web Cameras which are used to view various
parts of the observatory display their images in a browser running on the TO Station.
One camera in particular, Web Camera 4 (using its original camera settings), in conjunction
with all of the other processes running on the machine, would use all
available idle time on the machine to display an image with a variety of intensity levels
(versus a black image, which took far fewer resources). Through empirical analysis, it
was observed that by toggling the use/non-use of Web Camera 4, the resources on the TO
Station were correspondingly saturated and released. The gshmmonitor program was in use
at the same time, and the servers were seen to drop from and reconnect to the shared memory
network.
Further, when using the gshmmonitor program during a test period, the servers were observed to
disconnect and reconnect in rapid succession. This does not seem to be a reasonable situation,
but during the test period the servers always reconnected. We are looking for the situation
in which a server fails to reconnect; the observed behaviour might be a precursor to a server
failing to reconnect.
Finally, it has been speculated that the heavy resource use on the workstations/servers
makes the network busy, and that this is actually what causes the problem.
The shared memory server polls the workstations/servers on the network to ensure they
are running. If the network traffic is high and/or the workstations/servers are busy, it
is possible that an individual machine does not respond within the specified amount of time.
In that case, the shared memory server attempts to recover the machine, and this (under certain
circumstances) may start the decline of the shared memory network.
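As an illustration, here is a minimal sketch, in C++, of the kind of liveness polling
described above. All names and the timeout value are assumptions made for illustration;
this is not the actual GshmServer code.

    #include <chrono>
    #include <iostream>
    #include <string>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    // Hypothetical record for one machine on the shared memory network.
    struct Peer {
        std::string host;
        Clock::time_point lastReply;  // updated whenever a poll reply arrives
    };

    // Assumed poll timeout; the real value lives in the TCS configuration.
    constexpr auto kPollTimeout = std::chrono::seconds(2);

    // Scan the peers; any machine that has not answered within the window is
    // treated as dropped, and recovery is attempted as described above.
    void checkPeers(std::vector<Peer>& peers) {
        const auto now = Clock::now();
        for (auto& p : peers) {
            if (now - p.lastReply > kPollTimeout) {
                // A busy machine or congested network can miss the window even
                // though the machine is healthy; the recovery attempt may then
                // start the decline of the shared memory network.
                std::cout << "no reply from " << p.host
                          << ", attempting recovery\n";
            }
        }
    }

    int main() {
        std::vector<Peer> peers{{"tcs1", Clock::now()}, {"tcs2", Clock::now()}};
        checkPeers(peers);  // nothing reported: both peers just replied
    }

Note that, from the poller's point of view, a machine that is merely slow is
indistinguishable from one that has actually failed; this is exactly the
ambiguity described above.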
Update 15 September 2006
Attempted to diagnose the "shared memory" problem. Loading the TO Station
such that it had 0% idle time using five (!) PCSGUIs was not sufficient to
trigger machines being dropped and reacquired on the network. Only when we
used a high-resolution web camera with a fast refresh rate were we able
to trigger the drop/reacquire cycle, as observed using "gshmmonitor -o".
I also changed LBT.conf on all the servers and the TO Station to
"gshmdebug true"; this allows the gshmserver to issue verbose diagnostic
messages. No messages were issued in any of the SYSLOGs.
Torsten speculated that perhaps a "user-level" process is not sufficient
to be the root cause; rather, a kernel-level process would have priority
to grab resources, and perhaps this is a clue. It should be noted
that since Alex fixed Web Camera 4 to use a lower resolution and a slower
refresh rate (and the Web Cameras are no longer run on the TO Station),
this problem has not been reported.
Update 22 September 2006
After Paul and I (Tony) reviewed the GshmServer code, we found the probable
source of this problem.
The shared memory network is roughly a token ring network. Each GshmServer on the network
is supposed to have a unique Id. These Ids are determined dynamically, based roughly on
the order in which the servers join.
The token nature of the network ensures that servers never broadcast memory data
at the same time. Unfortunately, the code allows a server to broadcast a join message
without appropriate safeguards to avoid collisions. For instance, two servers that wish
to join at roughly the same time (within ~1 second of each other) will likely send their
join messages at the same time, causing a collision. Because our network is full-duplex,
this collision is not total: some servers will receive a join from one server, some will receive
a join from the other, and some will receive no join notice at all.
This has two ramifications. One, since the two servers are joining simultaneously, they
will assume the same Id. Two, because of the nature of the collision, the servers' lists
of other server Ids will not be identical. In the way we have implemented the token
ring, the servers must have identical server lists for correct token propagation.
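A minimal sketch, in C++, under a deliberately simplified model of the join logic
(the real GshmServer code is more involved), showing how a collided join produces
both a duplicate Id and divergent server lists:

    #include <iostream>
    #include <set>

    // Hypothetical, simplified model of a GshmServer choosing its Id on join.
    struct Server {
        int id = -1;
        std::set<int> known;  // Ids of the servers whose joins this one has seen

        // Ids are assigned dynamically from what has been seen so far; with no
        // arbitration, simultaneous joiners make the same choice.
        void join() { id = known.empty() ? 0 : *known.rbegin() + 1; }
    };

    int main() {
        Server a, b;
        a.known = {0};  // a saw existing server 0, but b's join broadcast was lost
        b.known = {0};  // b likewise never saw a's join: the broadcasts collided

        a.join();  // a picks Id 1
        b.join();  // b also picks Id 1 -- ramification one: a duplicate Id

        std::cout << "a=" << a.id << " b=" << b.id << '\n';  // prints a=1 b=1

        // Ramification two: some existing servers received only a's join and
        // others only b's, so their member lists diverge, and the token ring
        // (which requires identical lists everywhere) no longer propagates.
    }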
Update 10 November 2010
The shared memory system was extensively re-worked in March 2007, December 2008, and July 2009. The 'join' collisions mentioned in the above update are avoided by always starting the network servers with a delay of a few seconds between starts. The other problems described here are no longer happening. However, network timeouts still occur which cause the shared memory network to re-initialize itself. This does not cause any operational problems, but it would be nice to uncover the root cause.
Update 13 November 2013
A couple of years ago, the cause of the network timeouts was found to be UDP token packets arriving in a different order on different servers. This only happens in clusters with three or more servers, and can (but not always!) produce a deadlock condition in which machine a is waiting for machine b, machine b is waiting for machine c, and machine c is waiting for machine b. When the timeout is detected, a restart is requested. Each server has a slightly longer timeout (by 5 msec) than the next lower numbered one, so usually server 0 sees the timeout first and does a clean restart.
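A minimal sketch, in C++, of the staggered timeout scheme. Only the 5 msec
per-server increment comes from the text above; the base timeout value is an
assumption.

    #include <chrono>
    #include <iostream>

    // Assumed base timeout; each higher numbered server waits 5 msec longer,
    // so server 0 usually detects a stalled token first.
    constexpr auto kBaseTimeout = std::chrono::milliseconds(100);

    std::chrono::milliseconds tokenTimeout(int serverId) {
        return kBaseTimeout + std::chrono::milliseconds(5 * serverId);
    }

    int main() {
        for (int id = 0; id < 3; ++id)
            std::cout << "server " << id << ": "
                      << tokenTimeout(id).count() << " ms\n";
        // server 0: 100 ms, server 1: 105 ms, server 2: 110 ms
        // The first server to notice the stalled token requests the restart,
        // and the stagger makes this almost always server 0.
    }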
Using gshmmonitor and gshmconfig
See Shared Memory Troubleshooting
for details on how to use the gshmmonitor program.
-- Main.delapena - 16 Jul 2006