Identification and Fix for the Problem of Machines Dropping Off the Distributed Shared Memory Network

The purpose of this document is to aid in identifying a shared memory problem and to provide guidance on how to address it. As a baseline, the user should know which machines on the mountain are on and running the TCS before problems arise.

See Shared Memory Troubleshooting for more details on current troubleshooting and how to use the gshmmonitor and gshmconfig tools.

The following text conventions are used in this document.
      Descriptive information is in this font.
     Words you should type are in a fixed font.
     Computer responses are in a bold, fixed font.

Michele De La Peña 19 September 2006

Background Information

The TO Stations are the machines where the GUIs, by default, are run.

Tucson TO Station: rm580h-1
Tucson TCS Servers: rm580h-1 is currently running the whole TCS system

Mountain TO Station: obs1
Mountain Workstations: obs2, obs3, obs4, obs5, obs6
Mountain TCS Servers: tcs1, tcs2

The assumption in these directions is that you are running the TCS as user telescope.

Under the assumption that the TCS is working properly, you can determine which machines have been configured, are on-line running the TCS, and are connected to the shared memory network by doing the following on the Mountain TO Station (obs1):
     $ gshmconfig -l
The TCS should be working properly after a new install of the software system and after the Configuration Manager has restarted all of the usable systems. The above command shows all of the machines/servers the current machine sees, including itself. The output should look something like this:

     KEY: 54 SIZE: 1024
     ID Host
     -- -------
     0  tcs1
     1  tcs2
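
A listing like this can be compared against the machines you expect to be on-line. The following is a minimal Python sketch, assuming the listing consists of a header, a dashed rule, and then alternating ID and host name fields as in the sample. The function names parse_gshmconfig_listing and missing_hosts are illustrative and are not part of the TCS tools.

```python
def parse_gshmconfig_listing(text):
    """Return {id: host} parsed from a `gshmconfig -l` style listing."""
    tokens = text.split()
    # Skip everything up to and including the dashed rule under "ID Host".
    start = 0
    for i, tok in enumerate(tokens):
        if set(tok) == {"-"}:
            start = i + 1
    body = tokens[start:]
    hosts = {}
    # Remaining fields are assumed to alternate: numeric ID, host name.
    for ident, host in zip(body[0::2], body[1::2]):
        if ident.isdigit():
            hosts[int(ident)] = host
    return hosts


def missing_hosts(text, expected):
    """Return the expected hosts that do not appear in the listing."""
    seen = set(parse_gshmconfig_listing(text).values())
    return sorted(set(expected) - seen)


# Sample listing from this document, plus a hypothetical degraded listing.
sample = "KEY: 54 SIZE: 1024\nID Host\n-- -------\n0 tcs1\n1 tcs2\n"
degraded = "KEY: 54 SIZE: 1024\nID Host\n-- -------\n0 tcs1\n"

print(parse_gshmconfig_listing(sample))
print(missing_hosts(sample, ["tcs1", "tcs2"]))
print(missing_hosts(degraded, ["tcs1", "tcs2"]))
```

Recording the healthy listing as a baseline makes it easier to see later which machine has been dropped.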

Full Discussion of the Shared Memory Problem

At this time there is a problem with shared memory which manifests itself as a GUI or GUIs freezing. If a GUI is completely frozen, it is most likely because the TO station no longer sees the shared memory for the subsystem associated with the GUI. For example, the ECSGUI appears to be frozen (i.e., it is no longer updating) on the Mountain TO Station, obs1, because the server (e.g., tcs5) upon which the ECS Subsystem is running has been dropped from the shared memory network. Alternatively, the Mountain TO Station, obs1, has been dropped from the shared memory on the server upon which the corresponding subsystem is running. Please note that once this starts to happen, the entire shared memory network can often deteriorate, and recovery may require stopping and starting the shared memory on all the servers, as well as on the Mountain TO Station and any other workstations being used.

Identifying the Problem

While a frozen GUI is indicative of the shared memory problem, there are better ways to pinpoint that, indeed, you are seeing this particular problem. The two methods used to identify definitively that the shared memory problem is happening are: A) use of the gshmmonitor program, and B) manual use of gshmconfig in conjunction with a GUI freezing. It is instructive to use both of these methods, and they are described in detail in the next section.

It has been observed that heavy use of resources (e.g., a low percentage of idle time or a large "load average" as reported by top) on the Mountain TO Station can trigger the shared memory problem. For example, the Web Cameras which are used to view various parts of the observatory display their images in a browser running on the TO Station. One camera in particular, Web Camera 4 (using its original camera settings), in conjunction with all of the other processes running on the machine, would use all available idle time on the machine to display an image with a variety of intensity levels (versus a black image, which took far fewer resources). Empirically, toggling Web Camera 4 on and off saturated and released the resources on the TO Station correspondingly. The gshmmonitor program was in use at the same time, and the servers were seen to drop and reconnect to the shared memory network.
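
To check whether a workstation is in this saturated state without watching top, the load average can be read programmatically. The following Python sketch uses the standard os.getloadavg() call (Unix only). The threshold of 1.0 per CPU and the function name is_saturated are assumptions chosen for illustration, not values taken from the TCS.

```python
import os


def is_saturated(load_1min, cpu_count, threshold=1.0):
    """True when the 1-minute load average per CPU exceeds `threshold`.

    The default threshold of 1.0 is an assumed rule of thumb.
    """
    return (load_1min / cpu_count) > threshold


if __name__ == "__main__":
    load_1min, _, _ = os.getloadavg()  # 1, 5 and 15 minute load averages
    cpus = os.cpu_count() or 1
    state = "saturated" if is_saturated(load_1min, cpus) else "ok"
    print(f"load={load_1min:.2f} cpus={cpus} state={state}")
```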

Further, when using the gshmmonitor program during a test period, the servers had been observed to disconnect and reconnect in rapid succession. This does not seem to be a reasonable situation, but during the test period, the servers always reconnected. We are looking for the situation when a server fails to reconnect; the observed behaviour might be a precursor to a server failing to reconnect.

Finally, it has been speculated that the heavy resource use on the workstations/servers causes the network to be busy, and that this is actually what is causing the problem. The shared memory server polls the workstations/servers on the network to ensure they are running. If the network traffic is high and/or the workstations/servers are busy, an individual machine may not respond within the specified amount of time. In that case, the shared memory server attempts to recover the machine, and this (under certain circumstances) may start the decline of the shared memory network.
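
The speculated mechanism can be illustrated with a small Python model of one polling round. This is only a sketch of the idea and not the GshmServer code: the function name poll_round, the 10 ms timeout, and the response times are all hypothetical.

```python
def poll_round(response_times, timeout):
    """Split hosts into (alive, timed_out) for one polling round.

    response_times maps host -> seconds taken to reply, or None if
    the host never replied.
    """
    alive, timed_out = [], []
    for host, elapsed in sorted(response_times.items()):
        if elapsed is not None and elapsed <= timeout:
            alive.append(host)
        else:
            timed_out.append(host)
    return alive, timed_out


TIMEOUT = 0.010  # hypothetical 10 ms reply deadline

# Hypothetical reply times: a quiet network, then obs1 under heavy load.
quiet = {"tcs1": 0.002, "tcs2": 0.003, "obs1": 0.004}
busy = {"tcs1": 0.002, "tcs2": 0.003, "obs1": 0.050}

print(poll_round(quiet, TIMEOUT))
print(poll_round(busy, TIMEOUT))
```

In this model a machine that is running but slow is indistinguishable from one that is down, which is consistent with the observation that a loaded TO Station is dropped and later reacquired.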

Update 15 September 2006

Attempted to diagnose the "shared memory" problem. Loading the TO Station such that it had 0% idle time using five (!) PCSGUIs was not sufficient to trigger machines being dropped and reacquired on the network. Only when we used a high-resolution web camera with a fast refresh rate were we able to trigger the drop/reacquire cycle as observed using "gshmmonitor -o". I also changed LBT.conf on all the servers and the TO station to "gshmdebug true"; this allows the gshmserver to issue verbose diagnostic messages. No messages were issued in any of the SYSLOGs. Torsten speculated that perhaps a "user-level" process is not sufficient to be the root cause. Rather, a kernel-level process would have priority to grab resources, and perhaps this is a clue. It should be noted that since Alex fixed Web Camera 4 to be lower resolution and to not have such a fast refresh rate (and they do not run the Web cameras on the TO Station anymore), this problem has not been reported.

Update 22 September 2006

After Paul and I (Tony) did a review of the GshmServer code we found the probable source of this problem. The shared memory network is roughly a token ring network. Each GshmServer on the network is supposed to have a unique Id. These Ids are determined dynamically based roughly on the order the servers join.

The token nature of the network ensures that servers never broadcast memory data at the same time. Unfortunately, the code allows a server to broadcast a join message without appropriate safeguards to avoid collisions. For instance, two servers that wish to join at roughly the same time (within ~1 second of each other) will likely send their join messages at the same time, causing a collision. Because our network is full-duplex, this collision is not total. Some servers will receive a join from one server, some will receive a join from the other, and some will receive no join notice.

This has two ramifications. One, since the two servers are joining simultaneously, they will assume the same Id. Two, because of the nature of the collision, the servers' lists of other server Ids will not be identical. For the way we have implemented the token ring, the servers must have identical server lists for correct token propagation.
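
Both ramifications can be reproduced in a toy Python model. This is a sketch under stated assumptions and not the GshmServer implementation: the Id rule in next_id (take the lowest Id not yet seen) is hypothetical, as is the choice of which existing server hears which join message.

```python
def next_id(known_ids):
    """Hypothetical rule: a joiner takes the lowest Id it has not seen."""
    candidate = 0
    while candidate in known_ids:
        candidate += 1
    return candidate


def simulate_join(existing, joiners, heard_by):
    """Model servers joining at the same instant.

    existing: {host: id} for servers already on the network.
    joiners:  hosts that broadcast a join simultaneously.
    heard_by: {existing_host: joiner it heard, or None}.
    Returns (ids chosen by the joiners, server list held by each host).
    """
    # Every joiner sees the same existing Ids, so all pick the same one.
    chosen = {j: next_id(set(existing.values())) for j in joiners}
    views = {}
    for host in existing:
        view = dict(existing)
        heard = heard_by.get(host)
        if heard is not None:
            view[heard] = chosen[heard]
        views[host] = view
    return chosen, views


existing = {"tcs1": 0, "tcs2": 1}
chosen, views = simulate_join(
    existing,
    joiners=["obs1", "obs2"],
    heard_by={"tcs1": "obs1", "tcs2": "obs2"},  # partial collision
)
print(chosen)
print(views)
```

In the example, obs1 and obs2 both end up with Id 2, and tcs1 and tcs2 hold different server lists, so the token cannot be propagated consistently.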

Update 10 November 2010

The shared memory system was extensively re-worked in March 2007, December 2008 and July 2009. The 'join' collisions mentioned in the above update are avoided by always starting the network servers with a few seconds delay between starts. The other problems described here are no longer happening. However, network timeouts still occur which cause the shared memory network to re-initialize itself. This does not cause any operational problems, but it would be nice to uncover the root cause.

Update 13 November 2013

A couple of years ago the network timeouts were found to be caused by UDP token packets arriving in a different order on different servers. This only happens in clusters with three or more servers, and it can (but does not always!) produce a deadlock condition where machine a is waiting for machine b and machine b is waiting for machine c and machine c is waiting for machine b. When the timeout is detected, a restart is requested. Each server has a slightly longer timeout (by 5 msec) than the next lower-numbered one, so usually server 0 sees the timeout first and does a clean restart.
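
The deadlock and the staggered timeouts can be sketched in Python as follows. The wait-for relation is the one described above, and the 5 msec step comes from the text. The 100 msec base timeout and the function names are assumptions for illustration only.

```python
def find_cycle(waits_for):
    """Return the machines on a wait-for cycle, or [] if there is none.

    waits_for maps each machine to the single machine it is waiting on.
    """
    for start in sorted(waits_for):
        seen = []
        node = start
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return []


def timeout_ms(server_id, base_ms=100, step_ms=5):
    """Timeout for a server; base_ms is a hypothetical value."""
    return base_ms + step_ms * server_id


def first_to_time_out(server_ids, base_ms=100, step_ms=5):
    """The server whose timeout expires first if all stall together."""
    return min(server_ids, key=lambda s: timeout_ms(s, base_ms, step_ms))


# Wait-for relation from the text: a -> b, b -> c, c -> b.
waits_for = {"a": "b", "b": "c", "c": "b"}
print(find_cycle(waits_for))

print([timeout_ms(s) for s in (0, 1, 2)])
print(first_to_time_out([2, 0, 1]))
```

Staggering the timeouts means that, when all servers stall at the same moment, only the lowest-numbered one normally requests the restart, rather than several servers requesting it at once.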

Using gshmmonitor and gshmconfig

See Shared Memory Troubleshooting for details on how to use the gshmmonitor and gshmconfig tools.

-- Main.delapena - 16 Jul 2006
Topic revision: r13 - 22 Aug 2019, PetrKubanek