Identification and Fix for the Problem of RPC Commands Not Being Registered on the Servers and/or TO Station

The purpose of this document is to aid in identifying the RPC registration problem and to provide guidance on how to address it.

Michele De La Peña 11 September 2006

Update 10 November 2010

This problem has not happened for a long time, although we did not do anything particular to fix it.

Identification

Unfortunately, we do not have much information or documentation on this particular problem. Typically, the problem manifests itself when a user tries to issue a command from a GUI. The GUI remains active in that the user can click on command buttons, but the commands are not actually issued. From the user's standpoint the GUI appears to be frozen (sound familiar?). Since the ECSGUI is used nearly all the time, it is a good way to diagnose this problem. The ECSGUI prints the status from the command handle to stdout; this was originally a temporary debugging aid that was never removed from the code. When the RPCs are not registered, and you started the GUI manually in a terminal window, you will see

Got the cmdStatus2: INVALIDHANDLE.

If you check SYSLOG (/var/log/messages), you should see

Sep 11 12:24:29 lbtdu20 LBT_ECSGUI: No RPC server found on 10.144.0.105 at port 5001

and/or

Sep 11 12:30:22 lbtdu20 LBT: unknown rpc call for: ECS:ECS:SetNightTimeOperationOn:execute

In this example, ECS is running on server 105.
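The SYSLOG check can be sketched as a small shell function. This is only an illustration: the function name find_rpc_errors is hypothetical, and the two search strings are taken from the example messages above, so they may not cover every variant of the message. Reading /var/log/messages may require elevated privileges on some systems.

```shell
# find_rpc_errors: print SYSLOG lines that match either symptom quoted above.
# The function name is hypothetical; the patterns are copied from the two
# example messages in this document and are an assumption about their wording.
# Usage: find_rpc_errors [logfile]   (defaults to /var/log/messages)
find_rpc_errors() {
    log="${1:-/var/log/messages}"
    # grep exits with status 1 when no line matches, which here means
    # "no sign of the problem in this log".
    grep -E 'No RPC server found on|unknown rpc call for:' "$log"
}
```

Calling find_rpc_errors with no argument searches /var/log/messages; any output suggests that the RPC server on the address named in the message should be checked.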

The problem seems to be that when the network server was originally started on the server/TO Station, the RPC server was not able to register with the host. When the network server first starts up on a machine, you see something like the following in the SYSLOG

Sep 11 12:35:36 lbtdu20 LBT_Networkserver: adding rule for TEL to only run on lbtmu102
Sep 11 12:35:36 lbtdu20 LBT_Networkserver: Starting
Sep 11 12:35:36 lbtdu20 LBT_Networkserver: starting rpcserver on 10.144.0.105

However, if there are problems in the initialization stage of the RPC server, you may instead see a message in the SYSLOG regarding RPCs not being able to register. (Sorry I do not have an exact example.) Presumably, if you type

$ rpcconfig -l

on the server having problems, you will NOT see the seventeen registered functions per subsystem and the associated registered aliases. Under good conditions, for the ECS subsystem you would see

ADDRESS         #FUNCTIONS      #ALIASES
----------      ----------      ---------
10.144.0.105       17              1154
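A listing like the one above could be checked automatically. The sketch below assumes the output of rpcconfig -l always has the three-column layout shown here (address, function count, alias count); the function name check_rpc_listing is hypothetical, and the default of 17 functions comes from the text above.

```shell
# check_rpc_listing: read an `rpcconfig -l` style listing on stdin and flag
# any address whose function count differs from the expected value.
# Hypothetical helper; assumes the three-column layout shown in this document.
# Usage: rpcconfig -l | check_rpc_listing [expected_functions]
check_rpc_listing() {
    expected="${1:-17}"
    awk -v want="$expected" '
        # Only rows that begin with a dotted IPv4 address are data rows;
        # the header line and the dashed rule are skipped.
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
            rows++
            if ($2 != want) { printf "BAD %s functions=%s\n", $1, $2; bad++ }
            else            { printf "OK %s functions=%s aliases=%s\n", $1, $2, $3 }
        }
        END {
            # An empty listing means nothing is registered at all.
            if (rows == 0) { print "BAD no registered addresses listed"; exit 1 }
            exit (bad > 0)
        }
    '
}
```

The function returns a non-zero status when no address is listed or when any row has an unexpected function count, so it can be used in a conditional.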

We know that when the network server starts up, there is significant load on the servers and/or network. If network servers are started on all of the TCS servers and the TO Station in rapid sequence, the load, or some other factor we have not identified, seems to interfere with the proper functioning of the RPC server in particular. The bottom line is that the RPC server is not properly set up on the particular TCS machine. The problem lurks quietly until a subsystem running on that server attempts to issue commands.

Fix

It may be adequate to stop and then restart the RPC server on the TCS servers having difficulty. You will also have to stop the subsystems on the various problem TCS servers, and then restart the subsystems when the TCS servers are fixed.

On each problem TCS Server...
$ rpcconfig stop
$ rpcconfig start 10.144.0.nnn
where nnn is the server number (e.g., 105 for lbtmu105)

From the TO Station...
$ netconfig stop all

Once all of the subsystems (or just selected subsystems) have been stopped, you can restart each subsystem via the buttons on the TO Station.
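When several servers need the per-server step, it can help to print the exact commands for each one before typing them. The sketch below only echoes the two rpcconfig commands given above and does not run them; the function name rpc_restart_cmds and the 1 to 254 range check are assumptions added for illustration.

```shell
# rpc_restart_cmds: print (do not run) the RPC server restart commands for
# one TCS server, given its server number nnn.
# Hypothetical helper; the 10.144.0.nnn pattern comes from this document,
# the 1-254 range check is an assumption about valid host numbers.
# Usage: rpc_restart_cmds 105
rpc_restart_cmds() {
    nnn="$1"
    case "$nnn" in
        # Reject empty or non-numeric input.
        ''|*[!0-9]*)
            echo "usage: rpc_restart_cmds <server number>" >&2
            return 2 ;;
    esac
    if [ "$nnn" -lt 1 ] || [ "$nnn" -gt 254 ]; then
        echo "server number out of range: $nnn" >&2
        return 2
    fi
    echo "rpcconfig stop"
    echo "rpcconfig start 10.144.0.$nnn"
}
```

For example, rpc_restart_cmds 105 prints the stop command followed by the start command for 10.144.0.105, which can then be run by hand on lbtmu105.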

-- MicheleDeLaPena - 11 Sep 2006
Topic revision: r4 - 10 Nov 2010, ChrisBiddick