This is a discussion of problems on the mountain on the evening of Wednesday,
20-Jun-2007, and in the early morning of 22-Jun-2007. This has been submitted as
IssueTrak 708, but we want it to be viewable by all.

Chris and Michele

-------------------------------------------------------------------

This is an account of the events on the evening of June 20:

Chris and Michele were on-line with the mountain and with each other during
Wednesday evening. Follow-up conversations were held the next day with Tony and Tom.

The new build (0706201437) was installed by 17:12. The network servers appeared
to start OK. However, the subsystems would not run; they died immediately upon
starting. This included the CSQ, SDB, ECS, PMCL, PSFL, MCS, LSS, and TEL. This
continued until about 18:37. No core files were produced.

Michele was called around 18:40, and she also had a problem starting the
subsystems. After checking some of the servers and not finding anything that
looked wrong, Michele resorted to the "Full shutdown and restart" scenario. The
ECS was then started, and some commands were executed to verify that the TCS was
working in a nominal manner.

The MCSPU was not running around 18:54. Michele talked to Tom, who got it running
around 19:29. The MCS was reporting errors:

  Jun 20 19:29:00 lbtmu103 LBT_MCS: parseXMLBuf error in exerces RPCERROR
  Jun 20 19:29:00 lbtmu103 LBT_MCS: mcspu is returning empty AZDrive reports!
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument absStat not found
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument enc0stat not found
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument enc1stat not found
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument enc2stat not found
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument enc3stat not found
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument servoPosError not found
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument positionOffset not found
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument enc0position not found
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument enc1position not found
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument enc2position not found
  Jun 20 19:29:00 lbtmu103 LBT_MCS: Argument enc3position not found

The MCS died several times during this period.

The PCS was started at 19:37.

The CSQ was started at 19:53, and we began to get "Failed block" messages from it:

  Jun 20 20:21:30 lbtmu103 LBT_CSQ: CSQ:Authorize executed
  Jun 20 20:21:30 lbtmu103 LBT_PCS: PCS:TELESCOPE:SetNewInstrument executed
  Jun 20 20:21:30 lbtmu103 LBT_PCS: Change Pointing Model LEFT: /usr/commonsw/Configuration/PCS/LBCLEFTpointingmodel.dat
  Jun 20 20:21:30 lbtmu103 LBT_PCS: Pointing Model Changed: /usr/commonsw/Configuration/PCS/LBCLEFTpointingmodel.dat
  Jun 20 20:21:30 lbtmu103 LBT_PCS: PCS:TELESCOPE:SetNewInstrument execution time: 72 ms at 1182396090126
  Jun 20 20:21:30 lbtmu103 LBT_CSQ: Command called on inactive subsystem PSFR
  Jun 20 20:21:30 lbtmu103 LBT_CSQ: CSQ:Authorize execution time: 392 ms at 1182396090434
  Jun 20 20:21:31 lbtmu103 LBT_CSQ: Failed block with state: FAILED
  Jun 20 20:22:02 lbtmu103 last message repeated 30 times
  Jun 20 20:23:03 lbtmu103 last message repeated 60 times
  Jun 20 20:24:04 lbtmu103 last message repeated 62 times
  Jun 20 20:25:06 lbtmu103 last message repeated 62 times
  Jun 20 20:25:36 lbtmu103 last message repeated 31 times

The "Failed block with state: FAILED" means that a block() was called on a
command handle after the command had finished with a failure status. Chris
thinks the failed command was a getParameter from the LBC for DD variables that
no longer existed (we do not make standard SysLog entries for getParameter
commands). That should not have produced the "Failed block" message, however. So
something is not quite right with the timing of the requests through the CSQ:
the block() requests were coming in late for some reason.
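
To make the semantics concrete, here is a minimal C++ sketch of a late block()
on an already-failed command handle. This is purely illustrative; the class and
method names below are ours, not the actual CSQ implementation.

  #include <condition_variable>
  #include <iostream>
  #include <mutex>

  enum class CmdState { PENDING, SUCCEEDED, FAILED };

  // Hypothetical command handle; the names and structure are our own.
  class CommandHandle {
      std::mutex m_;
      std::condition_variable cv_;
      CmdState state_ = CmdState::PENDING;
  public:
      // Called by the command executor when the command completes.
      void finish(CmdState s) {
          { std::lock_guard<std::mutex> lk(m_); state_ = s; }
          cv_.notify_all();
      }
      // Called by the client. If the command already finished with a
      // failure by the time block() runs, the caller sees the same
      // message that appeared in the SYSLOG above.
      CmdState block() {
          std::unique_lock<std::mutex> lk(m_);
          cv_.wait(lk, [this] { return state_ != CmdState::PENDING; });
          if (state_ == CmdState::FAILED)
              std::cerr << "Failed block with state: FAILED\n";
          return state_;
      }
  };

  int main() {
      CommandHandle h;
      h.finish(CmdState::FAILED);  // command fails first...
      h.block();                   // ...then the late block() arrives
  }

In this model a late block() is harmless in itself; the suspicious part is why
the block() requests arrived after the commands had already failed.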

Other IIF/CSQ commands were executing normally, including Presets (at 20:52 -
20:53 for example).

The CSQ was stopped and restarted at 21:00, after which we got errors like:

  Jun 20 21:00:47 lbtmu103 LBT_CSQ: Invalid handle to block: RPCERROR
  Jun 20 21:00:47 lbtmu103 LBT_CSQ: free unknown handle RPCERROR

but IIF/CSQ commands were still working.

Then, at 21:04:45, a Preset was issued that timed out at 21:09:45. According to
Tom, this was because the tracking polynomials from the PCS stopped.

We were still getting "Failed block" messages and errors like:

  Jun 20 21:23:58 lbtmu103 LBT_CSQ: Invalid handle to block: RPCERROR
  Jun 20 21:23:58 lbtmu103 LBT_CSQ: Invalid handle to block: RPCERROR
  Jun 20 21:23:58 lbtmu103 LBT_CSQ: free unknown handle RPCERROR
  Jun 20 21:25:44 lbtmu103 LBT_CSQ: TCPSocket::recv error: Connection reset by peer
  Jun 20 21:25:44 lbtmu103 LBT_CSQ: index out of range:

The "index out of range:" message means a block() was called, and an invalid
block number (should be 0 - 9) was requested. The block number allows multiple
commands with blocks() to be pending at the same time.
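
A minimal sketch of the kind of bounds check that would emit this message
follows; the names are hypothetical, and only the 0 - 9 range comes from the
report.

  #include <iostream>

  constexpr int kMaxPending = 10;           // valid block numbers are 0..9
  bool pendingSlot[kMaxPending] = {false};  // one slot per in-flight command

  // Hypothetical validation: reject any block number outside 0..9 with the
  // same message seen in the SYSLOG.
  bool blockOn(int blockNum) {
      if (blockNum < 0 || blockNum >= kMaxPending) {
          std::cerr << "index out of range:\n";
          return false;
      }
      return pendingSlot[blockNum];  // real code would wait on the command here
  }

  int main() {
      blockOn(12);  // prints "index out of range:"
  }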

Starting at 21:27:17, all error messages from the CSQ stopped and the TCS
appeared to be working normally.

The LSS was killed (it refused to stop) at 22:06, and several more times after
that. It was hanging while trying to read the list of callback clients, which
means it never registered its RPC functions, so event logging was not working
at all. See IssueTrak 709.
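
The causal chain can be summarized in a trivial sketch; the function names are
stand-ins for the real LSS startup steps, which we are not reproducing here.

  #include <iostream>

  // Hypothetical stand-ins for the real LSS startup steps.
  void readCallbackClientTable() {
      // On June 20 this read hung indefinitely (see IssueTrak 709)...
  }
  void registerRpcFunctions() {
      std::cout << "RPC functions registered\n";
  }

  int main() {
      readCallbackClientTable();  // blocking step runs first...
      registerRpcFunctions();     // ...so registration, and with it all
                                  // event logging, never happened
  }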

We noticed that eth1 was down (no IPv4 address) on lbtmu103, where the CSQ and
PCS were running, so at 22:35 we stopped the PCS and CSQ, stopped the network
servers on lbtmu103 (22:37), and restarted the PCS and CSQ (they went to
lbtmu104). Everything was working normally except the LSS. In retrospect, we
did not need to do this, since all CSQ errors had stopped by 21:27:17.

On the morning of June 21, eth1 was started by Jim. Chris stopped the CSQ,
restarted it on lbtmu103, and ran simple IIF/CSQ tests from lbccontrol; they
worked OK. Jim then stopped eth1, and the tests still worked OK.

The LSS still would not start; it was hanging while reading the client table.
However, around 13:30 on June 21 it began working. We have no idea why.

-------------------------------------------------------------------

22 June 2007

This analysis of the mountain servers was done on 22 June 2007, based upon the
mountain problems reported that morning in IssueTrak 701.
Additional information was collected on 24 June 2007.
TCS Build: 0706201437

Chris and Michele spent some time tracking down information; this included
conversations with Tom and Paul.

The main report was that at ~4:00 am the MCS and PCS stopped working: the MCS
and PCS GUIs were essentially frozen, and the PCS was not issuing polynomials
to the MCSPU. Based upon past experience, the observers killed the LSS, and
this cured the "frozen" problem (see IssueTrak 709). However, there were still
issues with the MCS dying and other TCS strangeness.

In checking the SYSLOG for these subsystems at this time, the following items
were noticed:

1) Jun 22 03:59:32 lbtmu101 LBT_Networkserver: multiple instances of PMCR found

In fact, there were two instances of PMCR running, on servers lbtmu103 and
lbtmu104. Using "rpcconfig -a" showed the PMCR commands registered on both
machines; "netconfig ps" showed the subsystem actually running on each of the
machines.

This was a problem very early in the existence of the TCS, but it has not been
seen in years (literally). Speculation: if network communication is too slow or
only intermittently available, a server may fail to respond that it is already
running subsystem X. Then, if a new request to run subsystem X arrives, the
subsystem could start up on two different servers. It is NOT good to run
multiple instances of a subsystem.
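
The suspected race can be sketched as follows. This is purely illustrative:
the query function, reply modeling, and server list are made up; the point is
only that a missing reply gets treated the same as "not running".

  #include <iostream>
  #include <optional>
  #include <string>
  #include <vector>

  // Hypothetical query: nullopt models "no reply before the timeout", as
  // might happen when the network is slow or intermittently available.
  std::optional<bool> isRunningOn(const std::string& server,
                                  const std::string& subsys) {
      if (server == "lbtmu103") return std::nullopt;  // reply lost or late
      return false;                                   // genuinely not running
  }

  bool shouldStart(const std::vector<std::string>& servers,
                   const std::string& subsys) {
      for (const auto& s : servers) {
          auto reply = isRunningOn(s, subsys);
          // The suspected bug: a missing reply is treated the same as
          // "not running", so a second instance can be started elsewhere.
          if (reply.has_value() && *reply) return false;
      }
      return true;
  }

  int main() {
      std::vector<std::string> servers{"lbtmu103", "lbtmu104", "lbtmu105"};
      if (shouldStart(servers, "PMCR"))
          std::cout << "starting PMCR (a duplicate, if lbtmu103 already runs it)\n";
  }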

*** On Sunday (24 June), Michele was checking the mountain systems and noticed
there were two instances of PSFR running. After talking with John, she stopped
both instances with a single "netconfig stop PSFR" and made sure they both
stopped. She then restarted PSFR at John's request. *** OLD PROBLEM *** With a
single "netconfig start PSFR" request, she got PSFR running on two servers. She
stopped them both (again) and successfully restarted PSFR with "netconfig
start PSFR on lbtmu105".

2) The SYSLOG on LBTMU101 had many reports like the following:

  Jun 22 03:59:32 lbtmu101 LBT_Networkserver: adding CSQ 1
  Jun 22 03:59:32 lbtmu101 LBT_Networkserver: adding LSS 1
  Jun 22 03:59:32 lbtmu101 LBT_Networkserver: adding PMCR 1

These statements would be repeated many times for all of the subsystems. The
subsystems mentioned in the messages were, in fact, already running on other
servers.  It seemed as though LBTMU101 kept "forgetting" they were running.

*** On Sunday (24 June) the "adding" messages on LBTMU101 were still being
logged.  None of the other servers log these messages.  All the network
interfaces on all the servers are up and running; no errors are reported.

3) Looking at the SYSLOG revealed that eth3 (10.144.0) on LBTMU101 seriously
ramped up the frequency of its DHCP requests beginning around 3:50 am. Before
this time, DHCP requests were spaced at least minutes apart. After this time,
one can see:

  Jun 22 03:57:20 lbtmu101 dhclient: DHCPREQUEST on eth3 to 10.144.0.110 port 67
  Jun 22 03:57:47 lbtmu101 dhclient: DHCPREQUEST on eth3 to 10.144.0.110 port 67
  Jun 22 03:57:59 lbtmu101 dhclient: DHCPREQUEST on eth3 to 10.144.0.110 port 67
  Jun 22 03:58:21 lbtmu101 dhclient: DHCPREQUEST on eth3 to 10.144.0.110 port 67
  Jun 22 03:59:01 lbtmu101 dhclient: DHCPREQUEST on eth3 to 10.144.0.110 port 67

It should be noted that the timing/frequency of the requests varies a lot during
the course of a day.

4) The eth1 (10.20.0) interface on LBTMU101 was not present. The previous day,
it was found that eth1 (10.20.0) on LBTMU103 had disappeared. Before issuing an
IssueTrak to fix LBTMU103, the programmer checked all the interfaces on all the
mountain servers, and they were running. While we do not know when eth1
disappeared from LBTMU103, eth1 disappeared from LBTMU101 within the last day.

We should note that the TCS does *NOT* use 10.20.0 for its services, but one
does wonder what is causing the interfaces to disappear, and whether this same
trigger is causing problems for the TCS.

5) We see this problem associated with the MCS:

  Jun 22 14:58:09 lbtmu103 LBT_MCS: TCPSocket::recv error: Connection reset by peer
  Jun 22 14:58:10 lbtmu103 LBT_MCS: parseXMLBuf error in exerces RPCERROR

The programmer was able to search the SYSLOGs dating back to May 27, and this
error is associated only with the MCS. In addition, it has been seen only on 19,
20, and 22 June 2007 (as of this writing), with roughly 20 instances in total.
(Note that 19-Jun-2007 was before the new build was installed.)

6) We see this problem

  Jun 22 11:39:22 lbtmu103 LBT_CSQ: TCPSocket::recv error: Connection reset by peer

associated with every subsystem. It dates back to the oldest SYSLOG we
currently have available (May 20), so it is believed to be a long-standing
issue. On occasion, we also see

  Jun  8 16:15:29 lbtmu101 LBT: TCPSocket::send error: Broken pipe

These messages appear tagged with plain LBT (as in the entry above) as well as
with any of the subsystems.
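
For background, "Connection reset by peer" is errno ECONNRESET from recv(),
and "Broken pipe" is errno EPIPE from send(). A minimal sketch of how a
wrapper like TCPSocket might log them follows; the wrapper below is our own
illustration, not the actual TCS class.

  #include <cerrno>
  #include <cstring>
  #include <iostream>
  #include <sys/socket.h>
  #include <sys/types.h>

  // Illustrative wrappers only; the real TCPSocket class is not shown here.
  ssize_t recvOrLog(int fd, void* buf, size_t len) {
      ssize_t n = ::recv(fd, buf, len, 0);
      if (n < 0)  // ECONNRESET -> "Connection reset by peer"
          std::cerr << "TCPSocket::recv error: " << std::strerror(errno) << '\n';
      return n;
  }

  ssize_t sendOrLog(int fd, const void* buf, size_t len) {
      // MSG_NOSIGNAL (Linux) turns the fatal SIGPIPE into an EPIPE
      // ("Broken pipe") return value, which can then be logged.
      ssize_t n = ::send(fd, buf, len, MSG_NOSIGNAL);
      if (n < 0)
          std::cerr << "TCPSocket::send error: " << std::strerror(errno) << '\n';
      return n;
  }

Both errors indicate the remote end dropped the connection: the reset is seen
when reading from, and the broken pipe when writing to, a peer that has gone
away.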

7) On occasion, the CSQ reports:

  Jun 20 21:25:44 lbtmu103 LBT_CSQ: index out of range:

This message dates back to the oldest SYSLOG we currently have available (May
20), so it is believed to be a long-standing issue.

-- MicheleDeLaPena - 25 Jun 2007