Possible Causes of Shell RIP/REST

  • Hardware
    • Overcurrent (crate overcurrent or piston overcurrent, see 641a006d Sec.14.1)
    • Power Supply Failure
  • Firmware
    • Maximum distance
    • Watchdogs
      • DSP (see 641a006d Sec.17.5.3)
      • Fiber-link (see 641a006d Sec.17.5.4), Ethernet on BCUs (see 641a006d Sec.17.5.1), both
      • Ethernet on Moxa (see 641a006d Sec.17.5.1)
  • Software
    • Housekeeping process (see 641a006d Sec.17.2 for a list of housekeeping variables with descriptions):
      • temperatures, power supply currents and voltages outside the alarm thresholds (see $ADOPT_SOURCE/conf/adsec/672x/processConf/housekeeper/housekeeper.param)
      • housekeeping process stopped
    • Fastdiagnostic process (see 641a006d Sec.17.3.5 for a list of fast diagnostic variables with descriptions):
      • positions, currents, timeout on receiving diagnostics (see $ADOPT_SOURCE/conf/adsec/672x/processConf/fastdiagn/fastdiagn.param)
      • fastdiagn process stopped
    • AdSec arbitrator (missing or too low elevation; missing swing arm status or swing arm not deployed)

When one of the above happens, look for the cause of the error according to the following scheme (the log files are described below):

| Cause | There is a fault if | Where to look for it | What string to look for | Notes |
| Overcurrent | not assessable | | | |
| Power Supply Failure | ADAMTCSPOWERFAULTN[0,1,2] == 0 | AdamHouseKeeper Telemetry | ADAMTCSPOWERFAULTN | |
| Maximum distance | dspposcheckcnt > 15 | DumpSlow | | Hard coded in "proc_startup.pro" |
| Fiber link | indirectly, by checking that global_counter is constant | DumpSlow | | |
| Ethernet on BCUs | indirectly, by checking that global_counter is constant | DumpSlow | | |
| Watchdogs on DSP | dspdriverstatusdsp0watchdogexpired == 1 | DumpSlow | | |
| Ethernet on Moxa | adamwatchdogexp == 1 | DumpSlow | | |
| HouseKeeper | currents and voltages | HouseKeeperLog | FUNCTEMERGENCYST | |
| FastDiagnostic | CHCURRAVERAGE, CHDISTAVERAGE | FastDiagnosticLog | FUNCTEMERGENCYST | |
| AdSecArbitrator | Elevation missing or out of bounds | AdSecArbitratorLog | "Elevation not available" or "Elevation (XX) deg below" | |
| AdSecArbitrator | Swing arm info missing or not deployed | AdSecArbitratorLog | "Swing arm not deployed!" | |
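
The numeric conditions in the scheme above can be sketched as a small check. This is only an illustration in Python, assuming the diagnostic values have already been loaded into a plain dictionary keyed by the names used in the scheme. How they are read from DumpSlow or from the telemetry is not shown. The function name find_fault_causes, the scalar form of dspposcheckcnt and the list of successive global_counter samples are assumptions made for the example.

```python
def find_fault_causes(diag):
    """Return the causes from the scheme whose condition holds.

    `diag` is a hypothetical dict of already-loaded diagnostic values;
    keys that are missing are simply skipped.
    """
    causes = []

    # Power supply failure: any of ADAMTCSPOWERFAULTN[0,1,2] equal to 0.
    power = diag.get("ADAMTCSPOWERFAULTN")
    if power is not None and any(v == 0 for v in power[:3]):
        causes.append("Power Supply Failure")

    # Maximum distance: dspposcheckcnt > 15 (assumed to be a scalar here).
    if diag.get("dspposcheckcnt", 0) > 15:
        causes.append("Maximum distance")

    # Fiber link / Ethernet on BCUs: indirect check, the counter is constant.
    # Assumes `global_counter` holds at least two successive samples.
    counter = diag.get("global_counter")
    if counter is not None and len(counter) > 1 and len(set(counter)) == 1:
        causes.append("Fiber link or Ethernet on BCUs")

    # Watchdog flags: a value of 1 means the watchdog has expired.
    if diag.get("dspdriverstatusdsp0watchdogexpired") == 1:
        causes.append("Watchdog on DSP")
    if diag.get("adamwatchdogexp") == 1:
        causes.append("Ethernet on Moxa")

    return causes
```

The two indirect checks share one entry because the scheme gives the same test (a constant global_counter) for both, so they cannot be told apart from this value alone.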

Process Log Analysis

Log Files

The list of the processes running on the adsecdx/adsecsx workstation is reported here.

Log files are located in $ADOPT_LOG on the AOeng@adsec(sx) or AOeng@adsec(dx) workstation.

The most relevant log and dump files are:

| Short name | Description | Full name |
| DumpFast | Dump of masterdiagnostic (circular buffer at 1kHz) | $ADOPT_LOG/adaptive-secondary_XXXXX.log |
| DumpSlow | Dump of slow diagnostics (ADAM & Housekeeper) | $ADOPT_LOG/adaptive-secondary-diagnvar_XXXXXX.sav |
| HousekeeperLog | Housekeeper log file | $ADOPT_LOG/housekeeper.R.[XXXXXX].log |
| AdamHousekeeperLog | ADAM Housekeeper log file | $ADOPT_LOG/adamhousekeeper.R.[XXXXXX].log |
| FastdiagnosticLog | Fastdiagnostic log file | $ADOPT_LOG/fastdiagn.R.[XXXXXX].log |
| AdSecArbLog | AdSec arbitrator log files | $ADOPT_LOG/adsecarb.R.[XXXXXX].log |
| AdSecArbDigest | AdSec arbitrator digest | AdSecArbLog between -- PROCESS DUMP START -- and -- PROCESS DUMP STOP -- |
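
Since the AdSecArbDigest is just the part of the AdSecArbLog between the two markers, it can be pulled out with a few lines of Python. The sketch below assumes that each marker appears somewhere within a log line and that a log may contain several dumps. The function name extract_digests is made up for this example.

```python
START_MARKER = "-- PROCESS DUMP START --"
STOP_MARKER = "-- PROCESS DUMP STOP --"


def extract_digests(lines):
    """Return a list of digests, each a list of the lines between the markers.

    A dump that is started but never stopped is not returned.
    """
    digests = []
    current = None  # None while we are outside a dump block
    for line in lines:
        if START_MARKER in line:
            current = []  # open a new block (marker line itself is dropped)
        elif STOP_MARKER in line:
            if current is not None:
                digests.append(current)
            current = None
        elif current is not None:
            current.append(line.rstrip("\n"))
    return digests
```

The function accepts any iterable of lines, so an open file object can be passed directly, for example `extract_digests(open(path))` with `path` pointing to an AdSecArbLog file.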

Common debug procedure

  • Note the timestamp of the failure
  • Analyze the processes in the order listed above, looking for "|ERR|", "|FAT|" or "die" strings, and refine the timestamp. The "|WAR|" strings can help to understand what happened before the failure.
    • NOTE: the idlctrl process may be inaccurate in its timestamps
    • NOTE: an ERR regarding a TcpXXXXException is not relevant if the connection is established a few milliseconds later. Example below.
      adsec.L |ERR| 444559|2013-01-24 00:16:46.603890| ADAM-MODBUS > [TcpReceiveException] Error receiving data (errno=0 ), socket closed. (code -15003) TcpConnection receive error File TcpConnection.cpp line 130
      adsec.L |INF| 444560|2013-01-24 00:16:46.603930| MAIN > adamModbusReadOutput_wrap found TCP connection not initialized ...
      adsec.L |INF| 444561|2013-01-24 00:16:46.604012| TCPCONNECTION > Creating TcpConnection to 192.168.13.150:502
      adsec.L |INF| 444562|2013-01-24 00:16:46.608993| TCPCONNECTION > TcpConnection succesfully created!

  • In case of multiple failures within the same time span, look for the first occurrence
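
The procedure above can be sketched in Python as follows. The sketch assumes the line layout seen in the log excerpts on this page (process name, level between pipes, counter, timestamp, component, message). The function names and the 100 ms reconnection window are assumptions for illustration; the text only says "a few milliseconds".

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the excerpts: proc |LVL| counter|timestamp| COMPONENT > message
LINE_RE = re.compile(
    r"^(?P<proc>\S+)\s*\|(?P<level>[A-Z]{3})\|\s*(?P<count>\d+)\|"
    r"(?P<ts>[^|]+)\|\s*(?P<comp>\S+)\s*>\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["time"] = datetime.strptime(entry["ts"].strip(), "%Y-%m-%d %H:%M:%S.%f")
    return entry


def is_transient_tcp_error(entry, later_entries, window):
    """True for a TcpXXXXException followed shortly by a successful reconnection."""
    if not re.search(r"\[Tcp\w*Exception\]", entry["msg"]):
        return False
    for other in later_entries:  # must already be sorted by time
        if other["time"] - entry["time"] > window:
            break
        # "succesfully" is spelled as it appears in the log messages.
        if other["proc"] == entry["proc"] and "TcpConnection succesfully created" in other["msg"]:
            return True
    return False


def first_failure(lines, window=timedelta(milliseconds=100)):
    """Return the earliest ERR/FAT/"die" entry that is not a transient TCP error."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    entries.sort(key=lambda e: e["time"])
    for i, entry in enumerate(entries):
        severe = entry["level"] in ("ERR", "FAT") or "die" in entry["msg"]
        if severe and not is_transient_tcp_error(entry, entries[i + 1:], window):
            return entry
    return None
```

Lines from several log files can be concatenated before the call, since the entries are sorted by timestamp. Keep in mind the note above about the idlctrl timestamps when doing so.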

Common error sources derived from the analysis of the log files

Common to all processes

  • Stop signal given by the adsc_stop script (the housekeeper case is shown here)
housekeeper.L      |FAT|  45925847|2013-01-24 00:18:59.477209|             MAIN > housekeeper.L terminated by stop

Housekeeper

  • Error/Timeout on reading
housekeeper.L      |ERR|  45925846|2013-01-24 00:18:58.256410|             MAIN > Error reading BCUs:  [CommandSenderTimeoutException] Timeout expired [HouseKeeper.cpp:2630]

This is not really an error. It means that the Housekeeper is not able to retrieve slow diagnostics from the unit. If the unit is switched off or is starting up, this log entry is expected. If it happens when the unit is powered on and working, ping all the BCUs and the MOXA (see AsmFaq).
    • If nothing is up, ask whether there has been a power loss.
    • If only the MOXA is up, ask whether there has been a three-phase power loss. Then check the logs of the other processes (in particular Housekeeper and Fastdiagnostic) for a main power action:
housekeeper.L      |INF|  45925853|2013-01-24 00:19:01.262270|      ADAM-MODBUS > AdamModbus: disabling main power...
    • If the MOXA is up AND NOT ALL the BCUs are up, it means that those BCUs are not responding. Try stopping and restarting all the software. Check the Ethernet connection. Try replacing the BCU with a spare.
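
The three cases above amount to a small decision table, sketched here in Python. The inputs are assumed to come from pinging each device by hand. The function name diagnose_reachability and the BCU names used as dictionary keys are hypothetical, and the last two return values cover situations that the list above does not mention.

```python
def diagnose_reachability(moxa_up, bcus_up):
    """Map ping results to the advice given in the list above.

    `moxa_up` is a bool; `bcus_up` is a dict {bcu_name: bool}.
    """
    any_bcu_up = any(bcus_up.values())
    down = sorted(name for name, up in bcus_up.items() if not up)

    if not moxa_up and not any_bcu_up:
        return "Nothing is up: ask about a power loss."
    if moxa_up and not any_bcu_up:
        return ("Only the MOXA is up: ask about a three-phase power loss, "
                "then check the other process logs for a main power action.")
    if moxa_up and down:
        return ("BCUs not responding: " + ", ".join(down) + ". Restart the software, "
                "check the Ethernet connection, try a spare BCU.")
    if moxa_up:
        # Not covered by the list above: everything answers the ping.
        return "MOXA and all BCUs respond: the network path looks fine."
    # Not covered by the list above: MOXA down while some BCUs answer.
    return "MOXA is down but some BCUs respond: case not covered above."
```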

  • Warning on Flux meter
housekeeper.L      |WAR|  45898149|2013-01-23 19:56:02.950378|     FUNCTWARNING > FunctWarning FLUXRATEIN-0000 0.117108

It means that the flow rate of the cooling liquid is very low. When the system starts up the readings are not reliable and these warnings are common. When the system is working, it probably means that the cooling pump is not working properly.

  • Emergency stop for over temperature / parameter out of thresholds:

Many sensors monitor the ASM. If any of them goes out of its thresholds, the system is automatically put in the REST state:
housekeeper.L      |ERR|  45919187|2013-01-23 20:40:33.939430| FUNCTEMERGENCYST > BCUSTRATIXTEMP-0003 = 55.125  [Funct.cpp:94]
housekeeper.L      |INF|  45919188|2013-01-23 20:40:33.939513|      ADAM-MODBUS > AdamModbus: disabling coils...
housekeeper.L      |INF|  45919189|2013-01-23 20:40:33.939526|      ADAM-MODBUS > AdamModbus: SetSingleCoil (coil 3 = 0)
housekeeper.L      |INF|  45919190|2013-01-23 20:40:33.939650|    TCPCONNECTION > Creating TcpConnection to 192.168.13.150:502
housekeeper.L      |INF|  45919191|2013-01-23 20:40:33.943841|    TCPCONNECTION > TcpConnection succesfully created!
housekeeper.L      |INF|  45919192|2013-01-23 20:40:33.943871|      ADAM-MODBUS > Created TCP connection with timeout 1000 ms

If the problem is serious (like over temperature of some components), the unit is also powered down.
housekeeper.L      |INF|  45919193|2013-01-23 20:40:33.958777|      ADAM-MODBUS > AdamModbus: disabling main power...
housekeeper.L      |INF|  45919194|2013-01-23 20:40:33.958802|      ADAM-MODBUS > AdamModbus: SetSingleCoil (coil 7 = 0)

An Alert is sent to the AdsecArbitrator and the telemetry is forced to update.
housekeeper.L      |INF|  45919195|2013-01-23 20:40:33.976846| FUNCTEMERGENCYST > [FunctEmergencyStop] try to send Alert to AdSec Arbitrator  [Funct.cpp:123]
housekeeper.L      |INF|  45919196|2013-01-23 20:40:33.977016| FUNCTEMERGENCYST > [FunctEmergencyStop] Alert to AdSec Arbitrator succesfully sent! [Funct.cpp:129]      
housekeeper.L      |INF|  45919197|2013-01-23 20:40:34.996781|             MAIN > forcing telemetry output due to dump signal [HouseKeeper.cpp]
housekeeper.L      |INF|  45919239|2013-01-23 20:40:35.376559|             MAIN > forcing telemetry output due to dump signal [HouseKeeper.cpp]
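
To find which variable triggered an emergency stop, the FUNCTEMERGENCYST lines can be filtered and split into name, index and value. The Python sketch below assumes the "NAME-NNNN = value" layout seen in the excerpt above. Reading the numeric suffix as an index is an assumption, and the function name emergency_triggers is made up for the example.

```python
import re

# Matches e.g. "FUNCTEMERGENCYST > BCUSTRATIXTEMP-0003 = 55.125"
EMERGENCY_RE = re.compile(
    r"FUNCTEMERGENCYST\s*>\s*(?P<name>[A-Z0-9_]+)-(?P<index>\d+)\s*=\s*"
    r"(?P<value>-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def emergency_triggers(lines):
    """Yield (variable, index, value) for each emergency-stop trigger line.

    Other FUNCTEMERGENCYST lines (e.g. the Alert messages) are skipped
    because they do not have the "NAME-NNNN = value" form.
    """
    for line in lines:
        m = EMERGENCY_RE.search(line)
        if m:
            yield m.group("name"), int(m.group("index")), float(m.group("value"))
```

Because the pattern is searched anywhere in the line, it also works on the output of a grep over several compressed log files, where each line is prefixed by the file name.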

  • Alarm on BCU Voltages

It may happen that the system suddenly RIPs into a safe state and that the Housekeeper process reports a sudden voltage drop. Some examples:
housekeeper.L.00001328611618.log.gz:housekeeper.L      |ERR|     36030|2012-02-06 05:28:48.852961| FUNCTEMERGENCYST > BCUVOLTAGEVCCP-0000 = 8.15347  [Funct.cpp:94]
housekeeper.L.00001330377737.log.gz:housekeeper.L      |ERR|    174187|2012-02-27 06:46:03.109906| FUNCTEMERGENCYST > BCUVOLTAGEVCCP-0000 = -0.0372838  [Funct.cpp:94]
housekeeper.L.00001331513151.log.gz:housekeeper.L      |ERR|   1017776|2012-03-11 04:36:21.643147| FUNCTEMERGENCYST > BCUVOLTAGEVCCP-0004 = 9.25321  [Funct.cpp:94]
housekeeper.L.00001331513151.log.gz:housekeeper.L      |ERR|   1119122|2012-03-11 08:38:19.761712| FUNCTEMERGENCYST > BCUVOLTAGEVCCP-0000 = 7.23682  [Funct.cpp:94]
housekeeper.L.00001331838975.log.gz:housekeeper.L      |ERR|   1235374|2012-03-15 17:55:09.744432| FUNCTEMERGENCYST > BCUVOLTAGEVSSP-0003 = -8.49396  [Funct.cpp:94]

If this condition seems to be permanent after a reset, restart everything and check again. If the problem is still there, there is probably a hardware failure.

If the event occurred during an AO loop, the source could be a loop instability that caused a piston force too large to be applied.

If the event occurred during normal seeing-limited operations, the source could be a strong wind gust.

-- MarcoXompero - 31 Jan 2013
Topic revision: r11 - 22 May 2014, LorenzoBusoni