Possible Causes of Shell RIP/REST
- Hardware
- Overcurrent (crate overcurrent or piston overcurrent, see 641a006d Sec.14.1)
- Power Supply Failure
- Firmware
- Maximum distance
- Watchdogs
- DSP (see 641a006d Sec.17.5.3)
- Fiber-link (see 641a006d Sec.17.5.4), Ethernet on BCUs (see 641a006d Sec.17.5.1), or both
- Ethernet on Moxa (see 641a006d Sec.17.5.1)
- Software
- Housekeeping process (see 641a006d Sec.17.2 for a list of housekeeping variables with descriptions):
- temperatures, power supply currents and voltages outside the alarm thresholds (see $ADOPT_SOURCE/conf/adsec/672x/processConf/housekeeper/housekeeper.param)
- housekeeping process stopped
- Fastdiagnostic process (see 641a006d Sec.17.3.5 for a list of housekeeping variables with descriptions):
- position, currents, timeout on receiving diagnostic (see $ADOPT_SOURCE/conf/adsec/672x/processConf/fastdiagn/fastdiagn.param)
- fastdiagn process stopped
- AdSec arbitrator (missing or too low elevation; missing swing arm status or swing arm not deployed)
When one of the above happens, look for the cause of the error according to the following scheme (the log files are described below):
Process Log Analysis
Log Files
The list of the processes running on the adsecdx/adsecsx workstation is reported here.
Log files are located in $ADOPT_LOG on the AOeng@adsec(sx) or AOeng@adsec(dx) workstation.
Most relevant log and dump files to look for are:
Common debug procedure
- Note the timestamp of the failure
- Analyze the processes in the order listed, looking for "|ERR|", "|FAT|" or "die" strings, and refine the timestamp accuracy. The "WAR" strings can be useful to understand what happened before.
- NOTE: the timestamps of the idlctrl process may be inaccurate
- NOTE: an ERR regarding TcpXXXXException is not relevant if the connection is established a few milliseconds later. Example below.
adsec.L |ERR| 444559|2013-01-24 00:16:46.603890| ADAM-MODBUS > [TcpReceiveException] Error receiving data (errno=0 ), socket closed. (code -15003) TcpConnection receive error File TcpConnection.cpp line 130
adsec.L |INF| 444560|2013-01-24 00:16:46.603930| MAIN > adamModbusReadOutput_wrap found TCP connection not initialized ...
adsec.L |INF| 444561|2013-01-24 00:16:46.604012| TCPCONNECTION > Creating TcpConnection to 192.168.13.150:502
adsec.L |INF| 444562|2013-01-24 00:16:46.608993| TCPCONNECTION > TcpConnection succesfully created!
- In case of multiple failures within the same time interval, look for the first occurrence
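The procedure above can be sketched as a small script. This is only an illustration, not a tool shipped with the AdSec software: the line format is inferred from the log excerpts on this page, the function names are hypothetical, and the 50 ms window used to decide that a TcpXXXXException was followed by a successful reconnection is an assumption standing in for "a few milliseconds".

```python
import re
from datetime import datetime, timedelta
from typing import NamedTuple, Optional

# Line layout inferred from the excerpts on this page:
#   <process> |LVL| <counter>|<timestamp>| <SOURCE> > <message>
LINE_RE = re.compile(
    r"^(?P<proc>\S+)\s+\|(?P<level>[A-Z]{3})\|\s*(?P<count>\d+)\|"
    r"(?P<ts>[^|]+)\|\s*(?P<src>\S+)\s*>\s*(?P<msg>.*)$"
)


class Entry(NamedTuple):
    proc: str
    level: str
    ts: datetime
    src: str
    msg: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one log line; return None if it does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m["ts"].strip(), "%Y-%m-%d %H:%M:%S.%f")
    return Entry(m["proc"], m["level"], ts, m["src"], m["msg"])


def is_transient_tcp(entries, i, window=timedelta(milliseconds=50)):
    """True if entries[i] is a Tcp...Exception that the same process
    recovers from within `window` (assumed value, see lead-in)."""
    e = entries[i]
    if not re.search(r"\[Tcp\w*Exception\]", e.msg):
        return False
    for later in entries[i + 1:]:
        if later.ts - e.ts > window:
            break
        # "succesfully" is spelled as it appears in the logs.
        if later.proc == e.proc and "TcpConnection succesfully created" in later.msg:
            return True
    return False


def first_failure(lines):
    """Return the earliest ERR/FAT/"die" entry that is not a transient TCP error."""
    entries = sorted((e for e in map(parse_line, lines) if e), key=lambda e: e.ts)
    for i, e in enumerate(entries):
        suspicious = e.level in ("ERR", "FAT") or "die" in e.msg
        if suspicious and not is_transient_tcp(entries, i):
            return e
    return None
```

Feeding it the concatenated lines of several process logs gives the first relevant occurrence across all of them. Since the idlctrl timestamps may be inaccurate, an idlctrl entry returned by this sketch should be cross-checked by hand.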
Common error sources derived from the analysis of the log files
Common to all processes
- Stop signal given by adsc_stop script (here housekeeper case)
housekeeper.L |FAT| 45925847|2013-01-24 00:18:59.477209| MAIN > housekeeper.L terminated by stop
Housekeeper
housekeeper.L |ERR| 45925846|2013-01-24 00:18:58.256410| MAIN > Error reading BCUs: [CommandSenderTimeoutException] Timeout expired [HouseKeeper.cpp:2630]
This is not really an error. It means that the Housekeeper cannot retrieve the slow diagnostics from the unit. If the unit is switched off or is starting up, this log entry is normal. If it appears while the unit is powered on and working, ping all the BCUs and the MOXA (see AsmFaq).
- If nothing is up, ask around whether there was a power loss.
- If only the MOXA is up, ask whether there was a three-phase power loss. Then check the logs of the other processes (in particular Housekeeper and Fastdiagnostic) for a main power action:
housekeeper.L |INF| 45925853|2013-01-24 00:19:01.262270| ADAM-MODBUS > AdamModbus: disabling main power...
- If the MOXA is up but not all the BCUs are up, those BCUs are not responding. Try stopping and restarting all the software, check the Ethernet connection, and try replacing the BCU with a spare.
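The three cases above can be written down as a small decision helper. This is a sketch only: the function names are hypothetical, the `ping` helper assumes the Linux iputils `ping` command (`-c` count, `-W` timeout in seconds), and the real host names or addresses of the BCUs and the MOXA have to be supplied by the operator.

```python
import subprocess
from typing import Mapping


def ping(host: str, timeout_s: int = 1) -> bool:
    """Return True if `host` answers one ICMP echo (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def diagnose_connectivity(moxa_up: bool, bcus_up: Mapping[str, bool]) -> str:
    """Map the ping results onto the cases described in this section."""
    up = [name for name, ok in bcus_up.items() if ok]
    down = [name for name, ok in bcus_up.items() if not ok]
    if not moxa_up and not up:
        return "Nothing is up: ask about a power loss."
    if moxa_up and not up:
        return ("Only the MOXA is up: ask about a three-phase power loss and "
                "check the Housekeeper and Fastdiagnostic logs for a main power action.")
    if moxa_up and down:
        return ("BCUs not responding: " + ", ".join(down) + ". Stop and start all "
                "the software, check the Ethernet connection, try a spare BCU.")
    if moxa_up:
        return "All units answer the ping."
    # MOXA down while some BCUs answer: not described on this page.
    return "Case not covered by this procedure."
```

A typical call would be `diagnose_connectivity(ping(moxa_host), {h: ping(h) for h in bcu_hosts})`, where `moxa_host` and `bcu_hosts` are placeholders for the site-specific addresses.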
housekeeper.L |WAR| 45898149|2013-01-23 19:56:02.950378| FUNCTWARNING > FunctWarning FLUXRATEIN-0000 0.117108
It means that the flow rate of the cooling liquid is very low. When the system starts up the readings are not reliable and these warnings are common. When the system is working, it probably means that the cooling pump is not working properly.
- Emergency stop for over temperature / parameter out of thresholds:
Many sensors monitor the ASM. If any of them goes out of its thresholds, the system is automatically put in the REST state:
housekeeper.L |ERR| 45919187|2013-01-23 20:40:33.939430| FUNCTEMERGENCYST > BCUSTRATIXTEMP-0003 = 55.125 [Funct.cpp:94]
housekeeper.L |INF| 45919188|2013-01-23 20:40:33.939513| ADAM-MODBUS > AdamModbus: disabling coils...
housekeeper.L |INF| 45919189|2013-01-23 20:40:33.939526| ADAM-MODBUS > AdamModbus: SetSingleCoil (coil 3 = 0)
housekeeper.L |INF| 45919190|2013-01-23 20:40:33.939650| TCPCONNECTION > Creating TcpConnection to 192.168.13.150:502
housekeeper.L |INF| 45919191|2013-01-23 20:40:33.943841| TCPCONNECTION > TcpConnection succesfully created!
housekeeper.L |INF| 45919192|2013-01-23 20:40:33.943871| ADAM-MODBUS > Created TCP connection with timeout 1000 ms
If the problem is serious (like over temperature of some components), the unit is also powered down.
housekeeper.L |INF| 45919193|2013-01-23 20:40:33.958777| ADAM-MODBUS > AdamModbus: disabling main power...
housekeeper.L |INF| 45919194|2013-01-23 20:40:33.958802| ADAM-MODBUS > AdamModbus: SetSingleCoil (coil 7 = 0)
An Alert is sent to the AdsecArbitrator and the telemetry is forced to update.
housekeeper.L |INF| 45919195|2013-01-23 20:40:33.976846| FUNCTEMERGENCYST > [FunctEmergencyStop] try to send Alert to AdSec Arbitrator [Funct.cpp:123]
housekeeper.L |INF| 45919196|2013-01-23 20:40:33.977016| FUNCTEMERGENCYST > [FunctEmergencyStop] Alert to AdSec Arbitrator succesfully sent! [Funct.cpp:129]
housekeeper.L |INF| 45919197|2013-01-23 20:40:34.996781| MAIN > forcing telemetry output due to dump signal [HouseKeeper.cpp]
housekeeper.L |INF| 45919239|2013-01-23 20:40:35.376559| MAIN > forcing telemetry output due to dump signal [HouseKeeper.cpp]
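To read such a sequence quickly, the emergency stop can be summarized from the housekeeper log lines. The sketch below is an illustration based only on the message texts shown in the excerpts above; the function name and the returned dictionary layout are hypothetical.

```python
import re

# Matches the ERR line that names the variable out of threshold, e.g.
#   ... |ERR| ...| FUNCTEMERGENCYST > BCUSTRATIXTEMP-0003 = 55.125 [Funct.cpp:94]
TRIGGER_RE = re.compile(
    r"\|ERR\|.*\|\s*FUNCTEMERGENCYST\s*>\s*"
    r"(?P<var>\S+)\s*=\s*(?P<val>-?\d+(?:\.\d+)?)"
)


def summarize_emergency_stop(lines):
    """Return a summary of the first emergency stop found in `lines`,
    or None if no FUNCTEMERGENCYST error is present."""
    summary = None
    for line in lines:
        if summary is None:
            m = TRIGGER_RE.search(line)
            if m:
                summary = {
                    "variable": m["var"],
                    "value": float(m["val"]),
                    "coils_disabled": False,
                    "main_power_disabled": False,
                    "alert_sent": False,
                }
            continue
        # After the trigger, look for the follow-up actions by message text.
        if "disabling coils" in line:
            summary["coils_disabled"] = True
        elif "disabling main power" in line:
            summary["main_power_disabled"] = True
        elif "Alert to AdSec Arbitrator succesfully sent" in line:
            summary["alert_sent"] = True
    return summary
```

A summary with `main_power_disabled` set corresponds to the serious case in which the unit was also powered down.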
It may happen that the system suddenly RIPs into a safe state and that the Housekeeper process detects a sudden voltage drop. Some examples:
housekeeper.L.00001328611618.log.gz:housekeeper.L |ERR| 36030|2012-02-06 05:28:48.852961| FUNCTEMERGENCYST > BCUVOLTAGEVCCP-0000 = 8.15347 [Funct.cpp:94]
housekeeper.L.00001330377737.log.gz:housekeeper.L |ERR| 174187|2012-02-27 06:46:03.109906| FUNCTEMERGENCYST > BCUVOLTAGEVCCP-0000 = -0.0372838 [Funct.cpp:94]
housekeeper.L.00001331513151.log.gz:housekeeper.L |ERR| 1017776|2012-03-11 04:36:21.643147| FUNCTEMERGENCYST > BCUVOLTAGEVCCP-0004 = 9.25321 [Funct.cpp:94]
housekeeper.L.00001331513151.log.gz:housekeeper.L |ERR| 1119122|2012-03-11 08:38:19.761712| FUNCTEMERGENCYST > BCUVOLTAGEVCCP-0000 = 7.23682 [Funct.cpp:94]
housekeeper.L.00001331838975.log.gz:housekeeper.L |ERR| 1235374|2012-03-15 17:55:09.744432| FUNCTEMERGENCYST > BCUVOLTAGEVSSP-0003 = -8.49396 [Funct.cpp:94]
If this condition seems to be permanent after a reset, restart everything and check again. If the problem is still there, there is probably a hardware failure.
If the event occurred during an AO loop, the cause could be an instability of the loop that caused a piston force too large to be applied.
If the event occurred during normal seeing-limited operations, the cause could be a strong wind gust.
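To check whether voltage drops like the ones listed above recur, the compressed housekeeper logs can be scanned for BCUVOLTAGE emergency stops. The following is a minimal sketch: the file name pattern is taken from the examples above, while the assumption that the rotated logs sit in a single directory (for instance $ADOPT_LOG) and the function name are mine.

```python
import glob
import gzip
import os
import re

# Matches e.g.
#   ... |ERR| 36030|2012-02-06 05:28:48.852961| FUNCTEMERGENCYST > BCUVOLTAGEVCCP-0000 = 8.15347 ...
EVENT_RE = re.compile(
    r"\|ERR\|\s*\d+\|(?P<ts>[^|]+)\|\s*FUNCTEMERGENCYST\s*>\s*"
    r"(?P<var>BCUVOLTAGE\w+-\d+)\s*=\s*(?P<val>-?\d+(?:\.\d+)?)"
)


def find_voltage_events(log_dir, pattern="housekeeper.L.*.log.gz"):
    """Return (timestamp, variable, value, file name) tuples for every
    BCUVOLTAGE emergency stop found, sorted by timestamp string."""
    events = []
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        # Text mode; undecodable bytes are replaced instead of raising.
        with gzip.open(path, "rt", errors="replace") as fh:
            for line in fh:
                m = EVENT_RE.search(line)
                if m:
                    events.append(
                        (m["ts"].strip(), m["var"], float(m["val"]),
                         os.path.basename(path))
                    )
    # Timestamps are "YYYY-MM-DD HH:MM:SS.ffffff", so string order is time order.
    return sorted(events)
```

Events that cluster on the same BCU, or that keep appearing after a full restart, would point toward the hardware failure case; isolated events during an AO loop or in windy conditions fit the other two explanations.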
--
MarcoXompero - 31 Jan 2013