General TCS Troubleshooting Tips
Startup
- All subsystems should be started from the TCSGUI as user tcs. Either invoke the TCSGUI on a cluster computer, or invoke it from an obs computer as user telescope.
- All subsystems log to the directory /lbt/log; events are written to current.events and log messages to current.log.
- Some subsystems are configured (in the tcs.conf file) to run on particular servers. See values like start_on:
  map<string,string> start_on IIF:tcs2 DDS:tcs2
  They can, however, be forced to start on any host by typing the hostname into the box on the GUI before hitting the start button.
- When subsystems are started normally, stdout and stderr point to the terminal that started the networkserver, which may no longer be available. When a subsystem is started on local, both stdout and stderr are available and attached to the controlling terminal. The TCSGUI cannot start a subsystem on local; to do that, run the following on a cluster computer:
  netconfig start sub_name on local
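Since stdout and stderr are often lost for normally started subsystems, the log files are usually the first place to look. The following is a minimal sketch, not a delivered TCS tool: tcs_recent is a hypothetical helper name, and the only details taken from this page are the /lbt/log directory and the current.events and current.log file names. The pattern is matched as a literal, case-insensitive string because the log line format is not assumed here.

```sh
# Hypothetical helper: show the tail of both TCS log files, optionally
# filtered by a literal, case-insensitive pattern (e.g. a subsystem name).
# Usage: tcs_recent [pattern] [lines] [logdir]
tcs_recent() (
    pattern=${1:-}            # an empty pattern matches every line
    lines=${2:-50}            # how many trailing lines to inspect per file
    logdir=${3:-/lbt/log}     # log directory named on this page
    for f in current.events current.log; do
        echo "== $logdir/$f =="
        if [ -r "$logdir/$f" ]; then
            # grep exits non-zero when nothing matches; that is not an error here
            tail -n "$lines" "$logdir/$f" | grep -i -F -- "$pattern" || true
        else
            echo "(not readable)"
        fi
    done
)
```

For example, tcs_recent pcs 200 would show lines mentioning "pcs" among the last 200 lines of each file.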
Config Files
- Subsystem public configuration files are located in /home/telescope/TCS/Configuration/<SUBSYSTEM>
- These are the files which have been "delivered" to the Science staff; they modify them as necessary
- A semi-private configuration file is <subsystem>.conf (e.g., pcs.conf). These subsystem files are part of the build and located in /lbt/tcs/current/<subsystem>/etc.
- Any other private configuration files are part of the build and are located in /lbt/tcs/current/<subsystem>/configuration.
- The locations of the files used by the TCS are found in the <subsystem>.conf files or the /lbt/tcs/current/tcs/etc/tcs.conf file.
In some cases the configuration area will have some of the same files as located in the telescope TCS area. However, folks should NOT modify these files in general. At least in the case of PCS, these are NOT the files actually used by the TCS.
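To see at a glance which of these locations exist for a given subsystem, something like the sketch below may help. tcs_config_paths is a hypothetical helper. It assumes the public directory uses the upper-case subsystem name and the build directories use the lower-case name, which is inferred only from the <SUBSYSTEM> and <subsystem> placeholders above and may not hold for every subsystem.

```sh
# Hypothetical helper: list the configuration locations described above
# for one subsystem and report whether each exists on this host.
# Usage: tcs_config_paths pcs
tcs_config_paths() (
    # Assumption: lower-case name in the build tree, upper-case in the public area
    sub=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    upper=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    for p in \
        "/home/telescope/TCS/Configuration/$upper" \
        "/lbt/tcs/current/$sub/etc/$sub.conf" \
        "/lbt/tcs/current/$sub/configuration" \
        "/lbt/tcs/current/tcs/etc/tcs.conf"
    do
        if [ -e "$p" ]; then state=present; else state=missing; fi
        printf '%-8s %s\n' "$state" "$p"
    done
)
```

The helper only reports paths; which file a subsystem actually reads is still determined by its <subsystem>.conf or tcs.conf.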
Network Servers
rpcconfig
rpcconfig is the interface to the rpcserver process, which is the name server for the TCS RPC system. Use rpcconfig to see all the registered methods available on a particular machine.
We have had instances where a subsystem was up and running but had lost some or all of its registered functions. In that case, stop and restart the rpcserver process on that server, then restart the subsystems running there.
Usage: rpcconfig command [parameter]
Configure the rpc server
Command                                       Description
-------                                       -----------
help, -h                                      Print usage information
start address [warm|passive] [passive|warm]   Start the server
stop address                                  Stop the server
list, -l                                      List all parameters
functions, -f [address]                       List all registered functions
alias, -a [address]                           List all aliases
Use this command on jet to start the rpcserver before starting the MCSPU software:
rpcconfig start 192.168.18.170
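One way to spot lost registrations is to compare how many registered functions mention a subsystem before and after a restart. The sketch below is only an illustration: rpc_function_count and the RPCCONFIG override variable are hypothetical, and the output format of rpcconfig functions is not documented on this page, so the code simply counts output lines that contain the subsystem name.

```sh
# Hypothetical helper: count lines of "rpcconfig functions" output that
# mention a subsystem name (literal, case-insensitive match).
# Usage: rpc_function_count subsystem [address]
rpc_function_count() (
    sub=$1
    addr=${2:-}
    # RPCCONFIG is an assumed override hook, defaulting to the real command.
    # $addr is left unquoted so an empty address adds no argument.
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    ${RPCCONFIG:-rpcconfig} functions $addr | grep -c -i -F -- "$sub" || true
)
```

A count that drops to 0 for a subsystem that netconfig still lists as running would be consistent with the lost-function symptom described above.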
netconfig
netconfig is the interface to the networkserver process. netconfig provides control of all the subsystems and servers. If you are manually starting/stopping the TCS, use this command. The ps, -l, and -s options are useful to see what's going on. Note that start and stop also start/stop rpcserver and gshmserver, while start/stop all will start/stop all the subsystems on the cluster. The order for starting/stopping all the subsystems is determined by the tcs.conf variable subsystems.
[telescope@tcs2 ~]$ netconfig -h
Usage: netconfig start [left | right] (subsystem | all) [(on | -o) host] [(parameter | -p) param]
Usage: netconfig stop (subsystem | all | local)
Usage: netconfig kill subsystem
Usage: netconfig start syslogserver
Usage: netconfig stop syslogserver
Command                   Description
-------                   -----------
ps                        Show all known TCS processes
top                       Run 'top' to show all known TCS processes
help, -h                  Print usage information
version, -v               Print version information
cksum, -k                 Print check sum information
config, -c                Print configuration data
list, -l                  List all current subsystems
servers, -s               List all current servers
xml, -x                   List all current servers as xml data
delete, -d                Delete the shared memory segment
zero, -z                  Zero the shared memory segement
ZERO, -Z                  Zero the shared memory segement even if in use
start [active|passive]    Start the server
stop                      Stop the server
status                    Show internal state
remove <server>           Remove server <server> (use with caution!)
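When restarting things by hand it can help to capture the ps, -l, and -s views in one place, for example before and after a restart. This is a small sketch under stated assumptions: tcs_snapshot and the NETCONFIG override variable are hypothetical names, and only the three netconfig options listed in the help output above are used.

```sh
# Hypothetical helper: print the three status views of netconfig in one go.
# Usage: tcs_snapshot > /tmp/tcs-before.txt
tcs_snapshot() (
    # NETCONFIG is an assumed override hook, defaulting to the real command
    nc=${NETCONFIG:-netconfig}
    for view in ps -l -s; do
        echo "### netconfig $view"
        # Keep going if one view fails so the other views are still captured
        $nc "$view" 2>&1 || echo "(netconfig $view exited with status $?)"
    done
)
```

Saving one snapshot before and one after a restart and comparing them with diff shows which subsystems or servers changed.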
gshmconfig
gshmconfig is the interface to the gshmserver process. Use this command to configure and control the reflective memory. See Software/SharedMemoryProblems#gshmconfig
syslogserver
This process is the 'listener' that receives syslog output directed to facility local6 and actually writes the log file. It runs on only one machine in the cluster, determined by the DNS entry tcslog and the environment variable TCSSYSLOGRUNON.
Telemetry
The most likely causes of telemetry errors are failures when creating directories or when changing the group on the files. Telemetry complaints will therefore almost always appear at startup or rollover.
The /lbt/telemetry_data file system should be set up with the tcs subdirectory having group write permission, with the group being telemetry. On jet, if the chgrp fails when creating a new file, the group will be domain users. If you see any telemetry directories or files created with this group, the chgrp has likely failed. Delete the file and restart the subsystem.
Sample mcspu log complaints:
Mar 10 17:07:31.731: EL telem: Exception in Sample_buf constructor! 'step2' : Failed to open file for stream servo_data because unable to create file
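The group check described above can be sketched as a short script. telemetry_wrong_group is a hypothetical helper; it assumes GNU find and GNU stat (for the --printf option) and takes the directory and expected group from this page as defaults. It only reports mismatches and deletes nothing.

```sh
# Hypothetical helper: list files and directories whose group is not the
# expected one. Output is "group<TAB>path", one entry per line.
# Usage: telemetry_wrong_group [dir] [group]
telemetry_wrong_group() (
    dir=${1:-/lbt/telemetry_data/tcs}   # tcs subdirectory named on this page
    want=${2:-telemetry}                # expected group named on this page
    # A tab separator is used because group names such as "domain users"
    # contain spaces.
    find "$dir" -exec stat --printf '%G\t%n\n' {} + |
        awk -F '\t' -v want="$want" '$1 != want'
)
```

Any entry reported with the group domain users is a candidate for the delete-and-restart procedure described above.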