The Collection Library
- Updated to HDF5 version 1.10.0; the collection library is now version 4.0 (see IT 5950).
- The r19 collection library has been modified to add units 'nanosecond', 'day', 'arcminute', 'arcsecond_per_hour', 'degree_per_second', and 'degree_per_square_second'.
- The current release of r19 has been renamed to Version 3.9. The next update will be Version 3.10.
- The telemetry linking has been changed. Going forward we will link against /lbt/telemetry/current/ instead of /lbt/telemetry/rnn/. The directories downtown and on the mountain (including jet and ovms) have been renamed, and the new symbolic link 'current' has been created. The name 'r19' is linked to 3.9 to keep existing builds working.
- The r19 collection library has been modified to add units 'percent', 'hour', 'minute', and 'microradian'.
- The r19 collection library has been modified to add a mutex to protect the HDF5 error checking across all threads in a process. This will also have the effect of serializing all file append operations, which will add a small amount of time during file writing, but may have the added benefit of putting less strain on the NFS mounted filesystem. This is done because close reading of the HDF5 documentation reveals the internal error stack does not support multi-threading (the error stack is actually a global construct).
- The r19 collection library has been modified to add diagnostic syslog messages for failures when creating the packet table and its attributes.
- The r19 collection library has been modified to add diagnostic syslog messages when an append operation fails, and retry the packet table append once on failure.
- The r19 collection library has been modified to issue a Syslog message whenever a consumer thread starts or stops, giving the TID.
- The MCSPU has been modified to properly log telemetry callback messages.
- The "giving up on sending telemetry" message has been changed to a critical error.
Additional error checking has been added when accessing attributes. There are no functional or file changes associated with this.
The attributes attached to the packet table have been changed to use fixed length strings instead of variable length. This saves a notable amount of space if the attribute is rewritten often, as is the case for the ending timestamp.
As of build 2013A, we are using HDF5 for all telemetry.
Telemetry was upgraded in Dec-2012 to produce HDF5 files instead of writing to a MySQL database.
Enable / Disable
The global TCS telemetry flag can be set in a configuration file to enable/disable telemetry for all TCS subsystems. Likewise, the path to the HDF5 files is set in the same file. (It doesn't look like the subsystems have their own flag for the directory path.) Each subsystem also has its own flag:
./2014C/iif/etc/iif.conf IIFtelemetry bool true
./2014C/ecs/etc/ecs.conf ECStelemetry bool true
./2014C/gcs/etc/gcs.conf GCSL.use_telemetry bool true
./2014C/gcs/etc/gcs.conf GCSR.use_telemetry bool true
./2014C/oss/etc/oss.conf OSStelemetry bool true
./2014C/mcs/etc/mcs.conf mcsTelemetryFlag bool true
./2014C/mcs/etc/mcs.conf mcspuTelemetryFlag bool true
./2014C/pmc/etc/pmc.conf PMCtelemetry map<int,bool> 0:true 1:true
./2014C/psf/etc/psf.conf PSFtelemetry map<int,bool> 0:true 1:true
./2014C/pcs/etc/pcs.conf PCSTelemetry bool true
./2014C/env/etc/env.conf ENVtelemetry bool true
The OVMS system uses a separate mechanism: its telemetry is controlled via two processes, a server run as a daemon on the ovms host and a client that stops/starts collection. Daily cron jobs pause telemetry in the morning and restart it in the evening.
The OVMS system is a 64-bit system, but the compiler is a different version than the TCS system's, so the collection library is built on the ovms host. See Software/OVMSSoftwareBuildAndInstall#Telemetry for step-by-step directions.
DIMM uses the telemetry collection library directly as of May-2015. The DIMM computer is a CentOS 4.8 machine, so it needs its own copy of the source. There are separate directions for building DIMM with the latest telemetry source.
As of Summer-2018, OAC uses the telemetry collection library. The mountain OAC computer is a 32-bit CentOS 6.5 machine, so it needs a 32-bit copy of the library (same version as MCSPU).
How to Build / Install
The collection library uses Boost (version 1.33.1) for its UTF (Unit Test Framework) implementation and HDF5 for the file storage. There are separate notes on the changes made to the HDF5 release to support our packed data types and on how to build the HDF5 library for use with telemetry.
A single directory is all we use anymore. The Makefile there has these targets:
- all: builds the library, test programs, and both sets of documentation
- api_doc: creates the documentation for the external API that's in collection-doc on http://telemetry.lbto.org
- prj_doc: creates the low-level internal documentation
- simple_test: builds Chris' test program
- example: builds Tony's example from the documentation
Multiple shared object libraries have to be created: one 64-bit build for all TCS subsystems except MCSPU, a 32-bit CentOS 6 build for MCSPU, a separate build for the OVMS CentOS 6 64-bit machine, and one for the DIMM CentOS 4.8 machine.
- Copy the CentOS 7 64-bit version for TCS to /lbt/telemetry on tcs1.mountain.lbto.org (mounted from disk:/FILESYSTEMS/lbt/x86_64).
- Copy the CentOS 6 32-bit version for MCSPU to /lbt/telemetry on jet and to /lbt/telemetry on oac.mountain.lbto.org.
Leap Seconds File
Telemetry uses the leap-seconds.list file, typically installed in the directory /lbt/UT. This file is provided by NIST. In early June and December, get the next version of the file from ftp://ftp.nist.gov/pub/time/leap-seconds.list and copy it to every place we're running telemetry.
Per-host locations of the leap second file (each updated by a crontab user):
- /lbt/UT (symlink to /nfs/192.168.39.20/data/UT)
- /lbt/UT (symlink to /lbt/data/UT, NFS mount from tcs1)
- /lbt/data/UT (NFS mount from tcs1)
- /lbt/UT (symlink to /lbt/data/UT, NFS mount from tcs1)
The crontab entry to get the updated leap second file is:
00 18 10 6,12 * curl -fsS -o /lbt/UT/leap-seconds.list ftp://ftp.nist.gov/pub/time/leap-seconds.list
There is a page that describes the site for web access -- http://telemetry.lbto.org. The document root is given on that page. This hostname actually covers several machines.
The low-level documentation is copied to the web server as follows (we do not have access to the older, db-version of this documentation anymore, but it can be recreated from svn if necessary):
scp -pr html email@example.com:/telemetry/html/collection-doc
(The teladmin password is the Tucson root password backwards.)
Then it's accessible via http://telemetry.lbto.org/collection-doc/index.html
Notes / Issues
- Complaints like "The sample with time NNNN MJD-TAI(s) was received out of order." are seen when telemetry gets samples with the same timestamp; check that the timestamp is using precision better than seconds.
- Messages like "Proxy hasn't been bound to a buffer" typically mean that a 'store' is called on a proxy that has not been put into a stream with an 'add_child'.
Cleanup / Tuning
The following list is pretty old; it's not really being maintained anymore, just here for historical interest.
| need to test whether, if you make the attribute list too big, it keeps going or fails
| test if name allows 1st char non-numeric
| The requirements doc was updated in 2012, but it still did not match the code. The design doc is even older. We should at least put notes on these docs that they're old and not correct, just for historical purposes, and update the CAN with them.
| Documentation! Every method should at least have a one-liner telling what it does (except, of course, the gets/sets). We have endless notes on the exceptions, but you cannot tell what a given method does at all.
| Need to delete the interface samples_periodically. It only applied to the database version. The attributes it set are no longer used anywhere.
| Exceptions cleanup: there are no longer exceptions thrown for system name not in table, etc. These are part of the method signatures, so it's not just a deletion of a class; it changes method declarations.
| insert methods aren't using metaData anymore - delete it from the signature:
src/stream_table.cxx:502: warning: unused parameter 'metaData_'
src/stream_table.cxx:558: warning: unused parameter 'metaData_'
| Can we "repack" the file after it is written? The packing seems not as good for small datasets as for large ones; there is significant overhead (20 to 50%) for the smaller datasets.
| Add an attribute to the HDF5 dataset that describes whether the data is diagnostic.
| The attributes associated with the dataset are maintained in the header of the table. This implementation limits the size of the set of attributes. We could move them into a separate dataset; then the header would hold just a link to that dataset, and the attributes could be unlimited.
August-2013 Sample Dropped Exceptions
Chris found the sample dropped complaints occurring even in simulation mode down here on his machine. It turns out these are not because of a full buffer. The sample_dropped exception is given if the buffer is full, or if the timestamp is the same as or older than the previous one, or too far in the future.
June-5-2013 Sample Drops on Jet
The backup on jet has been modified to be during the day:
messages-20130602:Jun 1 17:06:03 jet root: Home directory backup complete
messages-20130602:Jun 2 17:50:04 jet root: Home directory backup complete
messages :Jun 3 17:17:04 jet root: Home directory backup complete
messages :Jun 4 17:40:07 jet root: Home directory backup complete
messages :Jun 5 17:07:03 jet root: Home directory backup complete
But, we are still seeing samples dropped:
Wed Jun 5 05:32:38.104 2013 .. mcs.mcspu.telemetry.sample_dropped null AZ axis got a sample_dropped exception when it did a commit_sample call.
Mon Jun 3 11:37:44.114 2013 .. mcs.mcspu.telemetry.sample_dropped null AZ axis got a sample_dropped exception when it did a commit_sample call.
What else is going on at those times? nothing in =/var/log/messages=
What's going on in the MCSPU log at those times?
Jun 5 05:32:31.466: AZ Tracker :RECEIVED poly from PCS: t0=4877127186.468 t1=4877127186.518 n=2 seq=24726000 314:56:00.0 -12.50 0.00
Jun 5 05:32:38.029: MCSPU: AZ/EL ON_SOURCE just became FALSE AZ= 314:54:38.5 EL= 066:31:41.9
Jun 5 05:32:38.029: MCSPU: tracking error: az= 0.00, ELerr*cos(EL)= -4.97, on sky = 4.97 arcsec
Jun 5 05:32:38.029: AZ Tracker :Track Poly= t0=4877127193.019 t1=4877127193.069 n=2 seq=24726131 314:54:38.3 -12.48 0.00
Jun 5 05:32:38.029: AZ Tracker :actualPositionTime = 4877127194.000, poly evaluated at that time gives 314:54:26.0
Jun 5 05:32:38.029: EL Tracker:Track Poly= t0=4877127193.019 t1=4877127193.069 n=2 seq=24726131 066:31:41.8 -8.95 0.00
Jun 5 05:32:38.029: EL Tracker:actualPositionTime = 4877127193.010, poly evaluated at that time gives 066:31:41.9
Jun 5 05:32:38.029: MCSPU: AZ-only ON_SOURCE just became false. AZ= 314:54:38.5
Jun 5 05:32:38.104: AZ telem: AZ telem. stream got a 'sample_dropped' exception!
Jun 5 05:32:38.106: AZ Tracker :A slew appears to have started spontaneously while in TRACKING state
Jun 5 05:32:38.106: AZ Tracker :State changed from TRACKING --> SLEW_TO_TRACK
Jun 5 05:32:38.536: AZ Tracker :slew done. position = 314:54:31.6 , vel= -12.082 asec/sec
Jun 5 05:32:38.536: AZ Tracker :Track Poly: t0=4877127193.519 t1=4877127193.569 n=2 seq=24726141 314:54:32.0 -12.48 0.00
Jun 3 11:37:31.202: AZ telem: Telem. storeDataAxis() called 2731133 times.
Jun 3 11:37:34.489: EL Tracker:poly sent to DSP: t0=4876976289.487 t1=4876976289.537 n=2 seq=21709000 083:51:37.9 5.14 -0.00
Jun 3 11:37:34.489: EL Tracker:current pos = 083:51:37.6 poly t0 - current time= -0.003
Jun 3 11:37:37.198: LDG RTracker: poly sent to DSP: t0=4876976292.199 t1=4876976292.279 n=2 seq=3480961 209:59:58.2 0.00 0.00
Jun 3 11:37:37.198: LDG RTracker: current pos = 209:59:58.7 poly t0 - current time= -0.000
Jun 3 11:37:39.313: EL telem: Telem. storeDataAxis() called 2764565 times.
Jun 3 11:37:44.073: MCSPU: AZ/EL ON_SOURCE just became FALSE AZ= 156:05:48.9 EL= 083:52:26.6
Jun 3 11:37:44.073: MCSPU: tracking error: az= -0.03, ELerr*cos(EL)= 12.34, on sky = 12.34 arcsec
Jun 3 11:37:44.073: AZ Tracker :Track Poly= t0=4876976299.037 t1=4876976299.087 n=2 seq=21709191 156:05:53.0 115.79 -0.00
Jun 3 11:37:44.073: AZ Tracker :actualPositionTime = 4876976300.000, poly evaluated at that time gives 156:07:44.5
Jun 3 11:37:44.073: EL Tracker:Track Poly= t0=4876976299.037 t1=4876976299.087 n=2 seq=21709191 083:52:26.7 5.08 -0.00
Jun 3 11:37:44.073: EL Tracker:actualPositionTime = 4876976299.010, poly evaluated at that time gives 083:52:26.6
Jun 3 11:37:44.104: MCSPU: AZ-only ON_SOURCE just became false. AZ= 156:05:48.9
Jun 3 11:37:44.114: AZ telem: AZ telem. stream got a 'sample_dropped' exception!
Jun 3 11:37:44.134: AZ Tracker :A slew appears to have started spontaneously while in TRACKING state
Jun 3 11:37:44.134: AZ Tracker :State changed from TRACKING --> SLEW_TO_TRACK
Jun 3 11:37:44.414: AZ Tracker :slew done. position = 156:06:23.9 , vel= 118.137 asec/sec
We are not seeing complaints from any other subsystems anymore.
We had a TCS release (2013B) on 24-May-2013. Since the release, we have seen only the occasional AZ or EL failure from MCSPU. These failures are interesting because they are typically only a single dropped sample!
The following info is from logs dated Feb-2013 through 8-May-2013...
We are experiencing four kinds of telemetry failures.
1. Directory creation failures on rollover to new date.
This mostly occurs in the PMC streams, of which there are hundreds, but we've seen it once on MCSPU. The retry fix that Chris put in seems to have made this occur much less often. We've seen just 10 instances of this failure in the period (Feb to 8-May); 12-April is when the retry went into the library, with 7 failures creating the directory structure before the change went in and 3 after. These are easy to see in the logs because we have a specific error code for them; it's not an HDF complaint:
Fri Mar 1 00:01:00 2013 PMCL pmc.telemetryInfo left left PMC telemetry information: Telemetry notice: Failed to append record because failed to create directory structure, giving up on sending telemetry
2. HDF errors on rollover to new date.
This is NOT only the OSS dyb stream; there are a couple of these for PMC. We have "no error" from HDF5 on ALL of these except for one time (it's possible this is because the HDF error stack is not thread-safe and the error had been cleared).
Wed May 1 22:29:06 2013 PMCL pmc.telemetryInfo left left PMC telemetry information: Telemetry notice: Failed to append record because No HDF5 error, giving up on sending telemetry
Wed May 1 22:32:54 2013 PMCR pmc.telemetryInfo right right PMC telemetry information: Telemetry notice: Failed to append record because No HDF5 error, giving up on sending telemetry
One time we saw:
Mon Apr 22 00:00:00 2013 OSS oss.telemetryInfo null Telemetry information: Failed to append record because identifier has invalid type, giving up on sending telemetry
Interestingly, we had a "no HDF5 error" at the same time from OSS - the stream name is not given, so it's hard to tell what happened. We only had "sample buffer is full" complaints from the dyb stream later in this event log. Why would we have two errors from the same creation? Or is it possible the errors were from two different streams and the other stream ignored the error and continued?
3. MCSPU Failures.
These are sample buffer full messages, but we do not see the "append" errors in the logs. Sometimes it recovers from the buffer full messages, sometimes it doesn't until a restart.
Both the AZ and EL streams from MCSPU fail, but EL seems to fail a little more frequently. In Feb and April the two had about the same number of failures; in March and May, EL was worse than AZ.
These errors are not just on rollover -- Two errors last night:
Wed May 8 10:08:50 2013 MCS mcs.mcspu.telemetry.sample_dropped null EL axis got a sample_dropped exception when it did a commit_sample call.
Wed May 8 10:09:37 2013 MCS mcs.mcspu.telemetry.sample_dropped null EL axis got a sample_dropped exception when it did a commit_sample call.
These messages are in the middle of a telemetry file. The file has a start time of "Wed May 8 08:33:49 2013" and an end time of "Wed May 8 11:19:49 2013"; this is much longer than it typically takes servo data to roll over. And there are just the two complaints. Maybe these are just because telemetry couldn't keep up?
Are AZ/EL the only TCS files that roll over during the night due to size?
rolls over a couple times a night
a couple times
Any complaints in the jet logs for these? -- No, no complaints in the jet syslogs except the sample buffer full messages.
There is a callback method defined in MCSPU, but we don't get the "collection disconnected" message in the syslog. Haven't we seen problems like this before with the MCSPU not getting called back? Is the consumer thread somehow going away quietly? We need to look in the collection library for when this thread might exit, aside from the "giving up on telemetry" scenario.
Tom went through the complaints that involve just 1, 2, or 3 dropped samples (as opposed to the ones that go on for hours with thousands of dropped samples). They nearly all happen between 10:00 UT and 10:50 UT. That seems unlikely to be a coincidence. What happens around 3 AM local every day? The home directory backup!
From the telemetry error messages file:
may 7 - 21:09
may 8 - 10:08 May 8 10:08:30 jet root: Home directory backup complete
may 14 - 10:38 May 14 10:37:20 jet root: Home directory backup complete
may 15 - 10:44 May 15 10:43:05 jet root: Home directory backup complete
may 18 - 10:50 May 18 10:49:20 jet root: Home directory backup complete
may 19 - 10:39 May 19 10:39:22 jet root: Home directory backup complete
may 20 - 10:44 May 20 10:43:04 jet root: Home directory backup complete
may 24 - 10:31 May 24 10:30:03 jet root: Home directory backup complete
may 25 - 10:12 May 25 10:11:06 jet root: Home directory backup complete
may 29 - 10:24 May 29 10:23:52 jet root: Home directory backup complete
may 30 - 10:32 May 30 10:32:18 jet root: Home directory backup complete
4. Trying to write an invalidly named stream, by OSS. This is an application problem and goes away with the next build of the OSS subsystem.
- Why just a single dropped sample? And only on jet.
- The consumer decides, based on how many samples are in the buffer, whether to do one sample or multiples. Could there be some timing issue associated with the exception? Maybe the time it takes to handle the exception is long enough to free up space in the buffer?
- Where are these samples in relation to the HDF5 file? Early in the file? The last one Chris checked was about 1.5 hours into the file.
- MCSPU buffers hold about 144 sec of data (10 Hz).