Telemetry Software

The Collection Library

Release Notes

Sep-2016
  • Updated to HDF5 version 1.10.0; the collection library is now Version 4.0 (see IT 5950).

Oct-2015
  • The r19 collection library has been modified to add units 'nanosecond', 'day', 'arcminute', 'arcsecond_per_hour', 'degree_per_second', and 'degree_per_square_second'.
  • The current release of r19 has been renamed to Version 3.9. The next update will be Version 3.10.
  • The telemetry linking has been changed. Going forward we will link against /lbt/telemetry/current/ instead of /lbt/telemetry/rnn/. The directories downtown and on the mountain (including jet and ovms) have been renamed, and the new symbolic link 'current' has been created. The name 'r19' is linked to 3.9 to keep existing builds working.

Jun-2015
  • The r19 collection library has been modified to add units 'percent', 'hour', 'minute', and 'microradian'.

May-2015
  • The r19 collection library has been modified to add a mutex to protect the HDF5 error checking across all threads in a process. This will also have the effect of serializing all file append operations, which will add a small amount of time during file writing, but may have the added benefit of putting less strain on the NFS mounted filesystem. This is done because close reading of the HDF5 documentation reveals the internal error stack does not support multi-threading (the error stack is actually a global construct).
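The locking described in this note can be sketched as follows. This is an illustrative Python model, not the library's C++ code: `append_record`, `_error_stack`, `_hdf5_lock`, and `run_demo` are hypothetical names, and the module-level list only stands in for a process-wide error stack.

```python
import threading

# Hypothetical stand-in for a process-wide error stack that is not thread-safe.
_error_stack = []

# One lock shared by every thread in the process.
_hdf5_lock = threading.Lock()


def append_record(table, record, fail=False):
    """Append a record and read back the error state under a single lock.

    Holding the lock across both steps keeps another thread from clearing or
    overwriting the shared error stack between the append and the check.
    Side effect: appends from all threads are serialized.
    """
    with _hdf5_lock:
        _error_stack.clear()
        if fail:
            _error_stack.append("append failed")  # simulated failure
        else:
            table.append(record)
        # Copy the error state before releasing the lock.
        return list(_error_stack)


def run_demo(n_threads=8, per_thread=100):
    """Run several writer threads; return (rows written, errors observed)."""
    table = []
    errors = []

    def worker(tid):
        for i in range(per_thread):
            # Every 50th append is made to fail, for illustration only.
            errs = append_record(table, (tid, i), fail=(i % 50 == 0))
            if errs:
                errors.append(errs)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(table), len(errors)
```

Because the error state is copied while the lock is still held, each caller sees only the errors from its own append, at the cost of running appends one at a time.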

Jan-2015
  • The r19 collection library has been modified to add diagnostic syslog messages for failures when creating the packet table and its attributes.

Dec-2014
  • The r19 collection library has been modified to add diagnostic syslog messages when an append operation fails, and retry the packet table append once on failure.
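A retry-once policy of the kind described here might look like the sketch below. It is written in Python for illustration and is not the library's code: `append_with_retry` and `FlakyTable` are hypothetical names, and Python's `logging` module stands in for syslog.

```python
import logging

log = logging.getLogger("telemetry")


def append_with_retry(append, record, retries=1):
    """Call append(record); on failure, log a diagnostic and retry.

    `append` is any callable that raises an exception on failure.
    Returns the number of attempts used; re-raises if every attempt fails.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            append(record)
            return attempt
        except Exception as exc:
            log.error("append failed (attempt %d of %d): %s", attempt, attempts, exc)
            if attempt == attempts:
                raise


class FlakyTable:
    """Test double: fails the first `failures` appends, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.rows = []

    def append(self, record):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("simulated append failure")
        self.rows.append(record)
```

With `retries=1`, a single transient failure is absorbed and logged, while a second consecutive failure is passed on to the caller.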

Nov-2014
  • The r19 collection library has been modified to issue a syslog message whenever a consumer thread starts or stops, giving the thread ID (TID).
  • The MCSPU has been modified to properly log telemetry callback messages.
  • The "giving up on sending telemetry" message has been changed to a critical error.

Sep-2014
  • Additional error checking has been added when accessing attributes. There are no functional or file changes associated with this.

Feb-2014
  • The attributes attached to the packet table have been changed to use fixed length strings instead of variable length. This saves a notable amount of space if the attribute is rewritten often, as is the case for the ending timestamp.

Jan-2013
  • With TCS Build 2013A, we are using HDF5 for all telemetry. Telemetry was upgraded in Dec-2012 to produce HDF5 files instead of writing to a MySQL database.

Enable / Disable

TCS Subsystems

The global TCS flag collectTelemetry in /lbt/tcs/current/tcs/etc/tcs.conf enables or disables telemetry for all TCS subsystems.
The path to the HDF5 files is set in the same file as TEL_STORE_DIR. (The subsystems do not appear to have their own setting for the directory path.)

In addition, each subsystem has its own flag:

./2014C/iif/etc/iif.conf   IIFtelemetry         bool            true
./2014C/ecs/etc/ecs.conf   ECStelemetry         bool            true
./2014C/gcs/etc/gcs.conf   GCSL.use_telemetry   bool            true
./2014C/gcs/etc/gcs.conf   GCSR.use_telemetry   bool            true
./2014C/oss/etc/oss.conf   OSStelemetry         bool            true
./2014C/mcs/etc/mcs.conf   mcsTelemetryFlag     bool            true
./2014C/mcs/etc/mcs.conf   mcspuTelemetryFlag   bool            true
./2014C/pmc/etc/pmc.conf   PMCtelemetry         map<int,bool>   0:true 1:true
./2014C/psf/etc/psf.conf   PSFtelemetry         map<int,bool>   0:true 1:true
./2014C/pcs/etc/pcs.conf   PCSTelemetry         bool            true
./2014C/env/etc/env.conf   ENVtelemetry         bool            true
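
A listing like the one above can be checked mechanically. The Python sketch below assumes the whitespace-separated layout shown in the listing (path, flag name, type, value); the syntax of the .conf files themselves is not described here, and `parse_flag_line` and `disabled_flags` are hypothetical helper names.

```python
def parse_flag_line(line):
    """Split one listing line into (conf_path, flag_name, type_name, value).

    bool flags become True/False; map<int,bool> flags become {index: bool}.
    """
    path, name, type_name, *raw = line.split()
    if type_name == "bool":
        value = raw[0] == "true"
    elif type_name == "map<int,bool>":
        value = {}
        for item in raw:
            key, _, flag = item.partition(":")
            value[int(key)] = flag == "true"
    else:
        raise ValueError("unhandled type: " + type_name)
    return path, name, type_name, value


def disabled_flags(lines):
    """Return the names of flags that are off, fully or for any index."""
    off = []
    for line in lines:
        if not line.strip():
            continue
        _, name, _, value = parse_flag_line(line)
        values = value.values() if isinstance(value, dict) else [value]
        if not all(values):
            off.append(name)
    return off
```

Feeding the listing to `disabled_flags` gives a quick answer to "which subsystems have telemetry turned off?" without reading each file by hand.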

MCSPU

MCSPU uses a separate /lbt/tcs/current/tcs/etc/tcs.conf file

OVMS

OVMS telemetry is controlled by two processes: a server, run as a daemon on the ovms host, and a client that pauses or resumes collection. Daily cron jobs pause telemetry in the morning and resume it in the evening.

The OVMS host is a 64-bit system, but its compiler is a different version from the one on the TCS system, so the collection library is built on the ovms host itself. See Software/OVMSSoftwareBuildAndInstall#Telemetry for step-by-step directions.

DIMM

DIMM uses the telemetry collection library directly as of May-2015. The computer is a CentOS 4.8 machine, so it needs its own copy of the source.
See Software/DIMMSoftwareBuildAndInstall for building DIMM with the latest telemetry source.

OAC

As of Summer-2018, OAC uses the telemetry collection library. The mountain OAC computer is a 32-bit CentOS 6.5 machine, so it needs a 32-bit copy of the library (the same version as MCSPU).

How to Build / Install

Dependencies

The collection library uses Boost (version 1.33.1) for its UTF (Unit Test Framework) implementation and HDF5 for the file storage.

See HDF5CollectionLibraryNotes for changes made to the HDF5 release to support our packed data types and how to build the HDF5 library for use with telemetry.

Build Library

Only the collection directory is still in use. Its Makefile has the following targets:
  • all: builds the library, test programs, and both sets of documentation
  • api_doc: creates the documentation for the external API that's in collection-doc on http://telemetry.lbto.org
  • prj_doc: creates the low-level internal documentation
  • simple_test: builds Chris' test program
  • example: builds Tony's example from the documentation

Multiple shared object libraries have to be created: a 64-bit build for the TCS subsystems other than MCSPU, a 32-bit CentOS 6 build for MCSPU and OAC, a separate build for the OVMS CentOS 6 64-bit machine, and another for the DIMM CentOS 4.8 machine.

Build on OVMS and DIMM

See Software/OVMSSoftwareBuildAndInstall#Telemetry and Software/DIMMSoftwareBuildAndInstall.

Copy the CentOS 7 64-bit version for TCS

to:
 /lbt/telemetry on tcs1.mountain.lbto.org
 mounted from disk:/FILESYSTEMS/lbt/x86_64

Copy the CentOS 6 32-bit version for MCSPU

to:
 /lbt/telemetry on jet
 /lbt/telemetry on oac.mountain.lbto.org

Leap Seconds File

Telemetry uses the leap-seconds.list file, typically installed in the directory /lbt/UT. This file is provided by NIST. In early June and early December, get the next version of the file from ftp://ftp.nist.gov/pub/time/leap-seconds.list and copy it to every place where telemetry is running.

machine                       directory                                                 crontab user
dimm.mountain.lbto.org        /lbt/data/UT (nfs mount from tcs1)
jet.mountain.lbto.org         /lbt/UT (symlink to /lbt/data/UT, nfs mount from tcs1)
luci.luci.mountain.lbto.org   /lbt/data/UT (")
oac.mountain.lbto.org         /lbt/UT                                                   oac@oac
ovms.mountain.lbto.org        (")
tcs-test.tucson.lbto.org      /lbt/UT                                                   tcs@tcs-test
tcs1.mountain.lbto.org        /lbt/UT (symlink to /nfs/192.168.39.20/data/UT)           tcs@tcs1
tcs2.mountain.lbto.org        /lbt/UT (symlink to /lbt/data/UT, nfs mount from tcs1)

The crontab entry that fetches the updated leap-seconds file is:

00 18 10 6,12 * curl -fsS -o /lbt/UT/leap-seconds.list ftp://ftp.nist.gov/pub/time/leap-seconds.list
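
To confirm that a fetched file is current, its expiration stamp can be read back. The Python sketch below relies on my understanding of the file format, which should be checked against the header comments of the file itself: the line beginning with `#@` carries the expiration time, and each data line starts with a time stamp followed by the TAI-UTC offset, with all time stamps counted in seconds since 1900-01-01. `parse_leap_seconds` and `is_stale` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

# Time stamps in the file are assumed to count seconds from this epoch.
NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def parse_leap_seconds(text):
    """Return (expires, entries) from leap-seconds.list content.

    expires: datetime taken from the '#@' line, or None if there is none.
    entries: list of (effective datetime, TAI-UTC offset in seconds).
    """
    expires = None
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#@"):
            expires = NTP_EPOCH + timedelta(seconds=int(line[2:].split()[0]))
        elif line and not line.startswith("#"):
            fields = line.split()
            when = NTP_EPOCH + timedelta(seconds=int(fields[0]))
            entries.append((when, int(fields[1])))
    return expires, entries


def is_stale(text, now):
    """True if the file has no expiration line or has already expired."""
    expires, _ = parse_leap_seconds(text)
    return expires is None or now >= expires
```

A check like `is_stale(open("/lbt/UT/leap-seconds.list").read(), datetime.now(timezone.utc))` could be run after the cron job to flag a copy that was not refreshed.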

Web Access

There's a /etc/httpd/conf.d/telemetry.conf file on tel1.tucson.lbto.org that describes the site for web access -- http://telemetry.lbto.org. The document root is /telemetry/html

This machine is known by several names:
/telemetry
data.tucson.lbto.org
telemetry.tucson.lbto.org
tel1.tucson.lbto.org
downloads.lbto.org

The low-level documentation is copied to /telemetry/html/collection-doc. (We no longer have access to the older, database version of this documentation, but it can be recreated from svn if necessary.) The copy is done as teladmin; the teladmin password is the Tucson root password backwards:
  gmake prj_doc
  cd prj_doc
  scp -pr html teladmin@tel1.tucson.lbto.org:/telemetry/html/collection-doc

Then it's accessible via http://telemetry.lbto.org/collection-doc/index.html

Notes / Issues

  • Complaints like "The sample with time NNNN MJD-TAI(s) was received out of order." are seen when telemetry gets samples with the same timestamp. Check that the timestamp uses a precision finer than seconds.
  • Messages like "Proxy hasn't been bound to a buffer" typically mean that a 'store' was called on a proxy that has not been put into a stream with an 'add_child'.
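
The first note can be illustrated with a short Python sketch. It assumes, based only on the wording of the message, that a sample is rejected unless its timestamp is strictly later than the last accepted one; the library's actual check is not shown here, and `count_rejected` and the example start time are made up for illustration.

```python
def count_rejected(timestamps):
    """Count samples whose timestamp is not strictly later than the last accepted one."""
    rejected = 0
    last = None
    for t in timestamps:
        if last is not None and t <= last:
            rejected += 1
        else:
            last = t
    return rejected


# Ten samples at 10 Hz, held as integer milliseconds to avoid float rounding.
# The start value is an arbitrary example, not taken from a real file.
start_ms = 4_877_127_186_000
millis = [start_ms + 100 * i for i in range(10)]

# The same samples with the timestamp truncated to whole seconds.
whole_seconds = [ms // 1000 for ms in millis]

print(count_rejected(millis))         # 0
print(count_rejected(whole_seconds))  # 9
```

With whole-second precision all ten samples carry the same timestamp, so every sample after the first looks like a repeat.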

Cleanup / Tuning

The following list is old and no longer maintained; it is kept only for historical interest.

Bugs
  • Need to test what happens when the attribute list is made too big: does it keep going or fail?
  • Test whether a name allows a non-numeric first character.

Cleanup
  • The requirements doc was updated in 2012, but it still did not match the code. The design doc is even older. We should at least put notes on these docs saying that they are old and incorrect and kept for historical purposes only, and update the CAN with them.
  • Documentation! Every method should have at least a one-liner saying what it does (except, of course, the gets/sets). We have endless notes on the exceptions, but you cannot tell what a given method does at all.
  • Need to delete the interface samples_periodically. It only applied to the database version. The attributes it set are no longer used anywhere.
  • Exceptions cleanup: exceptions are no longer thrown for system name not in table, etc. These are part of the method signatures, so it's not just the deletion of a class; it changes method declarations.
  • The insert methods aren't using metaData anymore; delete it from the signature:
     src/stream_table.cxx:502: warning: unused parameter ‘metaData_’
     src/stream_table.cxx:558: warning: unused parameter ‘metaData_’

Enhancements
  • Can we "repack" the file after it is written? The packing seems less effective for small datasets than for large ones; there is significant overhead (20 to 50%) for the smaller datasets.
  • Add an attribute to the HDF5 dataset that describes whether the data is diagnostic.
  • The attributes associated with the dataset are maintained in the header of the table. This implementation limits the size of the set of attributes. We could move them into a separate dataset; the header would then hold just a link to that dataset, and the attributes could be unlimited.

August-2013 Sample Dropped Exceptions

Chris found the sample dropped complaints occurring even in simulation mode down here on his machine. It turns out these are not caused by a full buffer. The sample dropped exception is raised if the buffer is full, or if the timestamp is a repeat, is old, or is too far in the future.
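
The conditions listed above can be written out as a small decision function. This Python sketch is a model for discussion only: the function name, the parameters, the order of the checks, and the `max_future` threshold are all assumptions and are not taken from the collection library.

```python
def drop_reason(timestamp, last_timestamp, now, buffered, capacity, max_future):
    """Return why a sample would be dropped, or None if it would be accepted.

    timestamp, last_timestamp, and now are in seconds; last_timestamp is
    None before the first sample. max_future is a hypothetical limit on how
    far ahead of `now` a timestamp may be.
    """
    if buffered >= capacity:
        return "buffer full"
    if last_timestamp is not None and timestamp <= last_timestamp:
        return "timestamp same or old"
    if timestamp - now > max_future:
        return "timestamp too far in the future"
    return None
```

The point of the model is that a sample_dropped report alone does not say which of these conditions fired, so a drop seen with a nearly empty buffer points at the timestamp checks.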

June-5-2013 Sample Drops on Jet

The backup on jet has been moved to run during the day:
messages-20130602:Jun  1 17:06:03 jet root: Home directory backup complete
messages-20130602:Jun  2 17:50:04 jet root: Home directory backup complete
messages         :Jun  3 17:17:04 jet root: Home directory backup complete
messages         :Jun  4 17:40:07 jet root: Home directory backup complete
messages         :Jun  5 17:07:03 jet root: Home directory backup complete

But, we are still seeing samples dropped:
Wed Jun  5 05:32:38.104 2013 ..  mcs.mcspu.telemetry.sample_dropped null AZ axis got a sample_dropped exception when it did a commit_sample call.
Mon Jun  3 11:37:44.114 2013 ..  mcs.mcspu.telemetry.sample_dropped null AZ axis got a sample_dropped exception when it did a commit_sample call.

What else is going on at those times?
Nothing in /var/log/messages.

What's going on in the MCSPU log:
Jun  5 05:32:31.466: AZ Tracker :RECEIVED poly from PCS: t0=4877127186.468 t1=4877127186.518 n=2 seq=24726000  314:56:00.0    -12.50     0.00
Jun  5 05:32:38.029: MCSPU: AZ/EL ON_SOURCE just became FALSE AZ= 314:54:38.5  EL= 066:31:41.9 
Jun  5 05:32:38.029: MCSPU: tracking error: az=    0.00, ELerr*cos(EL)=   -4.97,  on sky =    4.97 arcsec
Jun  5 05:32:38.029: AZ Tracker :Track Poly= t0=4877127193.019 t1=4877127193.069 n=2 seq=24726131  314:54:38.3    -12.48     0.00
Jun  5 05:32:38.029: AZ Tracker :actualPositionTime = 4877127194.000, poly evaluated at that time gives  314:54:26.0 
Jun  5 05:32:38.029: EL Tracker:Track Poly= t0=4877127193.019 t1=4877127193.069 n=2 seq=24726131  066:31:41.8     -8.95     0.00
Jun  5 05:32:38.029: EL Tracker:actualPositionTime = 4877127193.010, poly evaluated at that time gives  066:31:41.9 
Jun  5 05:32:38.029: MCSPU: AZ-only ON_SOURCE just became false. AZ= 314:54:38.5 
Jun  5 05:32:38.104: AZ telem: AZ telem. stream got a 'sample_dropped' exception!
Jun  5 05:32:38.106: AZ Tracker :A slew appears to have started spontaneously while in TRACKING state
Jun  5 05:32:38.106: AZ Tracker :State changed from TRACKING --> SLEW_TO_TRACK
Jun  5 05:32:38.536: AZ Tracker :slew done. position =  314:54:31.6  , vel=   -12.082 asec/sec

Jun  5 05:32:38.536: AZ Tracker :Track Poly: t0=4877127193.519 t1=4877127193.569 n=2 seq=24726141  314:54:32.0    -12.48     0.00



Jun  3 11:37:31.202: AZ telem: Telem. storeDataAxis() called 2731133 times.
Jun  3 11:37:34.489: EL Tracker:poly sent to DSP:  t0=4876976289.487 t1=4876976289.537 n=2 seq=21709000  083:51:37.9      5.14    -0.00
Jun  3 11:37:34.489: EL Tracker:current pos =  083:51:37.6   poly t0 - current time=     -0.003
Jun  3 11:37:37.198: LDG RTracker: poly sent to DSP:  t0=4876976292.199 t1=4876976292.279 n=2 seq=3480961  209:59:58.2      0.00     0.00
Jun  3 11:37:37.198: LDG RTracker: current pos =  209:59:58.7   poly t0 - current time=     -0.000
Jun  3 11:37:39.313: EL telem: Telem. storeDataAxis() called 2764565 times.
Jun  3 11:37:44.073: MCSPU: AZ/EL ON_SOURCE just became FALSE AZ= 156:05:48.9  EL= 083:52:26.6 
Jun  3 11:37:44.073: MCSPU: tracking error: az=   -0.03, ELerr*cos(EL)=   12.34,  on sky =   12.34 arcsec
Jun  3 11:37:44.073: AZ Tracker :Track Poly= t0=4876976299.037 t1=4876976299.087 n=2 seq=21709191  156:05:53.0    115.79    -0.00
Jun  3 11:37:44.073: AZ Tracker :actualPositionTime = 4876976300.000, poly evaluated at that time gives  156:07:44.5 
Jun  3 11:37:44.073: EL Tracker:Track Poly= t0=4876976299.037 t1=4876976299.087 n=2 seq=21709191  083:52:26.7      5.08    -0.00
Jun  3 11:37:44.073: EL Tracker:actualPositionTime = 4876976299.010, poly evaluated at that time gives  083:52:26.6 
Jun  3 11:37:44.104: MCSPU: AZ-only ON_SOURCE just became false. AZ= 156:05:48.9 
Jun  3 11:37:44.114: AZ telem: AZ telem. stream got a 'sample_dropped' exception!
Jun  3 11:37:44.134: AZ Tracker :A slew appears to have started spontaneously while in TRACKING state
Jun  3 11:37:44.134: AZ Tracker :State changed from TRACKING --> SLEW_TO_TRACK
Jun  3 11:37:44.414: AZ Tracker :slew done. position =  156:06:23.9  , vel=   118.137 asec/sec

We are not seeing complaints from any other subsystems anymore.

May-2013 Failures

We had a TCS release (2013B) on 24-May-2013. Since the release, we have seen only the occasional AZ or EL failure from MCSPU. These failures are interesting because they are typically only a single sample dropped!


The following info is from logs dated Feb-2013 through 8-May-2013...

We are experiencing four kinds of telemetry failures.

1. Directory creation failures on rollover to new date.
This mostly occurs in the PMC streams, of which there are hundreds, but we have seen it once on MCSPU. The retry fix that Chris put in seems to have made it occur much less often. We saw just 10 instances of this failure in the period (Feb to 8-May). The retry went into the library on 12-April: 7 failures creating the directory structure occurred before the change went in, and 3 after. These are easy to see in the logs because we have a specific error code for them; it is not an HDF complaint:

Fri Mar  1 00:01:00 2013  PMCL pmc.telemetryInfo left left PMC telemetry information: Telemetry notice: Failed to append record because failed to create directory structure, giving up on sending telemetry


2. HDF errors on rollover to new date.
This is NOT only the OSS dyb stream. There are a couple of these for PMC. We have "no error" from HDF5 on ALL of these except for one time (it's possible this is because the HDF error stack is not thread-safe and the error has been cleared).

Wed May  1 22:29:06 2013 PMCL pmc.telemetryInfo left left PMC telemetry information: Telemetry notice: Failed to append record because No HDF5 error, giving up on sending telemetry
Wed May  1 22:32:54 2013 PMCR pmc.telemetryInfo right right PMC telemetry information: Telemetry notice: Failed to append record because No HDF5 error, giving up on sending telemetry 


One time we saw:

Mon Apr 22 00:00:00 2013 OSS  oss.telemetryInfo null Telemetry information: Failed to append record because identifier has invalid type, giving up on sending telemetry

Interestingly, we had a "no HDF5 error" at the same time from OSS - the stream name is not given, so it's hard to tell what happened. We only had "sample buffer is full" complaints from the dyb stream later in this event log. Why would we have two errors from the same creation? Or is it possible the errors were from two different streams and the other stream ignored the error and continued?


3. MCSPU failures. These are sample buffer full messages, but we do not see the "append" errors in the logs. Sometimes it recovers from the buffer full messages; sometimes it doesn't until a restart.
Both the az and el streams from MCSPU fail, but el seems to fail a little more frequently.
In Feb and April, AZ and EL had about the same number of failures:

nonOSSTelemetryDrops-April.txt:87  EL
nonOSSTelemetryDrops-April.txt:85  AZ
nonOSSTelemetryDrops-Feb.txt:368  AZ
nonOSSTelemetryDrops-Feb.txt:364  EL

March and May, EL was worse than AZ:

nonOSSTelemetryDrops-March.txt:70  AZ
nonOSSTelemetryDrops-March.txt:298  EL
nonOSSTelemetryDrops-May.txt:139  AZ
nonOSSTelemetryDrops-May.txt:1367  EL

These errors are not just on rollover. Two errors from last night:

Wed May  8 10:08:50 2013 MCS  mcs.mcspu.telemetry.sample_dropped null EL axis got a sample_dropped exception when it did a commit_sample call.
Wed May  8 10:09:37 2013 MCS  mcs.mcspu.telemetry.sample_dropped null EL axis got a sample_dropped exception when it did a commit_sample call. 

These messages are in the middle of a telemetry file. The file: 201305080833.mcspu.el.servo_data.h5 has a start time of "Wed May 8 08:33:49 2013" and end time of "Wed May 8 11:19:49 2013" -- this is much longer than it typically takes servo data to roll over. And there's just the two complaints.
Maybe these are just because telemetry couldn't keep up?

Are AZ/EL the only TCS files that rollover during the night due to size?
PCS trajectories rolls over a couple times a night
PMC servoloops a couple times

Any complaints in the jet logs for these? -- NO, no complaints in the jet syslogs except the sample buffer full messages
there is a notify_disconnected method defined in MCSPU - but we don't get the "collection disconnected" message in the syslog
haven't we seen problems like this before with the MCSPU not getting called back?

Is the consumer thread somehow going away quietly?
need to look in the collection library for when this thread might exit, aside from the "giving up on telemetry" scenario.

Tom went through the complaints that involve just 1, 2, or 3 dropped samples (as opposed to the ones that go on for hours with thousands of dropped samples) and found that nearly all of them happen between 10:00 UT and 10:50 UT. That seems unlikely to be a coincidence. What happens at 3AM every day?
The home directory backup!

telemetry error                       messages file
may  7 - 21:09
may  8 - 10:08                        May  8 10:08:30 jet root: Home directory backup complete
may 14 - 10:38                        May 14 10:37:20 jet root: Home directory backup complete
may 15 - 10:44                        May 15 10:43:05 jet root: Home directory backup complete
may 18 - 10:50                        May 18 10:49:20 jet root: Home directory backup complete
may 19 - 10:39                        May 19 10:39:22 jet root: Home directory backup complete
may 20 - 10:44                        May 20 10:43:04 jet root: Home directory backup complete
may 24 - 10:31                        May 24 10:30:03 jet root: Home directory backup complete
may 25 - 10:12                        May 25 10:11:06 jet root: Home directory backup complete
may 29 - 10:24                        May 29 10:23:52 jet root: Home directory backup complete
may 30 - 10:32                        May 30 10:32:18 jet root: Home directory backup complete
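
The correlation in the table can be checked with a few lines of Python. The times are copied from the table; the May 7 error has no matching backup line and is left out. The UT-to-local conversion assumes MST is UT minus 7 hours with no daylight saving, which matches the "10:00 UT is 3AM" reading in the text. `minute_of_day`, `minute_gaps`, and `ut_to_mst` are hypothetical helper names.

```python
# (day of May 2013, telemetry error HH:MM, backup-complete HH:MM:SS), all UT.
EVENTS = [
    (8, "10:08", "10:08:30"),
    (14, "10:38", "10:37:20"),
    (15, "10:44", "10:43:05"),
    (18, "10:50", "10:49:20"),
    (19, "10:39", "10:39:22"),
    (20, "10:44", "10:43:04"),
    (24, "10:31", "10:30:03"),
    (25, "10:12", "10:11:06"),
    (29, "10:24", "10:23:52"),
    (30, "10:32", "10:32:18"),
]


def minute_of_day(clock):
    """Convert 'HH:MM' or 'HH:MM:SS' to minutes since midnight (seconds ignored)."""
    hours, minutes = clock.split(":")[:2]
    return int(hours) * 60 + int(minutes)


def minute_gaps(events):
    """Minutes from the backup-complete minute to the telemetry-error minute."""
    return [minute_of_day(err) - minute_of_day(done) for _, err, done in events]


def ut_to_mst(clock):
    """Convert a UT clock time to MST, assumed to be UT minus 7 hours."""
    total = (minute_of_day(clock) - 7 * 60) % (24 * 60)
    return "%02d:%02d" % divmod(total, 60)


print(minute_gaps(EVENTS))  # [0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
print(ut_to_mst("10:00"))   # 03:00
```

At the one-minute resolution of the table, every listed error falls in the same minute as the backup completion message or the minute after it.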

  • why just a single sample dropped?? and only on jet.
  • the consumer decides based on how many in buffer whether to do one sample or multiples - could there be some timing issue associated with the exception? maybe the time it takes to handle the exception is long enough to free up space in the buffer?
  • where are these samples in relation to the HDF5 file? early in file? last one chris checked was about 1.5 hours into the file
  • mcspu buffers hold about 144sec of data (10Hz)

4. OSS trying to write a stream with an invalid name.
This is an application problem and goes away with the next build of the OSS subsystem.

Topic attachments
  • P021.pdf (30 K, 07 Apr 2016): ADASS 2013, Evolution of the LBT Telemetry System
Topic revision: r36 - 12 Jun 2020, CarlHolmberg