Please note that the following notes apply to the MySQL database version of telemetry collection.
This page does not apply to any telemetry data collected since the end of January-2013.
Telemetry Data Maintenance Procedures
Tony's notes from May-2012.
Kellee has added a few notes about the mail usage, the current slave backup procedures, and what has changed.
Telemetry logically consists of two databases. The tel_streams database contains all of the samples; each table contains one and only one telemetry stream. The tel_metadata database contains all the descriptive data, including the structural data of the samples in the tel_streams database tables. In particular, the tel_metadata streams table maps the tables in the tel_streams database to the telemeter that records the samples.
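As a concrete illustration, the mapping could be inspected with a query along these lines (the column names stream_table and telemeter are assumptions — check the actual tel_metadata schema first):

```shell
# Hypothetical query against the tel_metadata streams table; the column
# names (stream_table, telemeter) are assumptions, not the real schema.
Q="SELECT stream_table, telemeter FROM tel_metadata.streams ORDER BY telemeter;"

# On the DB server this would be run as:
#   mysql -e "$Q"
echo "$Q"
```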
Data come from the clients (e.g., TCS) and are stored on the mountain DB server. The data are immediately accessible through the Tucson web server, and are also immediately replicated by the Tucson DB server. Once a week, the last week of binary logs on the Tucson server are archived to tape so they can be used to rebuild the new telemetry databases on the Tucson SAN.
The data store activities are largely automated, with one glaring hole: the tel_streams database on the mountain server will fill up unless the old samples are removed. At one time this was automated, but Tony stopped that on both servers, for different reasons. On the mountain server, the automation was stopped because deleting data from tables substantially increases the workload on the server, which would interfere with the collection of new samples from high-throughput telemeters. When this happened during science operations, it annoyed the OSAs by causing lots of messages to be written to the TCS event log. On the Tucson server, the automation was stopped because it would sometimes cause the LVM snapshot to fill up with changes.
As of Dec-2012, we are truncating the tel_streams database on the mountain every day - see the truncate_tables daily cron job on the mountain server. Nothing is automatically cleaning up the Tucson slave database.
Telemetry Administrator E-Mail Group
The telemetry servers forward root email messages to the telemetry administrator email group. Each server runs the logwatch service, which mines data on a daily basis from the server logs and emails root a summary. As part of the summary, logwatch reports the remaining capacity of the server's filesystems. Other automated emails concern disk space usage, log rotation, purging, the update of the leap seconds file, and backups.
Instead of manually inspecting the servers to learn if data needs to be purged from their databases, you could add yourself to the telroot group. Currently, in addition to the mountain machine, the cron jobs are sending email for the following machines, which are not in use:
- tel-store-mtn.tucson.lbto.org (this machine was the replication client before the database moved to the SAN)
A useful email regarding the leap seconds file is sent by a weekly cron job. This cron job talks to an FTP server at the University of Colorado and copies a "leap seconds" file to a local directory, /lbt/UT (see the email below). However, this is a local disk and not an NFS or SAN disk, so I am not sure this is a good thing, or whether it is properly read by the software.
-------- Original Message --------
Subject: Cron run-parts /etc/cron.weekly
Date: Sun, 2 Dec 2012 04:22:02 -0700
From: firstname.lastname@example.org (Cron Daemon)
--2012-12-02 04:22:01-- ftp://utcnist.colorado.edu/pub/leap-seconds.list
Resolving utcnist.colorado.edu... 220.127.116.11
Connecting to utcnist.colorado.edu|18.104.22.168|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD /pub ... done.
==> SIZE leap-seconds.list ... 9376
==> PASV ... done. ==> RETR leap-seconds.list ... done.
Length: 9376 (9.2K)
0K ......... 100% 83.0K=0.1s
2012-12-02 04:22:02 (83.0 KB/s) - `/lbt/UT/leap-seconds.list.new' saved 
Michele notes that in June, when the database was still on tel-store-mtn, she actually got three useful emails indicating: the prebackup script was being run, the backup was about to begin, and the postbackup script was being run.
Kellee has debugged the pre/post scripts, and the pre-backup script now sends a message when it has copied the files to be backed up.
Purging Data from tel-collect.mountain.lbto.org
Currently (Dec-2012) we are running the truncate_tables script as a daily cron job. The script is supposed to check the time and delete samples older than a calculated date, but all of the guts are commented out, and it is just doing a TRUNCATE on every table in the db. That means every day at 10am we are truncating all the mountain database tables. Since the script turns off binary logging first, the truncate is not replicated to Tucson.
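The effective behavior of the current cron job can be sketched as follows (the table names and emitted SQL are illustrative; the real script enumerates the tables itself):

```shell
# Sketch of what the daily cron job effectively does: emit SQL that turns
# off binary logging for the session and truncates every tel_streams table.
# The table list here is a stand-in; the real script gets it from the db.
truncate_sql() {
    echo "SET sql_log_bin = 0;"          # keep the truncates out of the binlog
    for t in "$@"; do
        echo "TRUNCATE TABLE tel_streams.$t;"
    done
}

# On the server this output would be piped into mysql, e.g.:
#   truncate_sql table_a table_b | mysql
truncate_sql example_stream_1 example_stream_2
```

Turning off sql_log_bin for the session is what keeps the truncates from replicating to the Tucson slave.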
Tony originally suggested:
To prevent the filesystem holding the database on tel-collect from filling up, samples older than three weeks occasionally need to be deleted from the streams tables. I usually do this when the /dev/sdb1 filesystem is more than 80% used.
I've written a script called truncate_tables that removes all of the samples in the tel_streams database older than three weeks. It is located on tel-collect in the /local/dbadmin/bin directory. It can be used as follows.
[root@tel-collect ~]# rm -f nohup.out
[root@tel-collect ~]# nohup /local/dbadmin/bin/truncate_tables &
I use nohup since this will take a long time, maybe a day or two.
The script emails telroot when it finishes. Don't trust the success information in the email; it's a relic from when the script did more. When the script finishes, scan the file nohup.out to see if there were any issues.
This script puts a lot of load on tel-collect, slowing the collection of telemetry. The PMCs will probably write 'sample buffer full' messages to the event log occasionally while the script is running. If the database needs to be purged during science time, try to start the script first thing in the morning. Also, consider temporarily disabling the PMC telemetry collection for the nights the script is running. This can be done through the PMC GUIs. Chris can show you how.
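The two checks Tony describes above — the 80% filesystem threshold and the three-week cutoff — could be sketched like this (requires GNU date; the DELETE shown is only illustrative, since the real truncate_tables script knows the actual table layout):

```shell
# Compute the purge cutoff: three weeks before a given date (GNU date).
cutoff_date() {
    date -d "$1 - 21 days" +%F
}

# Extract the use% of a filesystem from `df` output, as a bare number.
disk_use_pct() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Example: purge when the filesystem is >80% used. The DELETE below is a
# sketch only -- the timestamp column name is an assumption.
if [ "$(disk_use_pct /)" -gt 80 ]; then
    echo "DELETE FROM tel_streams.some_table WHERE ts < '$(cutoff_date "$(date +%F)")';"
fi

cutoff_date 2012-12-02   # -> 2012-11-11
```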
Purging Data from the Slave Tucson Database
Prior to the database moving to the SAN, Tony routinely deleted all samples from the database on tel-store-mtn. He typically did this when the /dev/mapper/mysql_vg-mysql_lv filesystem was more than 50% used. This should not be done while the log files are being archived to tape; it will cause the LVM snapshot to fill up, causing problems. They were also doing a daily LVM snapshot.
As of Summer-2012, we are not purging data from the slave database. Tony had a truncate_tables script that removes all of the samples from the tel_streams database. It is located on tel-store-mtn in the /lbt/telemetry directory. It can be used as follows.
[root@tel-store-mtn ~]# rm -f nohup.out
[root@tel-store-mtn ~]# nohup /lbt/telemetry/truncate_tables &
I use nohup since this will take some time.
This script does not email telroot when it completes, so you'll have to monitor it. When it finishes, scan the file nohup.out for any problems.
Snapshot - Client Backup
The Tucson machine runs the slave SQL server.
We are running a SAN snapshot of the db directory every evening (at 7? - ask Stephen). If we lose the db, but not the SAN, this is the simplest way to recover.
The snapshot is not bracketed by commands which stop/start replication before and after it runs. It used to be, but this was turned off awhile back when it was suspected of causing replication problems - we might want to put it back in.
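If the stop/start were put back in, the wrapper might look like this sketch (the snapshot command is a placeholder, and run() only echoes the commands so the sketch is safe to try; drop it to execute for real):

```shell
# Sketch: bracket the nightly snapshot with STOP/START SLAVE so the
# snapshot sees a quiescent database. run() echoes instead of executing;
# take_san_snapshot is a placeholder for the actual SAN snapshot step.
run() { echo "+ $*"; }

run mysql -e "STOP SLAVE"
run take_san_snapshot /path/to/db
run mysql -e "START SLAVE"
```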
Archiving to Tape - Client Backup
The Tucson machine runs the slave SQL server.
The files recording transactions done on the slave database are the binary logs; each file is just over 1GB.
We only do full database backups periodically since they take days to perform. The binlogs are backed up weekly as incremental backups so that they can be applied to a database backup. Every Sunday, an automated backup job is run by the backup initiator that archives the binary logs. Before it backs up the logs, it calls the pre-backup.sh script, which takes an LVM snapshot of the MySQL data; the backup job actually copies the snapshot to tape. The pre-backup.sh script copies the files to a staging directory, and the backup job backs up from there. When the job finishes copying the logs to tape, it can call the post-backup.sh script (which deletes the snapshot and all of the copied binary logs), but we are running that script manually after verifying the backup. The pre-backup script sends a message to telroot saying which binlog it moved files through.
Unfortunately, even though the archiving of telemetry is automated, the backup initiator seems to fail often. Most commonly, it fails by not running at all. Kyle usually handles this by starting the job manually on Monday. If it has been more than a day since the pre-backup.sh email was received, ask Kyle if the telemetry backup job finished. If he says that it did, you'll need to manually execute the post-backup.sh script.
The scripts used:
- pre-backup.sh - stops the slave with safedb.sh, links all but the last two tel1-bin.NNNNNN files into the directory to be backed up
- post-backup.sh - should be run when the backup is verified; deletes the binlog directory files and does a PURGE BINARY LOGS up to the given logfile name in the db
- safedb.sh - called by pre-backup, turns the slave off
- unsafedb.sh - called by pre-backup, turns the slave back on
- The weekly scheduled backup should run the pre-backup.sh file and send email to telroot noting which binlog it backed up through.
- After the pre-backup script has run and the backup is in progress, you can see the files staged for backup.
- When the backup completes (>12 hours) and is verified, run the post-backup.sh script with the argument of one filename past the last backed-up file. For instance, if the last file in the backup was tel1-bin.001188, use tel1-bin.001189 so that the purge command deletes all the way through 001188.
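The purge step can be sketched like this — next_binlog computes the "one past" file name, and PURGE BINARY LOGS TO is standard MySQL (the echo here is a dry run; pipe the statement into mysql for real):

```shell
# Given the last binlog file on the backup, compute the next file name and
# emit the purge statement. PURGE BINARY LOGS TO keeps the named file, so
# passing tel1-bin.001189 deletes everything through tel1-bin.001188.
next_binlog() {
    prefix="${1%.*}"                       # e.g. tel1-bin
    num="${1##*.}"                         # e.g. 001188
    printf '%s.%06d\n' "$prefix" "$((10#$num + 1))"
}

last=tel1-bin.001188
echo "PURGE BINARY LOGS TO '$(next_binlog "$last")';"
# -> PURGE BINARY LOGS TO 'tel1-bin.001189';
```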
If this isn't done before the next backup job, the next backup job will hang or fail outright.