These notes apply to the MySQL database version of telemetry collection. This page does not apply to any telemetry data collected after the end of January 2013.

Telemetry Data Maintenance Procedures

Tony's notes from May 2012. Kellee has added a few notes about telroot mail usage and the current slave backup procedures, and changed the tel-store-mtn references to tel1.

Background

Telemetry logically consists of two databases. The tel_streams database contains all of the samples; each table contains one and only one telemetry stream. The tel_metadata database contains all of the descriptive data, including the structural data for the samples in the tel_streams tables. In particular, the streams table in tel_metadata maps each table in the tel_streams database to the telemeter that records its samples.
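As a purely illustrative example (the streams table is real, per the above, but the column names here are hypothetical since the actual schema is not documented on this page), looking up which tel_streams table holds a given telemeter's samples would be a query along these lines:
[root@tel-collect ~]# mysql tel_metadata -e "SELECT table_name FROM streams WHERE telemeter = 'OVMS'"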

Data come from the clients (TCS, MCSPU, and OVMS) and are stored on the mountain DB server, tel-collect.mountain.lbto.org. The data on tel-collect are immediately accessible through the Tucson web server, telemetry.lbto.org, and are immediately replicated by the Tucson DB server, tel1.tucson.lbto.org. Once a week, the last week's binary logs on tel1 are archived to tape so that they can be used to rebuild the telemetry databases on the Tucson SAN.
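Replication health on tel1 can be checked with the stock MySQL slave status query (nothing here is specific to our setup):
[root@tel1 ~]# mysql -e "SHOW SLAVE STATUS\G" | egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'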

The data store activities are largely automated, with one glaring hole: the tel_streams databases on tel-collect and tel1 will fill up unless old samples are removed. At one time this was automated, but Tony stopped the automation on both servers, for different reasons. On tel-collect, deleting data from tables substantially increases the load on the server, which interfered with the collection of new samples from high-throughput telemeters; when this happened during science operations, it annoyed the OSAs by filling the TCS event log with messages. On tel1, the automation would sometimes cause the LVM snapshot to fill up with changes.

As of Dec-2012, we are truncating the tel_streams database on the mountain every day - see the /local/dbadmin/truncate_tables cron job on tel-collect.mountain.lbto.org. Nothing is automatically cleaning up the Tucson slave database.

Telemetry Administrator E-Mail Group

The telemetry servers and tel1.tucson.lbto.org forward root email messages to the telemetry administrator email group, telroot@lbto.org. Each server runs the logwatch service, which mines the server logs daily and emails root a summary. Among other things, the summary reports the remaining capacity of the server's filesystems.

Root email arrives regarding disk space usage, log rotation, purging, updates of the leap seconds file, and backups.

Instead of manually inspecting the servers to learn whether data needs to be purged from their databases, you can add yourself to the telroot group and watch these messages. Currently, the telroot group contains shooper and ksummers.
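If you do want to inspect a server by hand, the filesystems to watch are the database filesystems named in the purge sections below, e.g.:
[root@tel-collect ~]# df -h /dev/sdb1
[root@tel1 ~]# df -h /tel-db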

In addition to the mountain machine (tel-collect.mountain.lbto.org), cron jobs are still sending email from the following machines, which are not in use:
  • telemetry.lbto.org
  • tel-store-mtn.tucson.lbto.org (this machine was the replication client before tel1)
  • tel-test.tucson.lbto.org

A useful email is one sent from root@telemetry.tucson.lbto.org regarding the leap seconds file. A weekly cron job contacts a server at the University of Colorado and copies a "leap seconds" file to a local disk on telemetry.tucson.lbto.org in /lbt/UT (see the email below). However, since this is a local disk and not an NFS or SAN disk, it is not clear that this is the right place for the file, or that the software properly reads it from there.

-------- Original Message --------
Subject:    Cron run-parts /etc/cron.weekly
Date:    Sun, 2 Dec 2012 04:22:02 -0700
From:    root@telemetry.tucson.lbto.org (Cron Daemon)
To:    root@telemetry.tucson.lbto.org


/etc/cron.weekly/get_tai_offsets:

--2012-12-02 04:22:01--  ftp://utcnist.colorado.edu/pub/leap-seconds.list
           => `/lbt/UT/leap-seconds.list.new'
Resolving utcnist.colorado.edu... 128.138.140.44
Connecting to utcnist.colorado.edu|128.138.140.44|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD /pub ... done.
==> SIZE leap-seconds.list ... 9376
==> PASV ... done.    ==> RETR leap-seconds.list ... done.
Length: 9376 (9.2K)

     0K .........                                             100% 83.0K=0.1s

2012-12-02 04:22:02 (83.0 KB/s) - `/lbt/UT/leap-seconds.list.new' saved [9376]
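For reference, /etc/cron.weekly/get_tai_offsets presumably boils down to something like the sketch below, reconstructed from the wget output above (the actual script may differ, e.g. in how it installs the downloaded file):
#!/bin/sh
# Fetch the latest leap-seconds list from NIST and stage it locally.
wget -O /lbt/UT/leap-seconds.list.new ftp://utcnist.colorado.edu/pub/leap-seconds.list &&
    mv /lbt/UT/leap-seconds.list.new /lbt/UT/leap-seconds.list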

Michele notes that in June, when the database was still on tel-store-mtn, she got three useful emails from root@tel-store-mtn.tucson.lbto.org indicating that the prebackup script was being run, that the backup was about to begin, and that the postbackup script was being run. Kellee has since debugged the pre/post scripts on tel1, and the pre-backup script now sends a message when it has copied the files to the binlog directory to be backed up.

Purging Data from tel-collect.mountain.lbto.org

Currently (Dec-2012) we run the /local/dbadmin/bin/truncate_tables script as a daily cron job. The script is supposed to check the time and delete samples older than a calculated date, but all of its guts are commented out and it just does a TRUNCATE TABLE on every table in the database. That means that every day at 10am we truncate all of the mountain database tables. Since the script turns off binary logging first, the truncates are not replicated to Tucson.
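In effect, the script now reduces to something like this sketch (not the literal script; that lives in /local/dbadmin/bin):
#!/bin/sh
# Truncate every table in tel_streams. sql_log_bin is turned off in each
# session so the TRUNCATEs never reach the binary log or replicate to tel1.
mysql -N tel_streams -e "SHOW TABLES" | while read t ; do
    mysql tel_streams -e "SET sql_log_bin = 0; TRUNCATE TABLE \`$t\`;"
done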

Tony originally suggested the following:
To keep the filesystem holding the database on tel-collect from filling up, samples older than three weeks occasionally need to be deleted from the streams tables. I usually do this when the /dev/sdb1 filesystem is more than 80% used.

I've written a script called truncate_tables that removes all of the samples in the tel_streams database older than three weeks. It is located on tel-collect in the /local/dbadmin/bin directory. It can be used as follows.
[root@tel-collect ~]# rm -f nohup.out
[root@tel-collect ~]# nohup /local/dbadmin/bin/truncate_tables &

I use nohup since this will take a long time, maybe a day or two.

The script emails telroot when it finishes. Don't trust the success information in the email; it's a relic from when the script did more. When the script finishes, scan the file nohup.out to see if there were any issues.

This script puts a lot of load on tel-collect, slowing the collection of telemetry. The PMCs will probably write 'sample buffer full' messages to the event log occasionally while the script is running. If the database needs to be purged during science time, try to start the script first thing in the morning. Also, consider temporarily disabling the PMC telemetry collection for the nights the script is running. This can be done through the PMC GUIs. Chris can show you how.

Purging Data from the Slave Tucson Database

Before the database moved to the SAN, Tony routinely deleted all samples from the database on tel-store-mtn.tucson.lbto.org, typically when the /dev/mapper/mysql_vg-mysql_lv filesystem was more than 50% used. This should not be done while the log files are being archived to tape: it will cause the LVM snapshot to fill up, which causes problems.

An LVM snapshot was also being taken daily.

Tony had a truncate_tables script that removes all of the samples from the tel_streams database. It is located on tel-store-mtn in the /lbt/telemetry directory and can be used as follows.
[root@tel-store-mtn ~]# rm -f nohup.out
[root@tel-store-mtn ~]# nohup /lbt/telemetry/truncate_tables &

I use nohup since this will take some time.

This script does not email telroot when it completes, so you'll have to monitor it. When it finishes, scan the file nohup.out for any problems.

As of Summer-2012, we are not purging data from the slave database running on tel1.tucson.lbto.org.

Snapshot - Client Backup

The tel1.tucson.lbto.org machine runs the slave SQL server.

We are running a SAN snapshot of the db directory every evening (at 7? - ask Stephen). If we lose the db, but not the SAN, this is the simplest way to recover.

The snapshot job is not running the safedb.sh and unsafedb.sh commands, which stop and restart replication before and after it runs. It used to, but this was turned off a while back when it was suspected of causing replication problems; we might want to put it back in.

Archiving to Tape - Client Backup

The tel1.tucson.lbto.org machine runs the slave SQL server.

The files recording transactions done on the slave database are in /tel-db/mysql/tel1-bin.NNNNNN. Each file is just over 1 GB.
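The newest file or two are still in active use by the server, which is why the pre-backup script described below leaves them alone. You can ask the server which binlog file it is currently writing:
[root@tel1 ~]# mysql -e "SHOW MASTER STATUS"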

We only do full database backups periodically, since they take days to perform. The binary logs are backed up weekly as incremental backups so that they can be applied to a database backup. Every Sunday, an automated backup job run by the backup initiator archives the tel1 binary logs. Before it backs up the logs, it calls the script /tel-db/pre-backup.sh, which takes an LVM snapshot of the MySQL data and copies the log files to a binlog directory; the backup job backs up from there, copying the snapshot to tape. When the job finishes copying the logs to tape, it can call the script /tel-db/post-backup.sh, which deletes the snapshot and all of the copied binary logs; however, we are running that script manually after verifying the backup.

The pre-backup script sends a message to telroot saying it moved files through tel1-bin.NNNNNN.

Unfortunately, even though the archiving of telemetry is automated, the backup initiator fails often; most commonly it simply does not run at all. Kyle usually handles this by starting the job manually on Monday. If it has been more than a day since the pre-backup.sh email was received, ask Kyle whether the telemetry backup job finished. If it did, you'll need to run the post-backup.sh script manually.

The scripts used are in /tel-db:
  • pre-backup.sh - stops the slave with safedb.sh, then links all but the last two tel1-bin.NNNNNN files into the binlog directory
  • post-backup.sh - run this once the backup is verified; it deletes the binlog directory files and purges the binary logs in the database up to the given logfile name
  • safedb.sh - called by pre-backup.sh; turns the slave off with a flush-logs (see the sketch below)
  • unsafedb.sh - called by pre-backup.sh; turns the slave back on
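Based on the descriptions above, safedb.sh and unsafedb.sh presumably amount to something like the following (sketches only; the real scripts are in /tel-db and may do more):
# safedb.sh - pause replication and rotate to a fresh binary log
mysql -e "STOP SLAVE; FLUSH LOGS;"
# unsafedb.sh - resume replication
mysql -e "START SLAVE;"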

Procedure
  1. The weekly scheduled backup should run the pre-backup.sh script and send email to telroot saying which binlog it backed up through.
  2. After the pre-backup script has run and the backup is in progress, you can see the files in /tel-db/binlog.
  3. When the backup completes (>12 hours) and has been verified, run the post-backup.sh script with an argument of one filename past the last backed-up file.
    For instance, if the last file in the backup was tel1-bin.001188, use tel1-bin.001189 so that the purge command deletes everything through 001188 (the equivalent SQL is shown below):
    post-backup.sh tel1-bin.001189
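For the curious, the purge that post-backup.sh performs corresponds to the standard MySQL statement below; with the example argument above, it deletes tel1-bin.001188 and everything before it, leaving 001189 onward in place:
mysql> PURGE BINARY LOGS TO 'tel1-bin.001189';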

If this isn't done before the next backup job runs, that job will hang or fail outright.