Please note that the following notes apply to the MySQL database version of telemetry collection.
This page does not apply to any telemetry data collected since the end of January 2013.

Desired Configuration

This is the desired configuration after the migration to the SAN and the clustered data store.

MySQL 5.1 as the DBMS

All DBMSs should be MySQL 5.1. MySQL 5.1 provides partitioning and event scheduling. Telemetry needs both features: partitioning to limit the growth of the backup window as the tel_streams database grows, and event scheduling to automate the resulting maintenance tasks.
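
As a quick sanity check, both features can be confirmed from a client session, roughly as below; this assumes the stock MySQL 5.1 server, and the event scheduler could equally be enabled in the server configuration file.

SHOW PLUGINS;                     -- 'partition' should be listed with status ACTIVE
SET GLOBAL event_scheduler = ON;  -- or set event_scheduler=ON in tel-db.cnf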

Master-slave configuration

The data store will be deployed on two machines, one on the mountain and one in Tucson, in a master-slave configuration. The mountain data store will keep only the most recent year of telemetry data. The Tucson data store will keep all telemetry data.
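
For reference, pointing the Tucson slave at the mountain master comes down to a CHANGE MASTER TO statement roughly like the one below; the replication account, password, and log coordinates shown are placeholders, not the real values.

CHANGE MASTER TO
    MASTER_HOST = '128.196.248.94',   -- public address of the mountain DBMS (see Deployment)
    MASTER_PORT = 3306,
    MASTER_USER = 'repl',             -- placeholder replication account
    MASTER_PASSWORD = '???',
    MASTER_LOG_FILE = 'tel-bin.000001',
    MASTER_LOG_POS = 4;
START SLAVE;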

bin_log_rotate script

Replication happens by the slave reading the master's binary log and executing the queries on its databases. Since a master may have multiple slaves, there is no general way for it to know when all of the slaves have executed a given entry in the binary log. This means the master needs to be told to purge its binary log. I've written a script called bin_log_rotate that does this, but it needs to be called daily by the cron daemon on the mountain server.

As part of the backup strategy, this script keeps two extra days of log files to help recover data on the Tucson store that was collected between snapshots of the SAN.
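
Roughly speaking, the script comes down to a single statement like the one below; the exact retention arithmetic lives in bin_log_rotate itself.

-- Purge binary logs the slave has already applied, keeping two extra days
-- as a safety margin for recovery between SAN snapshots.
PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 2 DAY);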

Each database server is an active-passive cluster

Both in Tucson and on the mountain, the database server will be a two node, active-passive cluster.

Stream tables partitioned by sample time

Splitting all of the stream tables by sample time provides several benefits. It decreases search times, because only the partitions containing relevant sample times are searched. It will also limit the duration of the backup window, because once a partition ceases to be modified, it no longer needs to be backed up.

Here's the partitioning strategy; an illustrative table definition follows the list.

  • Partitioning will be done yearly with each year beginning on August 1 00:00 UTC of the same calendar year. For example, partition year 2010 begins 2010/08/01 00:00 UTC. This date was chosen because it's in the middle of summer shutdown, a natural boundary.

  • The current partition year will be subdivided by month; there is no single partition covering the current partition year. Instead there will be a group of monthly partitions, each beginning at the start of its respective month in UTC.

  • The current month is further subdivided by week, so there is no current month partition. Instead there will be a group of weekly partitions, each beginning on a Sunday at 00:00 UTC.
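
To make the scheme concrete, here is an illustrative table definition, assuming the current month is October 2010. The column names, the DATETIME column, and the TO_DAYS partitioning function are assumptions made for this sketch; the real stream tables and their boundaries are created and maintained by the data store's stored procedures.

CREATE TABLE stream_example (
    sample_time DATETIME NOT NULL,  -- samples are actually time-stamped in TAI
    value       DOUBLE
) ENGINE = MyISAM
PARTITION BY RANGE (TO_DAYS(sample_time)) (
    PARTITION y2009       VALUES LESS THAN (TO_DAYS('2010-08-01')),  -- previous partition year
    PARTITION m2010_08    VALUES LESS THAN (TO_DAYS('2010-09-01')),  -- months of the current
    PARTITION m2010_09    VALUES LESS THAN (TO_DAYS('2010-10-01')),  -- partition year
    PARTITION w2010_10_03 VALUES LESS THAN (TO_DAYS('2010-10-10')),  -- weeks of the current
    PARTITION w2010_10_10 VALUES LESS THAN (TO_DAYS('2010-10-17')),  -- month, starting on Sundays
    PARTITION pmax        VALUES LESS THAN MAXVALUE                  -- catch-all for new samples
);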

An apology for the complicated partitioning scheme

This complicated partitioning scheme is the unfortunate result of balancing constraints with the desire to minimize the number of partitions per table.

The yearly partition granularity comes from the need for a reasonable backup window.

The monthly partition granularity comes from the need to periodically remove old diagnostic data from the tel_streams database and to keep the partition management window reasonable. Old diagnostic data needs to be removed to reduce the amount of overall storage required for telemetry. The partition management window is the amount of time it takes to repartition all of the stream tables when new partitions are needed. This hasn't been measured, but if it is much longer than a day, the monthly partitions may need to be replaced with a year's worth of weekly partitions.

The weekly partition granularity comes from the desire for fast queries on the most recently collected samples. If the partition management window is much longer than a day, the weekly partitions may need to be replaced with a month's worth of daily partitions.

Maintaining the partitions

As time marches on, new partitions need to be added and old ones need to be merged or removed. Here are the basic maintenance tasks.

  • Before the beginning of every week, for each stream table, a new weekly partition needs to be added to hold the samples collected during the upcoming week. Also, the last weekly partition holding samples around a month old needs to be merged into the most recent monthly partition.

  • Before the end of every month, for each stream table, a new monthly partition needs to be added to hold the samples that are currently in the weekly partitions. Also, the last monthly partition holding samples over a year old needs to be merged into the most recent yearly partition, or deleted if the samples are diagnostic.

  • Before the end of every partition year, for each stream table that doesn't hold diagnostic samples, a new yearly partition needs to be added to hold the samples that are currently in the monthly partitions.

The installation script schedules an event that the DBMS will execute daily to perform the partition maintenance tasks. The event uses a collection of stored procedures to perform these tasks and to cover the numerous corner cases not described here.
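
As an illustration of how the pieces connect, the event definition looks roughly like the following; the event and procedure names are placeholders, and the real definitions are generated into init_dbms.sql by the build.

CREATE EVENT daily_partition_maintenance
    ON SCHEDULE EVERY 1 DAY
    DO CALL maintain_stream_partitions();  -- placeholder name for the real stored procedures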

Adapt collection library for partitioning

The collection library will need to be updated so that when a new telemeter is registered, the table that is created to hold the samples is already a partitioned table. This isn't urgent, because the partition maintenance logic will partition unpartitioned tables. To keep the partitioning logic confined to the data store, the table creation logic that is currently in the collection library should be moved to the data store.

UTC leap second management by data store

Maintaining the partitions requires the ability to convert TAI times to UTC. To allow the maintenance tasks to be automated within the DBMS, stored procedures were added to do this conversion, along with a table that holds the current leap second information.
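
A minimal sketch of that machinery is shown below; the table and function names, the column layout, and the DATETIME arguments are assumptions made for the illustration, not the definitions actually installed by init_dbms.sql.

CREATE TABLE leap_seconds (
    utc_start    DATETIME NOT NULL,  -- UTC instant at which this offset takes effect
    tai_utc_secs INT      NOT NULL,  -- TAI minus UTC, in seconds, from that instant on
    PRIMARY KEY (utc_start)
);

DELIMITER //
CREATE FUNCTION tai_to_utc(tai DATETIME) RETURNS DATETIME
    READS SQL DATA
BEGIN
    DECLARE offs INT;
    -- Using the TAI time itself to look up the offset is off by at most one
    -- step right at a leap-second boundary, which is fine for partitioning.
    SELECT tai_utc_secs INTO offs
      FROM leap_seconds
     WHERE utc_start <= tai
     ORDER BY utc_start DESC
     LIMIT 1;
    RETURN DATE_SUB(tai, INTERVAL offs SECOND);
END //
DELIMITER ;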

update_leaps script

To automate keeping the leap seconds current, I've written a script called update_leaps. This script should be called weekly by cron on the mountain database server. The changes made to the mountain leap seconds table will be replicated to Tucson.

Convert collection library and web server to using data store for leap seconds

At the moment, the collection library and the access web application each have their own way of keeping leap second information current. Both should be converted to retrieve the current leap second information from the data store. This simplifies the deployment of telemetry by removing the need for /lbt/UT.

check_tables script

MyISAM database tables need to be routinely checked for corruption. I've written a script called check_tables that will check all of the stream tables for corruption and will send an email to telroot@lbto.org with the results. A cron job on each database server should be set up to call this script weekly.
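
The heart of the script is MySQL's CHECK TABLE run over every stream table, along these lines; the stream table name below is made up.

-- Find the MyISAM stream tables, then check each one; the script mails the
-- combined results to telroot@lbto.org.
SELECT table_name
  FROM information_schema.tables
 WHERE table_schema = 'tel_streams' AND engine = 'MyISAM';

CHECK TABLE tel_streams.stream_00042 MEDIUM;  -- hypothetical table name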

Backup strategy

Both the mountain and Tucson database servers will use a SAN for storage. The SAN is configured to take daily snapshots and hold onto the snapshots for one week. N.B. I'm not sure that a database will remain consistent unless it is locked and all of its tables have been flushed prior to the snapshot being taken.

For the Tucson data store, changes made to the databases between the last snapshot and a failure can be recovered using the binary log on the mountain. For the mountain data store, the changes would need to be recovered from the tables on the Tucson data store.

In Tucson only, for disaster recovery purposes, the database will be backed up weekly. The backup will include the tel_metadata database, and the weekly, monthly and most recent yearly partitions of the tel_streams database. The previous yearly partitions should not be backed up, because they haven't changed.

Locking databases prior to backup

The Tucson databases will need to be locked and all of their tables flushed prior to the backup. Since a backup may take a while, the filesystem shouldn't be backed up directly. Instead, a snapshot should be taken, and the snapshot should be backed up. Here is an outline of this strategy.

  1. Lock all databases and flush all open tables.
  2. Take a snapshot of all filesystems holding data that needs to be backed up.
  3. Unlock all databases.
  4. Back up the data.
  5. Delete the snapshot.

Unfortunately, I did not have time to automate this with a script.
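
For whoever does automate it, the SQL side is just a global read lock bracketing the snapshot; a minimal sketch follows, noting that the session holding the lock must stay connected until UNLOCK TABLES is issued.

FLUSH TABLES WITH READ LOCK;  -- step 1: flush open tables and block writes
-- step 2: take the SAN/filesystem snapshot here, outside the DBMS
UNLOCK TABLES;                -- step 3: release the lock
-- steps 4 and 5: back up the snapshot, then delete it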

Web server

There should be a web server on the mountain as well as in Tucson. Each web server will host the telemetry access web application and will be configured to use the database server at the same site. For instance, the Tucson web server will use the Tucson database server to access telemetry data.

Each web server will be a two node, active-passive cluster. It should use the same two machines as the database server, but with the opposite machine being active. For instance, if the nodes are named tel1 and tel2, and the database server prefers to be active on tel1, the web server should prefer to be active on tel2.

Deployment

Mountain

There are two server machines dedicated to telemetry: tel1.mountain.lbto.org and tel2.mountain.lbto.org. These form the nodes of the active-passive cluster hosting the database and web servers.

Database Server

The database server should have the domain name tel-db.mountain.lbto.org with address 192.168.38.11. The mountain firewall maps the public IP address 128.196.248.94 to the database server, so this is the address used by the Tucson DBMS to connect to the mountain DBMS over port 3306 during replication.

Web Server

The web server should have the domain name telemetry.mountain.lbto.org.

Common directory structure

The directory structure of the telemetry-related files on both machines should be as follows.

/tel-db/
    bin/
      bin_log_rotate
      check_tables
      telextract
      update_leaps

    current/

    etc/
      tel-db.cnf

    recent/
      bin-log/
      months/

/tel-tmp/

/tel-web/

The /tel-db/ directory holds the database server deployment. The bin/ subdirectory holds the housekeeping scripts. The current/ subdirectory holds the DBMS's primary data directory. The etc/ subdirectory holds the DBMS configuration file. The recent/bin-log/ subdirectory holds the binary logs. Finally, the recent/months/ subdirectory holds the monthly partitions.

The /tel-tmp/ directory is used by both the web server and the database server for temporary files.

The /tel-web/ directory holds the web server deployment.

Filesystems

The telemetry deployment requires a few dedicated filesystems. To make the clustering of the database server work, the DBMS's data needs to be shared by both nodes in the cluster. This is done by storing the data on the SAN. Also, the database server can create very large temporary files, so it is prudent to confine those temporary files to their own filesystem.

Here's a table that maps telemetry-related directories to their filesystems. It indicates the location of the filesystem, the size of the filesystem, and whether it is shared by the cluster nodes.

directory          filesystem location   size      shared
/tel-db/bin/       N/A                   < 1 GB    no
/tel-db/current/   SAN, fast disk        1000 GB   yes
/tel-db/etc/       N/A                   < 1 GB    no
/tel-db/recent/    SAN, slow disk        7000 GB   yes
/tel-tmp/          local disk            100 GB    no
/tel-web/          N/A                   < 1 GB    no

Data Store Setup

Build

Check out telemetry/branches/r17 from LBTO's version control repository. In the store/ directory, make the production target. For instance, the following could be done on a development machine.

[you@localhost ~]# svn co https://svn/repos/telemetry/branches/r17 tel17
[you@localhost ~]# cd tel17/store
[you@localhost store]# make production

This will create the subdirectory production/ with the following contents.

production/
   admin_test

   client/
      my.cnf

   primary/
      bin_log_rotate
      check_tables
      init_dbms.sql
      tel-db.cnf
      telextract
      update_leaps

   secondary/
      check_tables
      init_dbms.sql
      tel-db.cnf
      telextract

The production/primary/ subdirectory contains all the files needed for installing the data store.

Installation

The following table indicates where the product runtime files in production/primary/ should be installed in the /tel-db/ directory on both tel1 and tel2. The init_dbms.sql file is not deployed; it is only needed during the initialization of the DBMS.

product in primary/   installed location
bin_log_rotate        bin/bin_log_rotate
check_tables          bin/check_tables
telextract            bin/telextract
update_leaps          bin/update_leaps
tel-db.cnf            etc/tel-db.cnf

Once the runtime files are in place, the DBMS should be initialized. This should be done on only the active cluster node. tel1 is the preferred active node for the DBMS.

To initialize the DBMS, first make sure the mysql user owns everything inside the directories /tel-db/current and /tel-db/recent. For instance, the following could be done.

[root@tel1 ~]# chown -R mysql.mysql /tel-db/current /tel-db/recent

Next, the MySQL grant tables need to be initialized using the mysql_install_db command with the data directory being /tel-db/current. For instance, the following could be done.

[root@tel1 ~]# mysql_install_db --datadir=/tel-db/current

Next, the DBMS needs to be started.

[root@tel1 ~]# mysqld_safe --defaults-file=/tel-db/etc/tel-db.cnf &

Now the root account needs to be configured so that either node of the cluster can be used as the host for a root account connection, and root's password needs to be set to the standard telemetry root password. N.B. There is very little reason to not let root connect from anywhere, so I just open it up.

[root@tel1 ~]# mysql --defaults-file=/tel-db/etc/tel-db.cnf -u root
mysql> RENAME USER root@'tel1.mountain.lbto.org' TO root@'%';
mysql> SET PASSWORD FOR root@'%' = PASSWORD('???');
mysql> SET PASSWORD FOR root@localhost = PASSWORD('???');
mysql> SET PASSWORD FOR root@127.0.0.1 = PASSWORD('???');
mysql> quit

Next, the telemetry databases and accounts need to be added to the DBMS. The init_dbms.sql SQL script that was created during the build is used for this task. For instance, it might be done as follows.

[you@localhost store]# mysql -h 128.196.248.94 -u root -p'???' < production/primary/init_dbms.sql

Finally, on both tel1 and tel2, set up the housekeeping cron jobs as previously discussed.

[root@tel1 ~]# cd /etc/cron.daily
[root@tel1 cron.daily]# ln -s /tel-db/bin/bin_log_rotate
[root@tel1 cron.daily]# cd /etc/cron.weekly
[root@tel1 cron.weekly]# ln -s /tel-db/bin/check_tables
[root@tel1 cron.weekly]# ln -s /tel-db/bin/update_leaps 

The installation of the data store is complete. However, Stephen will need to update the cluster management before it can be used.

Web Server Setup

Build

Check out telemetry/branches/r17 from LBTO's version control repository. In the top directory, make the web target. For instance, the following could be done on a development machine.

[you@localhost ~]# svn co https://svn/repos/telemetry/branches/r17 tel17
[you@localhost ~]# cd tel17
[you@localhost tel17]# make web

This will produce the build/ subdirectory containing the telaccess.war file.

Installation

Unfortunately, I didn't have enough time to finish figuring out the configuration of the web server, so this is left as an exercise for the reader.

Client Setup

A collection client needs to have the MySQL client shared library installed. It will also need to have access to the collection shared library and the DBMS client configuration file.

N.B. Unfortunately, I've run out of time. The client configuration is not really changing, so it should be fairly obvious what it will be by inspecting how it currently is.

Converting old telemetry names

A few years ago, there was a lot of freedom in naming telemetry entities. Restrictions in various file formats that need to be supported forced restrictions on telemetry names. These restrictions are documented elsewhere.

Unfortunately, the databases could not be converted before the catastrophic failure last year. Once the databases have been fully restored, the names that don't obey the new restrictions will need to be converted to compliant versions. This should be done as part of the SAN migration.

Merging identical stream tables

When telemetry was first conceived, every time a telemeter was registered, a new telemetry stream was created. This meant that every time a client process was stopped and restarted, it would generate a new telemetry stream. There was a one-to-many mapping between telemeters and streams.

This quickly created tens of thousands of telemetry streams. Management became unwieldy. To improve the situation, telemetry was converted so that a new stream would be created only if the definition of the telemeter had changed. This seemed to work.

A few months ago, it was discovered that there was a bug in the code that determined whether or not a telemeter's definition had changed. Some telemeters were still getting a new stream every time they were registered. This has been fixed, but the old streams have not been combined.

Once the old, illegal telemetry names have been converted, streams with identical definitions should be merged into a single stream. This would be easier to do prior to partitioning.

New diagnostic requirement

At first, telemetry allowed for diagnostic streams to have a lifespan defined by the client. With the need to keep the backup window a reasonable duration, the lifespan needed to have an upper bound. To make things simple, the diagnostic lifespan has been fixed to be one year to coincide with the yearly partition.

This change has not been implemented. It should be implemented once the databases have been fully restored, but prior to the initial partitioning.

Web server on tel-test

To improve the testing of telemetry collection by the TCS on the test cluster, a test version of the telemetry web server should be deployed on tel-test.

-- TonyEdgin - 18 May 2012