Please note that the following notes apply to the MySQL database version of telemetry collection.
This page does not apply to any telemetry data collected after the end of January 2013.

Database Recovery Procedure

When the mountain telemetry database server, tel-collect, or the mountain SAN storage node loses power abruptly, the mountain databases are almost certainly corrupted. There is no guaranteed fix when this happens, but the following procedure usually works.

When power is restored to the mountain, but before the ovms server is powered on and before the MCSPU and TCS are started, do the following.

1) Power on tel-collect.

2) When tel-collect is fully up, log in as root and stop the mysqld process.

[you@somewhere ~]# ssh root@tel-collect
[root@tel-collect ~]# service mysqld stop

At this time, the ovms server may be powered on, and the MCSPU and TCS may be started.

3) Disable remote connectivity to the database by uncommenting the bind-address entry in the /etc/my.cnf file.
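Since this step has no example on this page, here is a minimal sketch of the edit on a scratch copy; the entry shown is an assumption, so match whatever bind-address line is actually in /etc/my.cnf, and on tel-collect edit /etc/my.cnf itself.

```shell
# Sketch of step 3 on a scratch copy of the config; the address shown is
# an assumption -- match the real bind-address line in /etc/my.cnf.
printf '[mysqld]\n#bind-address = 127.0.0.1\n' > /tmp/my.cnf
# Uncommenting the line makes mysqld bind only to the given address,
# which shuts out remote clients.
sed -i 's/^#[[:space:]]*bind-address/bind-address/' /tmp/my.cnf
grep '^bind-address' /tmp/my.cnf
# -> bind-address = 127.0.0.1
```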

4) Start the mysqld process.

[root@tel-collect ~]# service mysqld start

If this fails, try to start it with mysqld_safe.

[root@tel-collect ~]# mysqld_safe &

If this also fails, the mysql system tables have been damaged. I've never seen this happen, so I don't have a procedure for it. You'll need to consult the on-line MySQL documentation.

5) Connect to the metadata database with mysql, substituting the MySQL root password for ???.

[root@tel-collect ~]# mysql --user=root --password='???' --database=tel_metadata

6) We don't want to replicate the repairs on the slave, so disable the logging of our actions for this session.

mysql> SET sql_log_bin = 0;

7) Perform an extended check on the metadata tables, and repair any table whose last message is not a status of 'OK'.

mysql> CHECK TABLE component_children, components, enum_vals, measures, numeric_measures, streams, systems, telemeters, units EXTENDED;
+---------------------------------+-------+----------+-------------------------------------------------------+
| Table                           | Op    | Msg_type | Msg_text                                              |
+---------------------------------+-------+----------+-------------------------------------------------------+
| tel_metadata.component_children | check | warning  | 1 client is using or hasn't closed the table properly |
| tel_metadata.component_children | check | status   | OK                                                    |
| tel_metadata.components         | check | error    | Table 'tel_metadata.components' needs some TLC.       |
| tel_metadata.components         | check | error    | Corrupt                                               |
+---------------------------------+-------+----------+-------------------------------------------------------+

For this example, the components table needs to be repaired. Do the following to repair it.

mysql> REPAIR TABLE components EXTENDED;
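If there are many tables, picking the bad ones out of the CHECK TABLE output by eye gets tedious. Here is a hedged sketch of automating it: run the check in batch mode and filter with awk. The awk filter is my addition, not one of the dbadmin scripts, and it is demonstrated on canned output.

```shell
# Not run here: the batch-mode check, which emits tab-separated
# Table/Op/Msg_type/Msg_text rows:
#   mysql --user=root --password='???' --database=tel_metadata --batch \
#     -e 'CHECK TABLE components, streams EXTENDED' > /tmp/check.out
# The filter, demonstrated on canned batch output:
printf 'tel_metadata.components\tcheck\terror\tCorrupt\n' > /tmp/check.out
printf 'tel_metadata.streams\tcheck\tstatus\tOK\n' >> /tmp/check.out
# Print the name of every table whose Msg_type is 'error'.
awk -F'\t' '$3 == "error" {print $1}' /tmp/check.out
# -> tel_metadata.components
```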

If a metadata table repair fails, you are no longer on the happy path through this procedure. This hasn't failed since we moved to the band-aid telemetry solution, so I don't have a procedure for it. You'll need to consult the on-line MySQL documentation.

8) Once the metadata database is in order, disconnect from the database.

mysql> quit;

9) Now we need to determine which streams tables are broken. I've written a script for this. To check the streams tables, do the following.

[root@tel-collect ~]# rm -f nohup.out
[root@tel-collect ~]# nohup /local/dbadmin/bin/ &

This takes some time, so I use nohup in case the connection to the mountain is lost.

10) Once the script finishes, find all of the tables that have been identified as corrupted and try the quick repair procedure on them. I've written a script for the quick repair. To repair the streams tables, do the following.

[root@tel-collect ~]# grep Corrupt nohup.out | cut -f1 > bad_tables
[root@tel-collect ~]# /local/dbadmin/ < bad_tables > r1_results 2>&1 &
[root@tel-collect ~]# disown -h

This takes some time, so I disown the process in case I lose my connection to the mountain. (I can't figure out how to redirect input properly using nohup, so I use disown.)

11) When the quick repair script finishes, find all the tables that couldn't be repaired, and try the not-so-quick repair procedure on them. I've written a script for this type of repair as well. Do the following, but scan r1_results first to see if there is anything unusual.

[root@tel-collect ~]# grep error r1_results | cut -f1 | uniq > still_bad_tables
[root@tel-collect ~]# /local/dbadmin/ < still_bad_tables > r2_results 2>&1 &
[root@tel-collect ~]# disown -h

Depending on how many tables there are to repair, this could take quite a while.

12) When the not-so-quick repair script has finished, browse the contents of r2_results. Usually, all the tables will have been successfully repaired. If this is the case, skip to step 13). Otherwise, please go to MySQL's web page to learn more of what can be done. If all else fails, you can delete the table. The data has been replicated by the slave, so no data loss will occur. Also the collection library will notice the table is missing and recreate it. However, the deleted data will not be accessible to the users until the SAN migration is complete. To delete the data table, do the following.

[root@tel-collect ~]# mysql --user=root --password='???' --database=tel_streams -A
mysql> SET sql_log_bin = 0;
mysql> DROP TABLE ndead_beef; /* where ndead_beef is the name of the table to drop */
mysql> quit;

If this fails, use a bigger hammer: stop the mysqld process and delete the corresponding files, i.e.

[root@tel-collect ~]# service mysqld stop
[root@tel-collect ~]# rm -i /local/telemetry/tel_streams/ndead_beef.*
[root@tel-collect ~]# service mysqld start

13) If the tables have all been repaired, it's time to clean up and re-enable remote connections to the server. Enable remote connectivity to the database by commenting out the bind-address entry in the /etc/my.cnf file and restarting the mysqld process.

[root@tel-collect ~]# service mysqld restart
[root@tel-collect ~]# rm -f r2_results still_bad_tables r1_results bad_tables nohup.out

14) Almost finally, have other people restart the ovms-svc process on ovms, the MCSPU on jet, and the TCS subsystems.

15) A power failure corrupts the binary logs as well. You deal with these on the slave. When a slave replicates a master, it first copies the statements being replicated to a local copy, called a relay log. Then, in a separate thread, it executes the replicated statements. The first thread is called the 'IO thread', and the second is called the 'SQL thread'. A corrupted binary log can be caught during either or both of these operations. Before moving to the slave, note where the master is currently writing in the binary log.

[root@tel-collect ~]# mysql --user=root --password='???'
mysql> SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: bin_log.003255
        Position: 188973629
Binlog_Ignore_DB: mysql,test

It's just about to write to position 188973629 in the file bin_log.003255.
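If you'd rather capture these coordinates from a script than read them off the screen, something like the following works. The awk extraction is my addition, shown here against canned output rather than a live server.

```shell
# Not run here: the live query.
#   mysql --user=root --password='???' -e 'SHOW MASTER STATUS\G' > /tmp/master_status
# The extraction, demonstrated on canned \G-style output:
printf '            File: bin_log.003255\n        Position: 188973629\n' > /tmp/master_status
# Print the value (second whitespace-separated field) of the File and
# Position lines.
awk '/File:|Position:/ {print $2}' /tmp/master_status
# -> bin_log.003255
#    188973629
```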

16) Next, log in to tel1 as root, connect to the slave DBMS on tel1, and check the slave's status.

[you@somewhere ~]# ssh root@tel1
[root@tel1 ~]# mysql --user=root --password='???'
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                  Master_User: tel_replicator
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin_log.003255
          Read_Master_Log_Pos: 188973629
               Relay_Log_File: mysql-relay-bin.000911
                Relay_Log_Pos: 570143490
        Relay_Master_Log_File: bin_log.003255
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 188973629
              Relay_Log_Space: 188973629
              Until_Condition: None
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
               Last_SQL_Errno: 0

Notice that Relay_Master_Log_File is bin_log.003255, and Exec_Master_Log_Pos is 188973629. This is the position of the slave's SQL thread with respect to the master's log file. Since Last_IO_Errno and Last_SQL_Errno are both 0, no errors have been found. In this example, the slave is at the same position as the master, meaning the binary log was not corrupted. This is what you hope to see, but you won't. N.B. For the remaining examples, I'm going to show only the interesting fields of the SHOW SLAVE STATUS\G output to reduce the size of this wiki page.

Here's an example of what it would look like if the binary log corruption was detected in the IO thread.

*************************** 1. row ***************************
              Master_Log_File: bin_log.003254
          Read_Master_Log_Pos: 200000000
        Relay_Master_Log_File: bin_log.003253
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
          Exec_Master_Log_Pos: 887329
        Seconds_Behind_Master: NULL
                Last_IO_Errno: 666
                Last_IO_Error: Crap!  Some semi-sensical message describing an error.
               Last_SQL_Errno: 0

Notice that the Last_IO_Error field is not blank. The Master_Log_File and Read_Master_Log_Pos fields tell us that the IO thread found the corruption at position 200000000 in the master log file bin_log.003254. To solve this problem, we need to skip past this corruption. But first we need to let the SQL thread catch up to the IO thread. When Relay_Master_Log_File equals Master_Log_File and Exec_Master_Log_Pos equals Read_Master_Log_Pos, execute something like the following.

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO MASTER_LOG_FILE = 'bin_log.003255', MASTER_LOG_POS = 4;
mysql> START SLAVE;

In other words, change to the beginning of the next master log file. (Position 4 is the first usable position in a binary log file, just past the file header.)

It appears that this would cause us to lose more data, but it doesn't. First, the binary log is written before the tables are committed to disk, so the binary log is ahead of the tables. The corruption occurred in a statement whose effect could not be stored on the master's disk. Second, when the mysqld process was started on the master after the power failure, it created the next binary log file and started writing to it. This means the corruption caused by a power failure always occurs at the end of a binary log file.

Here's an example of what it would look like if the log corruption was detected in the SQL thread.

*************************** 1. row ***************************
              Master_Log_File: bin_log.003255
          Read_Master_Log_Pos: 188973629
        Relay_Master_Log_File: bin_log.003254
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
          Exec_Master_Log_Pos: 200000000
        Seconds_Behind_Master: NULL
                Last_IO_Errno: 0
               Last_SQL_Errno: 666
               Last_SQL_Error: Crap!  Some semi-sensical message describing an error.

Notice that the Last_SQL_Error field is not blank. To solve this problem, we again need to skip past the corruption. To skip a corrupted statement, do the following.

mysql> SET GLOBAL sql_slave_skip_counter = 1;
mysql> START SLAVE;

You may need to repeat this a couple of times.
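If several statements need skipping, the skip-and-restart can be driven by a loop that watches Slave_SQL_Running. This is only a sketch under the assumption that SHOW SLAVE STATUS is polled between skips; the grep test is demonstrated on canned output.

```shell
# Not run here: the live loop.
#   while mysql --user=root --password='???' -e 'SHOW SLAVE STATUS\G' \
#         | grep -q 'Slave_SQL_Running: No'; do
#     mysql --user=root --password='???' \
#       -e 'SET GLOBAL sql_slave_skip_counter = 1; START SLAVE'
#     sleep 5
#   done
# The stopped-thread test, demonstrated on canned status output:
printf 'Slave_IO_Running: Yes\nSlave_SQL_Running: No\n' > /tmp/slave_status
grep -q 'Slave_SQL_Running: No' /tmp/slave_status && echo 'SQL thread stopped'
# -> SQL thread stopped
```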

Congratulations! You've put Humpty Dumpty back together again.

-- TonyEdgin - 01 May 2012
-- KelleeSummers - 04 Dec 2012: Updated tel-store-mtn to tel1
Topic revision: r4 - 07 Mar 2013, KelleeSummers