2022-03-29 - Test new housekeeper and test changing to which banks set_z0 writes (Switch BCU communication problem)
Software testing queue
AOeng@sxadsec:~/scripts/tests/bm.20220329.set-z0-banks/run_ncpa.pro, ~/scripts/tests/bm.20220329.hk-partial-timeout/*.sh
Note: The exact location of the scripts might change, because I name the directory after the date the test is performed and usually update the name the day before or the day of the test. To find them, cd into the parent directory and look for set-z0-banks. I will do my best to update the wiki if the location changes. - BM
Software versions used:
sxadsec: stable, custom set_z0 routines located in ~/scripts/tests/bm.20220329.set-z0-banks, test.20220329.hk-partial-timeout.
Software version after test:
Test: New Housekeeper
Servers and software versions used:
soul-sxwfs: stable (no change)
To get more information about the state of the AdSec when the switch BCU times out, an ATT task was to modify the Housekeeper so that it would continue to log telemetry from other BCUs if the switch BCU failed. Previously, the Housekeeper sent all requests for data in one large function and, if any one command failed, it would abort, wait 1s, and try to send all the commands again before logging the telemetry. This meant that if any of the switch BCU commands failed, even if commands to other BCUs succeeded, none of the data would get logged to telemetry until the switch BCU started responding again.
The new housekeeper handles errors for each BCU request individually. If a request fails, it marks all the diagnostic variables affected by that request as unfit for logging. When telemetry is logged, a line is skipped if it contains any diagnostic variable that has this flag. The Housekeeper then attempts to request the data again on its next cycle (in 1s).
The new housekeeper has been tested in the past, but it had some memory allocation bugs that caused a segmentation fault as soon as it started. I fixed these, so it's ready to test again. We need to test that the new housekeeper is producing all the correct telemetry. Once we've determined that it's operating normally, we'll later need to test how it responds in the event of a switch BCU timeout. On sxadsec, we will run the system in closed loop as normal, with the new housekeeper, and verify that all the Housekeeper telemetry variables are being reported and look appropriate.
- Tail the Housekeeper logs and telemetry:
tail -f $ADOPT_LOG/`date +%Y/%m/%d`/housekeeper*
- In one terminal, print unique telemetry identifiers in housekeeper.L_* for a past date (or before switching versions):
cd $ADOPT_LOG/`date +%Y/%m`/[past-DD]
sh ~/scripts/tests/bm.20220329.hk-partial-timeout/cat_unique_telemetry.sh > vars.txt
- In another terminal, tail current logs and print all unique telemetry identifiers encountered:
cd $ADOPT_LOG/`date +%Y/%m/%d`
sh ~/scripts/tests/bm.20220329.hk-partial-timeout/tail_unique_telemetry.sh # prints names to standard output
- Verify that every telemetry identifier saved in vars.txt is also being printed by tail_unique_telemetry.sh, with the possible exception of those in all caps, which are warnings and errors.
- Verify that all the values look right (i.e. that the right number of values are being reported and that they aren't always NaN, Inf, or 0 when they shouldn't be).
while read -r var; do printf '\n%s\n' "$var"; grep -hF -- "$var" housekeeper.L_TELEMETRY* | tail -n 2; done < vars.txt
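The identifier comparison can be done by hand, but a small helper makes it harder to miss a name. The following is a minimal sketch, assuming both inputs are plain text files with one identifier per line: vars.txt from the past date, and a capture of the live identifiers saved to a file (live_vars.txt is a hypothetical name, not something the test scripts create). The function name missing_vars is also made up for illustration.

```shell
#!/bin/sh
# Sketch: list identifiers present in a reference file but absent from a
# second file. Both files are assumed to hold one identifier per line.
missing_vars() {
    ref_sorted=$(mktemp) || return 1
    cur_sorted=$(mktemp) || return 1
    # comm needs sorted input; LC_ALL=C keeps sort and comm in agreement.
    LC_ALL=C sort -u "$1" > "$ref_sorted"
    LC_ALL=C sort -u "$2" > "$cur_sorted"
    # -2 hides lines found only in the second file and -3 hides lines found
    # in both, leaving the lines that appear only in the reference file.
    LC_ALL=C comm -23 "$ref_sorted" "$cur_sorted"
    rm -f "$ref_sorted" "$cur_sorted"
}

# Example (live_vars.txt is a hypothetical capture of the live output):
#   missing_vars vars.txt live_vars.txt
```

Any name it prints was logged on the past date but has not yet been seen in the current logs. All-caps names in the output can be ignored, since those are warnings and errors.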
If the new Housekeeper is not saving telemetry properly or crashes, we can use the stable housekeeper. Otherwise, we can use the new housekeeper for the following set_z0 test.
Test: Change set_z0 banks
There is a hypothesis that the switch BCU timeouts occur when we write to the reconstructor matrix in the same memory that is currently being read by the AdSec, causing some sort of race condition. $ADOPT_ROOT/idl/adsec_lib/adsec_utilities/set_z0.pro currently writes to both banks A and B. If writing to Bank A while it is being read is what causes the switch BCU timeouts, we could implement bank swapping in the manner of LN to prevent it. Previous switch BCU timeouts have happened while sending NCPA updates with the set_z0 command and, at 900Hz, took less than one hour to occur.
To test whether this is the case, this test will change set_z0.pro to write to either only Bank A or only Bank B. Bank B is not currently being used, so writing to it will not update the reconstructor matrix; if timeouts still occur while writing only to Bank B, the cause is not a race condition on the bank being read. If writing only to Bank A does not fail, then the problem may instead be related to the data throughput from writing to both banks, or to something else.
The plan is to start at a higher, but safe, rate of 1200Hz. If the SBCU timeout occurs with some probability each time the reconstructor matrix is read, this might make the errors occur sooner than running at 900Hz.
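As a rough illustration of why the higher rate should help, suppose (purely as an assumption) that the matrix is read once per loop frame and that each read independently triggers a timeout with a small probability p. Then the number of reads before the first timeout is geometric and the expected time to failure at loop rate f scales as 1/f:

```latex
\begin{aligned}
P(\text{first failure on read } n) &= (1-p)^{n-1}\,p, \qquad E[n] = \frac{1}{p} \\
E[T_f] &= \frac{E[n]}{f} = \frac{1}{p\,f} \\
\frac{E[T_{1200}]}{E[T_{900}]} &= \frac{900}{1200} = 0.75
\end{aligned}
```

Under this simple model, going from 900Hz to 1200Hz shortens the expected wait for a timeout by a quarter. If the failures depend on something other than the number of reads, the scaling will not hold.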
The test will be run with Gain=0. For these tests, we should use the AOS Command GUI to close the AO loop and then set Gain=0. Optimize Gain and Gopt are not needed.
Of course, this setup is not the same as we have on-sky, in Full AO Closed Loop (Gain not equal to 0, GOPT On). Is there a concern that Gain=0 and no GOPT will change the outcome of this test (i.e. no errors)? - DM
To test sending set_z0 with different banks, in the adsceng IDL terminal:
run_ncpa,count=720,delay=5,bank=0 ; 0 for both, 1 for A, 2 for B
The run_ncpa script loops a number of times determined by "count", calling custom set_z0 routines in the script's local directory that write to either Bank A or Bank B. If both Bank A and Bank B are selected, it uses the ordinary set_z0 routine from the stable distribution. The custom routines are exactly the same as set_z0.pro, except that the write messages for the other bank are commented out.
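For reference, the count and delay values used here appear to be chosen to fill one hour. Assuming that delay is given in seconds and that the time spent inside each set_z0 call is small compared to the delay (neither is confirmed by the script description above), the run length works out as:

```latex
T_{\text{run}} \approx \text{count} \times \text{delay} = 720 \times 5\,\text{s} = 3600\,\text{s} = 1\,\text{h}
```

If each set_z0 call takes a noticeable amount of time, the actual run will be somewhat longer than this.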
Test (might take up to 3 hours, likely much less):
To watch for timeouts:
tail -f $ADOPT_LOG/`date +%Y/%m/%d`/mirror* | grep -i timeout
To watch for "wrong filotto" errors, which might be occurring at the same cadence as set_z0 commands (see IT 8601):
tail -f $ADOPT_LOG/`date +%Y/%m/%d`/master* | grep -i wrong
Each test will be run in closed loop at 1200Hz and will send NCPA updates to a different set of banks. Based on how long previous timeouts have taken to occur, we can run each test for one hour to determine whether there will be a timeout. If there is no timeout in one hour, we will assume that there is no problem.
TODO: Is one hour an appropriate amount of time to wait before concluding there won't be any timeouts? Or at least to suggest we might try a longer duration stress test later? - BM
TODO: A previous test of set_z0 at 1200 Hz caused an error in about 35 minutes. An hour each for the Bank B and Bank A tests should be enough time. If and when we get a Switch BCU error in test 2 (Bank A only) or in test 3 (Bank A+B), this will give us an indication of whether 1 hour with no errors is long enough. For the Bank A+B test we should run until an error occurs or telescope handover arrives. - DM
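One way to put a rough number on the one-hour question is to treat the timeouts as a random process with a constant rate. This is only a sketch: it assumes the timeouts are memoryless (exponentially distributed waiting times) and takes the single 35-minute observation at 1200 Hz as the mean time to failure, which is a weak estimate from one data point.

```latex
\begin{aligned}
P(\text{no timeout in } T) &= e^{-T/\tau}, \qquad \tau \approx 35\,\text{min} \\
P(\text{no timeout in } 60\,\text{min}) &= e^{-60/35} \approx 0.18 \\
P(\text{no timeout in } 180\,\text{min}) &= e^{-180/35} \approx 0.006
\end{aligned}
```

Under these assumptions, a clean one-hour run would still happen roughly one time in five even if the problem were present, so a single clean hour is suggestive rather than conclusive. A longer stress test would be needed to rule the problem out with more confidence.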
- Write using both banks as a baseline to verify that the problem is occurring and to measure how long it takes to show up at 1200Hz.
run_ncpa, count=720, delay=5, bank=0
TODO: Have we decided to do this "control" test or not? - BM
TODO: Maybe we perform the Bank B only test first. If no errors then perform the Bank A only test. If error, then done. If no error, then run Bank A+B test. - DM
- Write using only Bank B.
run_ncpa, count=720, delay=5, bank=2
- Write using only Bank A.
run_ncpa, count=720, delay=5, bank=1
- 22 Feb 2022