2022-05-10 - switch bcu communication problem: test changing to which banks set_z0.pro writes (re-do with FAKESLOPE enabled and fixed IDL procedures)
- Software: Brandon
- AO: Doug
- Scripts used: AOeng@sxadsec:~/scripts/tests/bm.20220419.set-z0-banks (gist)
- Software versions used:
- sxadsec: stable (with housekeeper manually run from ~/soul/test.20220419.hk-partial-timeout)
- soul-sxwfs: stable
- Software version after test:
- sxadsec: stable
- soul-sxwfs: stable
Test: Change set_z0 banks
Since the Housekeeper tests succeeded last time, we can continue using the new Housekeeper, but we should keep an eye on the Housekeeper logs and telemetry in case of a SwitchBCU timeout, to verify that it continues to log telemetry for the other BCUs.
This test is the same as the previous set_z0 test, except that we want to:
- Make sure that slopecompctrl.L.FAKESLOPE.ENABLED.CUR=1 during the entire test. Last time, the SlopeCompCtrl process restarted before the test was run on the second day, so the non-gain-zero tests Doug was able to do didn't have it enabled. This means that although the same write operations were occurring, the ASM was not adding the NCPA coefficients to the mode vector.
- Run set_z0 on different banks. Last time, we ran into some trouble doing this because, apparently, adsec commands don't run well inside for loops in the idl_ctrl terminal GUI: they are queued to run only after the entire command has executed. This time I created a python script instead that uses asm.ex to cd to the script's path and run the modified set_z0 procedure. I've tested this in a VM and it should work.
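One way to guard against the FAKESLOPE problem described above is to check the flag before every update and stop as soon as it is no longer 1. The following is only a sketch: `read_flag` and `send_update` are hypothetical callables standing in for whatever actually reads slopecompctrl.L.FAKESLOPE.ENABLED.CUR and sends the set_z0 command; neither is part of the real scripts.

```python
def run_guarded(count, read_flag, send_update):
    """Send up to `count` updates, stopping early if the flag is not 1.

    read_flag and send_update are hypothetical callables (assumptions for
    this sketch); the real test would have to supply its own.
    Returns the number of updates actually sent.
    """
    for sent in range(count):
        if read_flag() != 1:
            # The flag dropped (e.g. the process restarted): stop so that
            # no further updates are sent without FAKESLOPE enabled.
            return sent
        send_update()
    return count
```

If the returned number is smaller than `count`, the run was cut short and the remaining iterations should be repeated once the flag is set again.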
There is a hypothesis that the switch BCU timeouts occur when we are writing to the reconstructor matrix in the same memory that is currently being read by the AdSec, causing some sort of race condition. There are two areas of memory for the reconstructor matrix, banks A and B, presumably designed so that you can write to one while the other is being read and then swap which bank is being written to/read from after you write new values. $ADOPT_ROOT/idl/adsec_lib/adsec_utilities/set_z0.pro currently writes to both banks A and B. If it is the case that writing to Bank A while it is being read is causing the switch BCU timeouts, we could implement bank swapping, which is already used by LN, to prevent it. Previous switch BCU timeouts have happened while sending NCPA updates with the set_z0 command and, at 900Hz, took less than one hour to occur.
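The bank-swapping idea can be illustrated with a generic double-buffering sketch. This is not the AdSec or LN implementation; the class and method names are made up for illustration, and it only shows the principle that writes go to the bank that is not being read, followed by a swap.

```python
class DoubleBufferedMatrix:
    """Generic two-bank buffer (illustrative only, not the AdSec code)."""

    def __init__(self, initial):
        # Both banks start with the same contents; bank A is read first.
        self.banks = {"A": list(initial), "B": list(initial)}
        self.active = "A"

    def inactive(self):
        """Name of the bank that is NOT currently being read."""
        return "B" if self.active == "A" else "A"

    def read(self):
        """The reader only ever looks at the active bank."""
        return self.banks[self.active]

    def update(self, values):
        """Write new values to the inactive bank, then swap banks."""
        target = self.inactive()
        self.banks[target] = list(values)
        self.active = target
```

With this scheme the bank being read is never the one being written, which is the property the race-condition hypothesis says is currently missing.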
To test whether this is the case, this test will change set_z0.pro to write to either only Bank A or only Bank B. Bank B is not currently being used, so writing to it will not update the reconstructor matrix. However, if writing only to Bank B still fails, that will tell us that the timeouts are not caused by a race condition. If we test writing only to Bank A and it does not fail, then maybe the problem has something to do with data throughput from writing to both banks, or something else.
The plan is to start at a higher, but safe, rate of 1200Hz. If it is the case that the SBCU timeout is occurring with some probability each time the reconstructor matrix is read, this might make the errors occur sooner than running at 900Hz.
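As a rough check of that reasoning: if each read fails independently with a fixed probability p (an assumption, not something measured), the expected number of reads before the first failure is 1/p, so the expected time to failure scales inversely with the loop rate. The helper below is a hypothetical illustration; the one-hour reference value is only the upper bound observed at 900Hz.

```python
def scaled_time_to_failure(t_ref_s, f_ref_hz, f_new_hz):
    """Expected time to first failure at a new loop rate.

    Assumes a fixed, independent failure probability per read, so the
    expected time is proportional to 1 / rate. This is a simplification.
    """
    return t_ref_s * f_ref_hz / f_new_hz


# One hour at 900 Hz corresponds to 45 minutes at 1200 Hz under this model.
print(scaled_time_to_failure(3600, 900, 1200) / 60, "minutes")
```

Under this model, a test at 1200Hz that runs well past 45 minutes without a timeout is at least as informative as an hour at 900Hz.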
The test will be run using the ARGOS calibration source, with optimize gain, gopt, and daytime disturbances enabled.
To test sending set_z0 with different banks, in a normal terminal:
python run_ncpa.py --count 720 --delay 5 --bank 0 # 0 for both, 1 for A, 2 for B
The run_ncpa.py script loops a number of times determined by "--count", calling a custom set_z0 routine that is in the script's local directory which either writes to Bank A, Bank B, or both banks. The custom routine is the exact same as set_z0.pro, except that it has an optional BANK argument.
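The loop described above could look roughly like the sketch below. It is not the actual run_ncpa.py: the option names and bank codes come from the command shown above, but `send_set_z0` is a hypothetical callable standing in for the real call that goes through asm.ex, and all function names are made up for illustration.

```python
import argparse
import time

# Bank codes as given in the command-line comment above.
BANK_NAMES = {0: "both", 1: "A", 2: "B"}


def parse_args(argv=None):
    """Parse --count, --delay and --bank (all required in this sketch)."""
    parser = argparse.ArgumentParser(description="Repeatedly send set_z0.")
    parser.add_argument("--count", type=int, required=True,
                        help="number of set_z0 calls")
    parser.add_argument("--delay", type=float, required=True,
                        help="seconds to wait between calls")
    parser.add_argument("--bank", type=int, required=True,
                        choices=sorted(BANK_NAMES),
                        help="0 for both, 1 for A, 2 for B")
    return parser.parse_args(argv)


def run(args, send_set_z0, sleep=time.sleep):
    """Call send_set_z0(bank) args.count times, sleeping in between.

    send_set_z0 is a hypothetical stand-in for the real asm.ex call.
    """
    for i in range(args.count):
        send_set_z0(args.bank)
        if i < args.count - 1:
            sleep(args.delay)


def min_duration_s(count, delay):
    """Lower bound on run time: ignores how long each call itself takes."""
    return count * delay
```

For the command above, `min_duration_s(720, 5)` gives 3600 seconds, so each bank setting needs at least an hour if no timeout occurs.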
This test might take up to 3 hours, but will probably take much less time if any of the tests cause switch BCU timeouts. The steps for the test plan are in the 2022-05-10 set_z0 test spreadsheet
A log of the test is here: 20220511_DX_Day_2
Doug completed the test. Running the script with both banks, with delay=5s and with delay=3s, didn't result in any switch BCU timeouts.
- 14 Apr 2022