upcoming - adsec - housekeeper: install new housekeeper that continues logging telemetry after a switch BCU timeout

  • Software versions used:
    • dxadsec: latest
    • soul-sxwfs: latest
  • Software version after test:
    • sxadsec: latest
    • soul-sxwfs: latest

Background

To better debug the switch BCU timeouts and overcurrent events, we discussed modifying the Housekeeper to continue logging telemetry from the DSPs while the switch BCU is timing out (see ao-supervisor issue #43 on GitHub). Currently, the switch BCU telemetry variables are read first, and if one of those reads times out, no further telemetry variables are read.

This update modifies the Housekeeper so that it handles timeouts and other exceptions for reading switch BCU variables separately from DSP variables. If there is a timeout reading any variable, the Housekeeper skips writing the telemetry line that includes that variable and tries again on the next iteration, 1 second later. If there is a timeout reading from the switch BCU, the Housekeeper also skips all remaining switch BCU variables for that iteration, so that it doesn't get stuck timing out on every subsequent read command.
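The per-iteration behavior described above can be sketched roughly as follows. This is only an illustration and not the actual ao-supervisor code: `ReadTimeout`, `run_iteration`, `read_var`, and the `"switch_bcu"` / `"dsp"` source labels are hypothetical names chosen for the example, and only the timeout case is shown (the real Housekeeper also handles other exceptions).

```python
from typing import Callable, Dict, List, Sequence, Tuple


class ReadTimeout(Exception):
    """Hypothetical stand-in for a timed-out variable read."""


# One telemetry line: (source, variable names); source is "switch_bcu" or "dsp".
TelemetryLine = Tuple[str, Sequence[str]]


def run_iteration(
    lines: Sequence[TelemetryLine],
    read_var: Callable[[str, str], float],
) -> List[Dict[str, float]]:
    """Read one iteration of telemetry and return the lines that can be written."""
    written: List[Dict[str, float]] = []
    switch_bcu_timed_out = False

    for source, names in lines:
        # After one switch BCU timeout, skip its remaining lines this iteration.
        if source == "switch_bcu" and switch_bcu_timed_out:
            continue
        try:
            values = {name: read_var(source, name) for name in names}
        except ReadTimeout:
            # Drop this line only; it is attempted again on the next iteration.
            if source == "switch_bcu":
                switch_bcu_timed_out = True
            continue
        written.append(values)

    return written
```

In this sketch a DSP timeout drops only the affected line, while a switch BCU timeout additionally suppresses further switch BCU reads until the caller starts the next iteration.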

Doug and I tested the new Housekeeper during the daytime on 2022-03-29. Since then, after reviewing the code with Xianyu, I made some minor changes and one major change: if there is a switch BCU timeout, the Housekeeper now stops reading from the switch BCU until the next iteration, as mentioned above. This doesn't affect the results of the previous daytime test, and I tested the change in a VM by simulating a switch BCU timeout return code. I therefore believe the Housekeeper is ready for checkout during an engineering night, after which we can keep it running and monitor it in the event of a switch BCU timeout.

Code changed

The pull request for the new housekeeper (#65) has a list of the code that has changed.

Checkout procedure during an engineering night

Test checklist spreadsheet.

Pending discussion at an AO SW meeting, we can merge the new Housekeeper pull request and, during an engineering night, switch to a new version of the ao-supervisor that includes it. To be sure, we can repeat the test scripts from the daytime test on 2022-03-29, which check that all the correct telemetry variables are being written, and we can double-check that the reported values make sense (see the results output on the last two pages of the test log). To keep the ao-supervisor software as similar as possible between both sides, we can update both sxadsec and dxadsec.

If there are no problems, we can leave the "latest" version running and I will upgrade and switch back to "stable" the next morning. I will also update the source directories on the WFS machines but won't bother rebuilding them, as this only affects the adsec.

If there are problems, run the following on each machine that needs to be reverted (sxadsec, dxadsec, or both):

adsc_stop && use_soul stable && adsc_start && adsc_check

Results

-- BrandonMechtley - 21 Jun 2022
Topic revision: r1 - 22 Jun 2022, BrandonMechtley