AGw Off Axis Control Housekeeping Software Upgrade

The Engineering group has designed a board which will replace the one-wire file system devices. It will also talk serially to the Moxa (presumably through the fourth port), so oacontrol must be modified on a unit-by-unit basis to use the new board. Not all units will be converted at the same time, so oacontrol must support both systems.

The new system has the remote power switch and more temperature sensors than before, but no humidity sensor. The board will automatically turn off power if temperature thresholds are exceeded.

The board is on whenever the AGw is plugged in. It reads out temperatures and voltages, can power the AGw on and off, and will power down the AGw if a sensor exceeds an Alarm threshold. The board has a serial interface, which we will use through the Moxa terminal server in place of the current one-wire system.

Ten values are read out (3 voltages and 7 temperatures), along with whether an Alarm or Warning threshold has been reached and the state of the power and flow switches.

Originally, we expected one AGw to be upgraded every year, with the PFUs being done last. However, the engineering group decided to upgrade units 1 through 4 before the 2018 summer shutdown, and they will never upgrade the PFUs. This means the OAC must support the onewire system for the PFUs forever.

Software Status as of Summer Shutdown 2018

OAC version 5.0
  • Board is integrated into the application. If the config file has an IP for hkip, that is used instead of the onewire system (which is in the config as moxaip)
  • onewire and new HK have been decoupled from the getdata RPC program
    • getdata -uN -n command returns the voltages/temps from the new board and whether anything is in an alarm state
  • the 8051 data and onewire data are both written to HDF5 telemetry
  • stopAGW and startAGW commands are integrated with the new board. With the new board the behavior of OAC is different: OAC checks whether the power-up command complained and lets the user know if the unit didn't power up
  • getthresholds -uN command displays the current warning and alarm threshold values
  • setthresholds -uN -a "42.200,41.300,42.550,40.110,41.100,43.800,49.300,1.500,1.500,1.200" will set the alarm thresholds
  • setthresholds -uN -w "38.000,40.500,41.440,39.500,39.800,41.100,48.800" will set the warning thresholds
  • EPICS channel writes are built in to set the temp and voltage values, and to raise alarms if the AGW is OFF but was not commanded OFF
  • installed on the mountain, but not released
  • TODO: add the UMAC telemetry that Mike wants (not part of HK)
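
The two setthresholds examples above carry different numbers of values: 10 for -a and 7 for -w. A minimal sketch of building and checking such strings before passing them to setthresholds might look like the following. It is written in Python for brevity (the real OAC clients are C), the function names are hypothetical, and the ordering of 7 temperatures followed by 3 voltages is inferred from the example values rather than confirmed.

```python
ALARM_COUNT = 10    # assumption: 7 temperatures followed by 3 voltages (-a example)
WARNING_COUNT = 7   # assumption: 7 temperatures only (-w example)


def format_thresholds(values, expected_count):
    """Return the comma-separated form with 3 decimals, e.g. '42.200,41.300'."""
    if len(values) != expected_count:
        raise ValueError(
            f"expected {expected_count} thresholds, got {len(values)}")
    return ",".join(f"{float(v):.3f}" for v in values)


def parse_thresholds(text, expected_count):
    """Parse a comma-separated threshold string back into a list of floats."""
    fields = text.strip().split(",")
    if len(fields) != expected_count:
        raise ValueError(
            f"expected {expected_count} thresholds, got {len(fields)}")
    return [float(field) for field in fields]
```

Checking the count up front catches the easy mistake of passing a 7-value warning list where the 10-value alarm list is expected.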

GCS
  • Updates are in git in the 2018A and master branches.
  • Uses getowdata or getstatus calls to OAC depending on configurable flag newHKEnabled in the PUBLIC configuration files.
    The new HK option should probably be in the AGWn.cfg files instead of by instrument. But, the AGWUnit classes are by instrument, so that's where it's logical in the code.
  • Deleted redundant telemetry (temp, ambient, humidity) from the agwunit telemetry stream since it is recorded for all connected AGWs in OAC now, not just for the authorized AGW like GCS would do.

Issues

  • Are there still some race conditions in the OAC? In this example, I did a startAGW, which gave me these log messages immediately:
    2018-08-07 19:43:46.460 oac.tucson.lbto.org oacserver[3859]: LOG_INFO    START AGW4                                      
    2018-08-07 19:43:46.957 oac.tucson.lbto.org oacserver[3859]: LOG_ERR     AGW4 is OFF, but not commanded OFF 
    But, the command timed out:
    [ksummers@remotehost-1 gcs]$ ~/oac/trunk/bin/startAGW -u4
    call failed: RPC: Timed out
    /home/ksummers/oac/trunk/bin/startAGW: init() failed AGW4: RPC Error on oac.tucson.lbto.org
    
    I didn't do a getdata before just starting it again - I'll bet it had started.

  • Got a UMAC error when it really sent the command (saw "HOMING" in the OAC log):
    [ksummers@remotehost-1 gcs]$ ~/oac/trunk/bin/home -u4 -m21
    /home/ksummers/oac/trunk/bin/home: home() failed: AGW4: UMAC error on oac.tucson.lbto.org
    I didn't change the behavior of this command, so this is old OAC behavior.

Design/Implementation Questions

  1. How will we flag which AGws have been upgraded?
    Is it too hokey to use moxaip for onewire devices and hkip for new HK board devices? And make it 0.0.0.0 for the address not being used.
  2. Since the sensor data is different (and there is no humidity sensor), should we change the format of the getData command, or add a new command? Either way the GCS will have to be changed.
    Separate command (getowdata vs getstatus).
  3. How do we define the Alarm and Warning thresholds in a configuration file? If we use oacontrol.conf we have to modify the parser. Maybe use a new, simpler configuration file?
    No - it's easy to add to the config file, keep the same format.
  4. Should the new board refuse to power on if the Alarm thresholds have not been set?
    Not a software issue, Dan/Mike can determine that.
  5. Should it clear the alarms on a power off?
    Not a software issue, Dan/Mike can determine that.
  6. Should we run two oacontrol servers, one using OWFS and one using the new interface? The existing server can be told to ignore particular units by specifying an IP address of 0.0.0.0. So the new server could be written just for the new interface, and not have to support both. The GCS would need to be modified to talk to the correct server (use different port numbers).
    But, it's trivial to add another thread to the existing oacserver for HK and the onewire thread will just ignore it since it will have a bogus IP.
  7. The board will stay on even if oacontrol goes away; does that change how the start/stop AGW commands are used?
  8. CB mentioned that Mike wants UMAC telemetry like what we've got in OSS. But I don't think we should be putting a bunch more querying in there. Here's what we get so far; is it sufficient to put that into telemetry? (available in the getdata):
    general state:     0x201
    general errors:        0
    THETA state:           0
    THETA errors:          0
    R state:               0
    R errors:              0
    FOC state:           0x3
    FOC errors:            0
    FIL state:           0x3
    FIL errors:            0
    Home Status:        0x14
    Power Status:          1 

    No - it's not sufficient. He wants us to query for the rest of the data. From Mike 20171220:
    No we want much more telemetry concerning the motion control system very similar to what we do for the M3's. If you look at the M3 telemetry this will give you an idea of what we want. You should see things like following error, motor commanded position, actual motor position, amp fault, etc, etc for each axis.
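
The answers to questions 1 and 6 above come down to a per-unit choice based on which address is real. A hedged sketch of that choice is shown below in Python (oacserver itself is C). The function name and the dict layout are hypothetical, and how oacontrol.conf is actually parsed is outside the sketch; only the hkip/moxaip names and the 0.0.0.0 convention come from these notes.

```python
UNUSED_IP = "0.0.0.0"   # per the notes: marks the address that is not in use


def select_backend(unit_config):
    """Return 'hk', 'onewire', or None for one unit's settings.

    unit_config is a plain dict with optional 'hkip' and 'moxaip' entries
    (hypothetical layout, for illustration only).
    """
    hkip = unit_config.get("hkip") or UNUSED_IP
    moxaip = unit_config.get("moxaip") or UNUSED_IP
    if hkip != UNUSED_IP:
        return "hk"        # an hkip is used instead of the onewire system
    if moxaip != UNUSED_IP:
        return "onewire"
    return None            # both threads ignore this unit
```

With this rule the onewire thread and the HK thread can share one server: each simply skips units whose address for that system is 0.0.0.0.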

OAC Details (mostly from CB's notes)

The original AIP AGW units used the one-wire file system (OWFS) to control the remote power switch, and to read out temperature and humidity sensors. The one-wire file system maps Linux files to serial devices connected to a Moxa terminal server. The OWFS communicates with the fourth port on the Moxa.

There are four one-wire devices:

# UMAC temperature
umac_1w_tempsensor=/lbt/oacontrol/dev/agw4/uncached/28.740B76000000

# ambient temperature
ambient_1w_tempsensor=/lbt/oacontrol/dev/agw4/uncached/28.033C42000000

# humidity (represented by a voltage sensor, humidity is calculated
# from that voltage internally):
umac_1w_DAC=/lbt/oacontrol/dev/agw4/uncached/20.7B6A03000000

# PIO for switching on and off UMAC, Camera Controllers and Camera
# Controller PC and for controlling flow of cooling liquid
umac_1w_PIO=/lbt/oacontrol/dev/agw4/uncached/29.9BAA00000000 

There is a thread that reads out the one-wire sensors (startGetOnewireData). We need to replace this thread with one that reads out the new sensors.

Each sensor has two thresholds: Alarm upper and Warning upper - these will be incremental values above ambient. Dan has decided the application only cares about high temps, and maybe not even warnings.
The command to set the alarm thresholds will be SETAL ,,,,,25,, and there will likely be a GETAL command as well.

The remote power switch is used in the module src/init/init_server.c. Currently, to turn the power on, it writes a 1 to the PIO file; to turn it off, it writes a 0.

The temperature and humidity readings are done in the module src/onewire/onewire.c.

The humidity sensor is read out in the function readoutDS2450(). This function will no longer be needed (nor will manageDS2450()).
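
The existing power switching amounts to writing a single character to a file under the OWFS mount. A minimal sketch of that idea follows, in Python rather than the C of init_server.c. The function name is hypothetical, and the exact file inside the PIO device directory is not given in these notes, so the sketch takes the full file path as a parameter.

```python
def set_power(pio_path, on):
    """Write '1' (power on) or '0' (power off) to the PIO file at pio_path.

    pio_path must point at the writable PIO file itself; which file that is
    under the OWFS device directory is an assumption left to the caller.
    """
    with open(pio_path, "w") as pio:
        pio.write("1" if on else "0")
```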

The getAGWData module must be modified to use a different data structure (defined in src/agwcomm.h). The getdata client and the GCS both use the getAGWData information. This is a problem, since modifying the GCS introduces complications.

Since the src/onewire/onewire.c module runs a looping thread, it is probably simpler to just keep it the way it is (the name could be changed if desired), and read out the power state and all the new temperatures. The new readout commands will need a network send/receive system, with the usual complications of closed sockets, no connection, etc., although Dan is supposed to have example code that works.

There will be new command(s) to set the temperature thresholds that control automatic power down. Adding a new command is a chore; at least the following must be modified (Linux RPC is used):

   src/oacontrol.c
   src/oacserver.c
   src/"newcommand" 

If it is necessary to add new items in oacontrol.conf, the parser (src/parser/) must be modified.

The Engineering group would like TCS telemetry written for the UMAC. See the OSS for the kind of UMAC information desired. Since the TCS telemetry collection library is in C++, a C++ wrapper needs to be provided. The telemetry could be added at a later time.

We also need to provide input to the Alarm Handler.

Housekeeping Application on the 8051 Board

The application knows whether OAC is connected or not, via the Moxa. There's no need for a heartbeat - when oacserver disconnects, the application will know.

None of this status changes very fast. However, if the board has a problem, it will reboot itself. The reboot is fast: oacserver could actually be in between STAT polls when it happens. The board should know why it rebooted. Dan is thinking about how to tell us that.

Here's how to talk to it via the telnet interface:
[oac@oac ~]$ telnet 150.135.245.95 950
Trying 150.135.245.95...
Connected to 150.135.245.95.
Escape character is '^]'.
STAT
OFF,OFF,+25.996,+25.801,+0.000,+0.000,+0.000,+0.000,+28.145,+0.000,+0.000,+11.521
STAT
OFF,OFF,+25.801,+25.606,+0.000,+0.000,+0.000,+0.000,+28.048,+0.000,+0.000,+11.507
POWERON
OK
STAT
ON,OFF,+25.996,+25.606,+0.000,+0.000,+0.000,+0.000,+28.048,+0.000,+0.000,+11.517
STAT
ON,OFF,+26.094,+25.703,+0.000,+0.000,+0.000,+0.000,+28.048,+0.000,+0.000,+11.524
REBOOT

AGW boot loader .....
STAT
OFF,OFF,+25.899,+25.410,+0.000,+0.000,+0.000,+0.000,+27.950,+0.000,+0.000,+11.531

AGW boot loader .....


As of 20171116, Dan added the alarm and warning status info:
STAT
OFF,OFF,A000,W000,+28.634,+28.732,+0.000,+0.000,+0.000,+0.000,+30.685,+0.000,+0.000,+11.510
The "A" and "W" strings will have to be parsed to see which temperatures or voltages are potentially in an alarm state.
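
A minimal sketch of splitting the 20171116-style STAT reply into fields might look like this. It is Python for brevity (oacserver is C), and all names are hypothetical. The assumptions are: the first field is the power state (it changed to ON after POWERON in the session above), the second is the flow switch, and the 10 numbers are 7 temperatures followed by 3 voltages, matching the getstatus output. The encoding of the three characters after "A" and "W" is not described in these notes, so they are kept as raw text rather than decoded.

```python
from dataclasses import dataclass
from typing import List, Optional

N_TEMPS = 7   # assumption: temperatures come first
N_VOLTS = 3   # assumption: voltages come last


@dataclass
class HkStatus:
    power_on: bool
    flow_on: bool        # assumption: second field is the flow switch
    alarm_code: str      # raw text after 'A', e.g. '000' (encoding unknown)
    warning_code: str    # raw text after 'W'
    temps: List[float]
    volts: List[float]


def parse_stat(line: str) -> Optional[HkStatus]:
    """Parse one STAT reply; return None for anything else
    (OK, ERR, the boot loader banner, or the older 12-field format)."""
    fields = line.strip().split(",")
    if len(fields) != 4 + N_TEMPS + N_VOLTS:
        return None
    power, flow, alarm, warning = fields[:4]
    if power not in ("ON", "OFF") or flow not in ("ON", "OFF"):
        return None
    if not (alarm.startswith("A") and warning.startswith("W")):
        return None
    try:
        values = [float(v) for v in fields[4:]]   # float() accepts '+28.634'
    except ValueError:
        return None
    return HkStatus(power == "ON", flow == "ON", alarm[1:], warning[1:],
                    values[:N_TEMPS], values[N_TEMPS:])
```

Returning None for unrecognized lines lets the caller skip the "AGW boot loader" banner that shows up after a reboot instead of treating it as a parse failure.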


As of 20180213, the threshold values for alarms can be read (warnings are ignored right now); can they also be set?
OVERRIDE has to be turned on to get info when the unit is not powered up. This is because in real life the board will not allow power-on if the flow is not on.
[oac@oac oac-new]$ telnet 150.135.245.95 950
Trying 150.135.245.95...
^C
[oac@oac oac-new]$ telnet 150.135.245.95 950
Trying 150.135.245.95...
Connected to 150.135.245.95.
Escape character is '^]'.
STAT
OFF,OFF,A000,W000,+23.750,+23.554,+0.000,+0.000,+0.000,+0.000,+23.848,+0.000,+0.000,+10.829
GETAL
29.300,29.300,0.000,0.000,0.000,0.000,29.300,1.500,1.500,1.200
ERR
OVERRIDE ON
OVERRIDE ON
GETAL
29.300,29.300,0.000,0.000,0.000,0.000,29.300,1.500,1.500,1.200
GETWARN
0.000,0.000,0.000,0.000,0.000,0.000,0.000
ERR


OAC status as of 20180213
The schedule has shifted and they want the software to be ready by mid-March 2018.
What can I do before then, that will be enough for them to test??
  • there is a new getstatus command, modeled on the old getdata, which currently returns the new HK data that it's reading
    Need to include power and flow state; eventually add all the extra UMAC data it reads
  • add stopAGWHK and startAGWHK commands which will power down/up the 8051 board
  • add commands for setting and getting alarm thresholds (ignore warning for now)

OAC status as of 20171222
Initially, my test program would only get the first 14 bytes and then time out on subsequent reads. Dan thought it was because the device is too slow and I needed to code it as a while loop, reading and checking for the '\n' delimiter. However, that did the exact same thing.
So, I tried putting in a sleep for a second, after sending the STAT and before trying to read from the socket. That did it - I got the whole string after that and was able to parse it.
Then the test program would connect and read, but would not work to read over and over again - the select comes back with 0 (no descriptors available). It was not consistent - a couple of times I was able to close and then reopen and get status, but mostly not.
Dan thought maybe I didn't include the \n and that would confuse it. I had not included it, but when I did it seemed to make it worse - the selects were always timing out.
But the second STAT command in a telnet session would always work, and then it would work over and over. So, I added a \n to the beginning of the command and it consistently works.
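
The workaround above (leading \n, a pause after sending, then reading until a newline arrives) can be sketched roughly as follows. This is a Python illustration, not the actual test program, and the function name and parameters are hypothetical. It also assumes the command keeps a trailing \n as well as the leading one, which these notes do not state explicitly.

```python
import socket
import time


def query_board(host, port, command, settle=1.0, timeout=5.0):
    """Send one command and return the first non-empty reply line.

    A leading newline is sent before the command and the code pauses for
    `settle` seconds before reading, mirroring the workaround in the notes.
    Raises socket.timeout if nothing arrives within `timeout` seconds.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Assumption: leading '\n' plus a trailing '\n' terminator.
        sock.sendall(("\n" + command + "\n").encode("ascii"))
        time.sleep(settle)
        buf = b""
        while True:
            chunk = sock.recv(256)   # the reply may arrive in several pieces
            if not chunk:
                break                # peer closed the connection
            buf += chunk
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                text = line.decode("ascii", errors="replace").strip()
                if text:
                    return text
    raise ConnectionError("connection closed before a reply line arrived")
```

Accumulating into a buffer and only returning on a complete line covers the case where the first read returns just the first few bytes of the reply.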

I merged the HK thread into the oacserver that already had the onewire thread. The HK thread reads the IP address and port from configuration and every 30 seconds sends the STAT command to that address, logs the results, and writes a separate agw_N_hk telemetry stream for each AGW.
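
The shape of that HK loop can be sketched as below. This is only an illustration in Python (the real thread lives in the C oacserver), with hypothetical names. Reading the board, logging, and telemetry writing are passed in as callables because their real implementations are not shown in these notes.

```python
import threading


def poll_loop(read_status, on_status, on_error, stop_event, period=30.0):
    """Call read_status() every `period` seconds until stop_event is set.

    on_status receives each successful reading (for logging/telemetry);
    on_error receives socket-level failures so one bad poll does not end
    the loop.
    """
    while not stop_event.is_set():
        try:
            on_status(read_status())
        except OSError as exc:       # closed socket, no connection, timeout
            on_error(exc)
        stop_event.wait(period)      # returns early if a stop is requested
```

Waiting on the event rather than sleeping means a shutdown request does not have to wait out the remainder of the 30-second period.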

Created telemetry streams for the onewire data - now that data can be deleted from the agwunit telemetry in GCS.

Created a new getstatus RPC command to replace the getdata that GCS uses for the AIP.
[oac@oac oac-new]$ bin/getstatus -u4
general state:     0x1f9
general errors:    0x399
THETA state:           0
THETA errors:          0
R state:               0
R errors:              0
FOC state:             0
FOC errors:            0
FIL state:             0
FIL errors:            0
Temp1:  27 degC
Temp2:  26 degC
Temp3:  0 degC
Temp4:  0 degC
Temp5:  0 degC
Temp6:  0 degC
Temp7:  28 degC
Voltage1:       0 volt
Voltage2:       0 volt
Voltage3:       11 volt
Home Status:           0
Power Status:          0

If I compile with SIMULATOR defined, I can run both getdata and getstatus.
Topic revision: r23 - 03 Oct 2018, PetrKubanek