upcoming: adsec - AnemomterMon failed connections: test manual connections to Moxa

  • Hardware used: connected to DX anemometer Moxa while the DX adsec was removed and no processes were running on dxadsec.
  • Software versions used:
    • (vm): stable, only AnemometerMon and msgdrtdb need to run
    • (vm): SOUL-anemometermon-catch-aoexception, only AnemometerMon and msgdrtdb need to run
  • SW: Brandon

Background

ao-supervisor issue #144 describes a frequent problem where AnemometerMon will repeatedly fail to connect to the Moxa/anemometer until the Moxa is reset. AdamHousekeeper will then trigger TSS if the most recent wind speed value was received from AnemometerMon too long ago. This has been recorded in 5 ITs, potentially for different reasons, which are listed in the issue description. PR 143 fixes a bug in Issue #46 where the TCP socket is not closed for every type of connection failure, so subsequent re-connections will probably fail because the Moxa will think the client (sxadsec or dxadsec) is still connected and only allows one client to be connected at a time. From the logs, I can see that the one exception type that does delete the connection has never been raised (see my comment in the pull request for details), so the connection has never been deleted on connection failures, which include failures to configure the device after successfully establishing the TCP connection.

Code changed (PR 143)

  1. The TCP connection is deleted and the connection flag is toggled off if there is any type of exception caught on AnemometerMon::Connect() or while requesting and reading a frame in Anemometer::Periodic(). It will be re-created the next time Connect() is called. Currently, this only happens if one specific type of exception is caught, which is not the type raised by TCP connection errors or errors configuring the anemometer.
  2. AnemometerMon::Connect() now properly catches AOExceptions, which include TCPExceptions (e.g. timeouts or connection refused) and the AOExceptions raised by Anemometer::setup() (errors configuring the anemometer) and will log the reason for the connection failure, which should help future debugging.

Tests

Here's some debugging I can do. I can run this from a virtual machine, but I will need to temporarily change the Moxa configuration:

Step Result
Reproduce the problem with the current AnemometerMon.
1. Run the current msgdrtdb and AnemometerMon while the SX ASM is removed. Watch the anemometermon.L.* logs. OK.
2. Go to http://192.168.18.163/ and reset the Moxa ("Save/Restart"), which should interrupt the connection. OK.
3. Verify that the AnemometerMon log reports "Failed to connect" after an error reading the frame with no description of the failure. If it doesn't, skip to #11, as I need to find a different way to reproduce the problem. OK.
4. Verify that the AnemometerMon continues trying to re-connect but fails. If it doesn't, skip to #11, as I need to find a different way to reproduce the problem. This didn't happen. I couldn't reproduce the problem.
Try increasing the maximum number of connections.
5. Kill AnemometerMon. N/A
6. Reset the Moxa to clear the old connection if there is one. N/A
7. Increase the maximum number of connections to 2 under "Operating Settings -> Port 1." N/A
8. Start AnemometerMon. N/A
9. Reset the Moxa again to interrupt the connection. N/A
10. Check the log to see if this still results in re-connection failures or if the re-connection succeeds. If it succeeds, check the Moxa configuration page to see if the first connection is removed when it does connect. N/A
Try the new AnemometerMon.
11. Kill AnemometerMon. OK.
12. Set the maximum number of connections back to 1. N/A
13. Reset the Moxa to clear the old connection if there is one. OK.
14. Start the new AnemometerMon process. OK.
15. Reset the Moxa again. OK.
16. In the log, verify that the AnemometerMon either fails to connect and reports the reason or that it successfully reconnects. OK.
See if the TCP alive check time works in TCP server mode.
The "TCP alive check time" under "Operating Settings -> Port 1" in the Moxa configuration is set to 7 minutes. This will automatically disconnect a host that hasn't responded in 7 minutes. However, it's unclear whether this parameter is only used for TCP Client Mode (Moxa sends data to host) or for both client mode and server mode, which is what we are using (clients request data from Moxa and Moxa replies). If the TCP alive check timeis used in server mode, it's possible that the re-connection attempts are resetting the timer, since on 2022-06-24, the connection had been failing for at least 30 minutes.
17. Kill AnemometerMon. N/A
18. Reset the Moxa to clear the old connection if there is one. N/A
19. Decreasing the TCP alive check time to 1 minute. N/A
20. Start AnemometerMon (new or old, or a manual telnet session to 192.168.18.164) N/A
21. Kill AnemometerMon with a hard kill -9. and N/A
22. Wait to see if the Moxa clears the entry from the list of connected clients after the specified amount of time. N/A
23. Set the TCP alive check time back to 7 minutes. N/A
All done.
24. Kill AnemometerMon and msgdrtdb. OK.
25. Reset the Moxa and make sure TCP alive check time = 7 minutes and the maximum number of connections = 1. OK.
Summary: I wasn't able to reproduce the problem and will need to do more tests. This patch does add more descriptive logging, though, so we can go ahead and include it.

Logs:

[Started old AnemometerMon ]
AnemometerMon_2330 |INF| 1|2022-08-23 19:19:48.411790| MAIN > Starting thrdlib. Server is @ 127.0.0.1
AnemometerMon_2330 |INF| 2|2022-08-23 19:19:48.418138| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_2330 |INF| 3|2022-08-23 19:19:53.156338| MAIN > Connected.
AnemometerMon_2330 |INF| 4|2022-08-23 19:19:53.156433| MAIN > Started.

[ Reset Moxa]
AnemometerMon_2330 |INF| 5|2022-08-23 19:20:06.830806| MAIN > Retrying reading frame (timeout (1605 > 1000 ms) - -5001: Timeout error)
AnemometerMon_2330 |INF| 6|2022-08-23 19:20:06.930203| MAIN > Failed to read frame. (missing parameters)
AnemometerMon_2330 |INF| 7|2022-08-23 19:20:06.930581| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_2330 |INF| 8|2022-08-23 19:20:07.755312| MAIN > Failed to connect to device
AnemometerMon_2330 |INF| 9|2022-08-23 19:20:11.016038| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_2330 |INF| 10|2022-08-23 19:20:23.240208| MAIN > Connected.

[Ctrl-C]
AnemometerMon_2330 |INF| 11|2022-08-23 19:20:27.767137| MAIN > Signal 2 received. It's timeToDie.
AnemometerMon_2330 |INF| 12|2022-08-23 19:20:27.824524| MAIN > Periodic loop stopped

[Started old AnemometerMon ]
AnemometerMon_3271 |INF| 1|2022-08-23 19:24:31.660726| MAIN > Starting thrdlib. Server is @ 127.0.0.1
AnemometerMon_3271 |INF| 2|2022-08-23 19:24:31.666893| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_3271 |INF| 3|2022-08-23 19:24:36.408751| MAIN > Connected.
AnemometerMon_3271 |INF| 4|2022-08-23 19:24:36.408815| MAIN > Started.

[Reset Moxa]
AnemometerMon_3271 |INF| 5|2022-08-23 19:24:44.695761| MAIN > Retrying reading frame (Error receiving data (errno=104, error message means: Connection reset by peer, received Bytes = -1 ), socket closed. - -15003: TcpConnection receive error)
AnemometerMon_3271 |WAR| 6|2022-08-23 19:24:44.695820| TCPCONNECTION > TcpConnection: error in select(...)
AnemometerMon_3271 |INF| 7|2022-08-23 19:24:44.695862| MAIN > Failed to read frame. (UNKNOWN)
AnemometerMon_3271 |INF| 8|2022-08-23 19:24:44.695932| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_3271 |ERR| 9|2022-08-23 19:24:45.489352| TCPCONNECTION > TcpConnection creation failed
AnemometerMon_3271 |INF| 10|2022-08-23 19:24:45.489526| MAIN > Failed to connect to device
AnemometerMon_3271 |INF| 11|2022-08-23 19:24:49.011393| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_3271 |INF| 12|2022-08-23 19:25:00.755459| MAIN > Connected.

[Ctrl-C]
AnemometerMon_3271 |INF| 13|2022-08-23 19:25:05.004138| MAIN > Signal 2 received. It's timeToDie.
AnemometerMon_3271 |INF| 14|2022-08-23 19:25:05.044768| MAIN > Periodic loop stopped

[Started new AnemometerMon ]
AnemometerMon_3885 |INF| 1|2022-08-23 21:30:11.041768| MAIN > Starting thrdlib. Server is @ 127.0.0.1
AnemometerMon_3885 |INF| 2|2022-08-23 21:30:11.047627| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_3885 |INF| 3|2022-08-23 21:30:15.796288| MAIN > Connected.
AnemometerMon_3885 |INF| 4|2022-08-23 21:30:15.796334| MAIN > Started.

[Reset Moxa]
AnemometerMon_3885 |INF| 5|2022-08-23 21:31:18.583465| MAIN > Retrying reading frame ([TcpReceiveException] Error receiving data (errno=104, error message means: Connection reset by peer, received Bytes = -1 ), socket closed. (code -15003) TcpConnection receive error File TcpConnection.cpp line 133)
AnemometerMon_3885 |WAR| 6|2022-08-23 21:31:18.583614| TCPCONNECTION > TcpConnection: error in select(...)
AnemometerMon_3885 |INF| 7|2022-08-23 21:31:18.583735| MAIN > Failed to read frame. ([TcpReceiveException] Error receiving data (code -15003) TcpConnection receive error File TcpConnection.cpp line 113)
AnemometerMon_3885 |INF| 8|2022-08-23 21:31:18.584111| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_3885 |ERR| 9|2022-08-23 21:31:18.987561| TCPCONNECTION > TcpConnection creation failed
AnemometerMon_3885 |INF| 10|2022-08-23 21:31:18.987767| MAIN > Failed to connect to device ([TcpCreationException] Impossible to connect to 192.168.18.163:4001 (code -15001) TcpConnection creation error File TcpConnection.cpp line 44)
AnemometerMon_3885 |INF| 11|2022-08-23 21:31:23.003279| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_3885 |INF| 12|2022-08-23 21:31:34.762020| MAIN > Connected.

[Ctrl-C]
AnemometerMon_3885 |INF| 13|2022-08-23 21:31:36.944229| MAIN > Signal 2 received. It's timeToDie.
AnemometerMon_3885 |INF| 14|2022-08-23 21:31:36.956489| MAIN > Periodic loop stopped

[Started old AnemometerMon ]
AnemometerMon_4431 |INF| 1|2022-08-23 21:37:08.685916| MAIN > Starting thrdlib. Server is @ 127.0.0.1
AnemometerMon_4431 |INF| 2|2022-08-23 21:37:08.692135| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_4431 |INF| 3|2022-08-23 21:37:13.428178| MAIN > Connected.
AnemometerMon_4431 |INF| 4|2022-08-23 21:37:13.428260| MAIN > Started.

[Killed msgdrtb]
AnemometerMon_4431 |FAT| 5|2022-08-23 21:40:36.220482| MAIN > [AOVarException] Error writing variable (code -5001) Timeout error File RTDBvar.cpp line 496Variable: ANEM.L.CKSPEED (type REAL_VARIABLE, len 1)
AnemometerMon_4431 |FAT| 5|2022-08-23 21:40:36.220482| MAIN > .

[Ctrl-C]
AnemometerMon_4431 |INF| 6|2022-08-23 21:44:04.033200| MAIN > Signal 2 received. It's timeToDie.
AnemometerMon_4431 |INF| 7|2022-08-23 21:44:04.060288| MAIN > Periodic loop stopped

[Started old AnemometerMon ]
AnemometerMon_27592|INF| 1|2022-08-23 23:48:40.870882| MAIN > Starting thrdlib. Server is @ 127.0.0.1
AnemometerMon_27592|INF| 2|2022-08-23 23:48:40.877212| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_27592|INF| 3|2022-08-23 23:48:45.629269| MAIN > Connected.
AnemometerMon_27592|INF| 4|2022-08-23 23:48:45.690148| MAIN > Started.

[Reset Moxa]
AnemometerMon_27592|WAR| 5|2022-08-23 23:49:57.205976| TCPCONNECTION > TcpConnection: timeout 5000 ms expired
AnemometerMon_27592|INF| 6|2022-08-23 23:49:57.206182| MAIN > Retrying reading frame (Timeout expired - -15003: TcpConnection receive error)
AnemometerMon_27592|WAR| 7|2022-08-23 23:50:02.206371| TCPCONNECTION > TcpConnection: timeout 5000 ms expired
AnemometerMon_27592|INF| 8|2022-08-23 23:50:02.206597| MAIN > Failed to read frame. (UNKNOWN)
AnemometerMon_27592|INF| 9|2022-08-23 23:50:02.207276| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_27592|INF| 10|2022-08-23 23:50:13.942489| MAIN > Connected.

[Crl-C]
AnemometerMon_27592|INF| 11|2022-08-23 23:51:55.702154| MAIN > Signal 2 received. It's timeToDie.
AnemometerMon_27592|INF| 12|2022-08-23 23:51:55.777705| MAIN > Periodic loop stopped

[Started old AnemometerMon ]
AnemometerMon_27887|INF| 1|2022-08-23 23:52:31.020291| MAIN > Starting thrdlib. Server is @ 127.0.0.1
AnemometerMon_27887|INF| 2|2022-08-23 23:52:31.026751| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_27887|INF| 3|2022-08-23 23:52:35.756704| MAIN > Connected.
AnemometerMon_27887|INF| 4|2022-08-23 23:52:35.756801| MAIN > Started.

[Reset Moxa]
AnemometerMon_27887|WAR| 5|2022-08-23 23:52:58.397312| TCPCONNECTION > TcpConnection: timeout 5000 ms expired
AnemometerMon_27887|INF| 6|2022-08-23 23:52:58.397795| MAIN > Retrying reading frame (Timeout expired - -15003: TcpConnection receive error)
AnemometerMon_27887|WAR| 7|2022-08-23 23:53:03.403054| TCPCONNECTION > TcpConnection: timeout 5000 ms expired
AnemometerMon_27887|INF| 8|2022-08-23 23:53:03.403324| MAIN > Failed to read frame. (UNKNOWN)
AnemometerMon_27887|INF| 9|2022-08-23 23:53:03.403790| MAIN > Connecting to anemometer (192.168.18.163:4001)
AnemometerMon_27887|INF| 10|2022-08-23 23:53:15.154020| MAIN > Connected.

[Ctrl-C]
AnemometerMon_27887|INF| 11|2022-08-23 23:54:05.952234| MAIN > Signal 2 received. It's timeToDie.
AnemometerMon_27887|INF| 12|2022-08-23 23:54:05.975818| MAIN > Periodic loop stopped

Topic revision: r15 - 07 Sep 2022, BrandonMechtley
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback