Code cleanup and null checks #373
Conversation
Pull request overview
This pull request adds defensive null checks and improves code quality in the AsyncEventSource module. The changes focus on preventing potential null pointer dereferences and improving the handling of disconnected clients.
Changes:
- Added null checks for the `_client` pointer in the `_queueMessage` and `_runQueue` methods
- Refactored `avgPacketsWaiting()` to use a ternary operator for cleaner division-by-zero prevention
- Added `connected()` checks in `send()` and `_adjust_inflight_window()` to skip disconnected clients
- Improved `_adjust_inflight_window()` to use the connected client count instead of the total client count
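The overall shape of these changes can be sketched roughly as follows. This is a minimal illustration only: `FakeClient`, `packetsWaiting()`, and the function body are invented stand-ins, not the library's actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an AsyncEventSourceClient; member names are
// assumptions for illustration, not the real library API.
struct FakeClient {
  bool isConnected = false;
  size_t queued = 0;
  bool connected() const { return isConnected; }
  size_t packetsWaiting() const { return queued; }
};

// Ternary-based division-by-zero prevention, as the PR overview describes
// for avgPacketsWaiting(): only connected clients contribute, and an empty
// (or fully disconnected) list yields 0 instead of dividing by zero.
size_t avgPacketsWaiting(const std::vector<FakeClient> &clients) {
  size_t aql = 0, nConnected = 0;
  for (const auto &c : clients) {
    if (c.connected()) {  // skip disconnected clients
      aql += c.packetsWaiting();
      ++nConnected;
    }
  }
  return nConnected ? (aql / nConnected) : 0;
}
```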
Re this one and #370: there's a recurring pattern/problem in this codebase regarding the ownership and life cycle of 'SomeClient' type objects when they're also attached to a "SomeServer" type object that wants to operate on a list of them. In the original AsyncServer->AsyncWebRequest->AsyncWebResponse architecture, the Request and Response objects are "self-owned" and managed by their AsyncClient's life cycle. But with AsyncWebSocketClient and AsyncEventClient, we must have a second reference to these objects in a list in their associated Source type, making cleanup of these objects somewhat more complex: they cannot simply be deleted when their AsyncClient terminates, as they might be in use by another task; and that other task has to still interlock with the fact that the AsyncClient has been invalidated. (AsyncTCP handles this gracefully, as it's now interlocked internally, but I haven't personally reviewed the older ESP8266 ESPAsyncTCP; perhaps I should.)
I think we probably want to consider building a common data structure to manage this pattern across the different use cases in this library -- something that allows us to safely do for(client& c: clients) c->write() without getting into trouble. (Also: the major feature I added to the WLED AWS fork is to recycle that pattern for AsyncWebRequest as well, so as to support a request queue to manage heap load.) I spent some time looking at this, but I don't yet have a coherent API suggestion. I think we can do some magic with std::shared_ptr/std::weak_ptr and making a temporary stack copy of the list contents during iteration to minimize a lot of the friction.
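A minimal sketch of that std::shared_ptr/std::weak_ptr snapshot idea. All type and method names here are invented for illustration; this is not a proposed API for the library, just the shape of the pattern.

```cpp
#include <algorithm>
#include <memory>
#include <mutex>
#include <vector>

// Toy client: the server never owns it outright, it only holds weak refs.
struct Client {
  int written = 0;
  void write() { ++written; }
};

class ClientList {
 public:
  void add(const std::shared_ptr<Client> &c) {
    std::lock_guard<std::mutex> lock(_mtx);
    _clients.push_back(c);
  }

  // Broadcast over a temporary stack snapshot: the lock is held only while
  // copying, not while calling write(), and expired entries (clients already
  // destroyed, e.g. by a disconnect on another task) are pruned as we go.
  size_t broadcast() {
    std::vector<std::shared_ptr<Client>> snapshot;
    {
      std::lock_guard<std::mutex> lock(_mtx);
      _clients.erase(std::remove_if(_clients.begin(), _clients.end(),
                                    [&](const std::weak_ptr<Client> &w) {
                                      if (auto s = w.lock()) {
                                        snapshot.push_back(std::move(s));
                                        return false;  // keep live entry
                                      }
                                      return true;  // client already destroyed
                                    }),
                     _clients.end());
    }
    // Safe iteration: the snapshot's shared_ptrs keep the clients alive even
    // if the list is mutated concurrently while we write.
    for (auto &c : snapshot) c->write();
    return snapshot.size();
  }

 private:
  std::mutex _mtx;
  std::vector<std::weak_ptr<Client>> _clients;
};
```

The key property is that a client destroyed mid-broadcast simply fails to lock and is skipped, rather than being dereferenced as a dangling pointer.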
```cpp
    ++hits;
  } else {
    ++miss;
    if (c->connected()) {
```
Do we also need a client-pointer-is-not-null check here? (and all the other c->connected() pattern usages?)
Not in this PR at least, because c is the unique_ptr of an AsyncEventSourceClient object wrapping the _client pointer. c->connected() is checking for a null _client ptr behind it. c is not supposed to be null.
This PR is just some code cleanup and null checks.
I would rather discuss how to correctly protect the iteration over the AsyncEventSourceClient list in #370 which was opened for that goal.
Yes, that's typically the kind of effort I wanted to initiate through #370, which is not ready for review yet: it was just an attempt to prove that the source of the issue @zekageri saw was the modification of the list during its iteration. But there were also some missing NPE checks. Let's focus on this cleanup PR first, and then we will see how to correctly solve this problem.

I have removed the cleanup call entirely and tested the whole day, but so far no crash.

Thanks a lot for being proactive! That's what I was going to ask you... to test this branch. I saw some missing null checks, so I was hoping they would solve the issue you saw. So that's perfect then!

Ah wait! I think I misunderstood... Can you please test this PR branch? It only has some null checks but still has the list erase doing a mutation in the disconnect handler. If this branch fixes your issue, then it was a missing null check. If this branch still has the bug, then it is the list mutation that is problematic. Thanks 🙏
Oh okay, will test it tomorrow

Once there is a final version, I'm happy to retest on 8266 as well. Feel free to ping me here or reach out on discord (same handle)
@bdraco: this PR is final and rebased on top of main, which also includes your previous PR. So it would be nice if you could also test this branch and give your go-ahead for the merge 👍 Note: we can merge this PR even if @zekageri still sees the issue on ESP32 (regarding the list mutation), because IF there is still an issue regarding list mutation on ESP32, we will fix it in another PR.

Great, I'll test it after we finish with the ESPHome patch release

Testing passes on the reproducer. Trying ESPHome now.
Can no longer repro the crash on ESPHome with heavy reloads (tested with this PR + other fixes, since main isn't merged in here):

```diff
diff --git a/esphome/components/web_server_base/__init__.py b/esphome/components/web_server_base/__init__.py
index d5d75b395d..4be653c362 100644
--- a/esphome/components/web_server_base/__init__.py
+++ b/esphome/components/web_server_base/__init__.py
@@ -48,4 +48,5 @@ async def to_code(config):
     if CORE.is_libretiny:
         CORE.add_platformio_option("lib_ignore", ["ESPAsyncTCP", "RPAsyncTCP"])
     # https://github.com/ESP32Async/ESPAsyncWebServer/blob/main/library.json
-    cg.add_library("ESP32Async/ESPAsyncWebServer", "3.7.10")
+    # Testing PR #370 for ESP8266 SSE crash fix
+    cg.add_library("https://github.com/bdraco/ESPAsyncWebServer.git#pr-370", None)
```

Pages of logs on the console though, but that's expected. Might be nice to track dropped messages and only log every so often that x messages were discarded, to reduce the serial blocking, but a nice-to-have for sure.
Tested rtlxxx with libretiny as well. All good.

ESPHome PR memory impact analysis: esphome/esphome#13467 (comment)
Yes, we know. Logging is something we would like to improve, along with a way to collect stats instead of logging like that. FYI @me-no-dev - you came up with this idea last time as well.

Thanks a lot for your testing @bdraco! I will ask the team for approval / review. Do you need a release just after for your upgrade? We can tackle the esp32 issue in a follow-up PR / release.

A release in the next week would be great timing, as it gives time to bake in 2026.2.x-dev for 3-4 weeks and for the ESP8266 users that test dev to report issues.

@me-no-dev @vortigont @willmmiles: if you can, have a look and approve / merge. Thank you!

I don't see the problem anymore. Strange, since I have removed the cleanup call as well. Maybe it was another issue I had with TLS access. Sorry if it was a false report.
I have added some null checks in this area so maybe that helped. |
Adds some null checks to avoid any usage of the _client pointer while a loop is in progress calling send() and a disconnect event arrives at the same time (the disconnect event frees the _client pointer).
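As a rough model of the race described above — all names here are illustrative stand-ins, not the library's real types — the defensive checks look something like:

```cpp
#include <cstddef>
#include <queue>
#include <string>

// Toy TCP client; in the real library this would be the AsyncClient that the
// disconnect event frees.
struct TcpClient {
  bool canSend() const { return true; }
};

class EventClient {
 public:
  explicit EventClient(TcpClient *c) : _client(c) {}

  // Defensive null check at the entry point: a disconnect arriving between
  // send() calls clears _client, so we re-check it before dereferencing.
  bool send(const std::string &msg) {
    if (_client == nullptr) return false;  // already disconnected, drop
    _queue.push(msg);
    _runQueue();
    return true;
  }

  void onDisconnect() { _client = nullptr; }  // disconnect frees the client
  size_t sent() const { return _sent; }

 private:
  void _runQueue() {
    // Same guard inside the queue runner, in case the disconnect landed
    // after send() checked but before the queue was drained.
    while (_client != nullptr && !_queue.empty() && _client->canSend()) {
      _queue.pop();
      ++_sent;
    }
  }
  TcpClient *_client;
  std::queue<std::string> _queue;
  size_t _sent = 0;
};
```

Note this sketch only models the single-task case; the concurrent-iteration problem discussed around #370 (list mutation during iteration) still needs the separate fix described earlier in the thread.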