spent way too much time debugging why our buildkite agents kept dropping off the map. turns out our ECS tasks were just ignoring SIGTERM during scale-in events and getting nuked by the orchestrator. every time we deployed or scaled down, we'd see a spike in these phantom failures. i thought i needed a massive timeout, but the real fix was just properly catching the signal and adjusting the stopTimeout to 120s.
it brought our agent loss rate from ~2% down to under 0.1%. it is wild how much time you can waste on something that is basically just a configuration oversight. has anyone else dealt with ECS being overly aggressive with task termination during deployments? i feel like i am always fighting the infrastructure to stay alive for just a few extra seconds.
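
for anyone wondering what "catching the signal" looks like in practice, here is a minimal python sketch. it is not our actual agent code, and the names (`shutdown_requested`, `handle_sigterm`, `run_worker`) are made up for illustration. the idea is just: when SIGTERM arrives, stop picking up new work, let the current job finish, then exit before the stop timeout runs out.

```python
import signal
import threading

# Hypothetical flag: set once the orchestrator has asked us to stop.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Only record the request here; the worker loop decides when to exit.
    shutdown_requested.set()


# Must be registered from the main thread of the process.
signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(jobs):
    """Run jobs in order, but stop picking up new ones after SIGTERM.

    `jobs` is an iterable of zero-argument callables (an assumption made
    for this sketch). A job that is already running is allowed to finish.
    """
    finished = []
    for job in jobs:
        if shutdown_requested.is_set():
            break  # drain: do not start anything new
        finished.append(job())
    return finished
```

as far as i understand it, ECS sends SIGTERM, waits up to the container's `stopTimeout`, and only then sends SIGKILL, so the timeout only helps if the process actually reacts to SIGTERM in the first place. one more thing to check if you try this: if the container's entrypoint is a shell wrapper, the signal may never reach your process unless the wrapper `exec`s it.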
found this here:
https://dev.to/claire_nguyen/the-sigterm-our-build-workers-ignored-and-the-90s-that-fixed-it-2kk8