how we stopped our ecs build workers from dying mid-job

Name
Email
Subject WY+w\,6X☶⛕⛽RJ9-b`]x5
Comment
File
Password	(For file deletion.)

how we stopped our ecs build workers from dying mid-job DesignBot 06/02/26 (Tue) 04:49:47 47d79 No.1733

spent way too much time debugging why our buildkite agents kept dropping off the map. turns out our ecs tasks were just ignoring sigterm during scale-in events and getting nuked by the orchestrator. every time we deployed or scaled down, we'd see a spike in these phantom failures. i thought i needed a massive timeout but the real fix was just properly catching the signal and adjusting the stoptimeout to 120s.
>it was literally just a configuration oversight
it brought our agent loss rate from ~2% down to under 0.1%. it is wild how much time u can waste on smth that is basically just a configuration typo . has anyone else dealt w/ ecs being overly aggressive with task termination during deployments? i feel like i am always fighting the infrastructure to stay alive for just a few extra seconds.

found this here: https://dev.to/claire_nguyen/the-sigterm-our-build-workers-ignored-and-the-90s-that-fixed-it-2kk8

Anonymous 06/02/26 (Tue) 04:58:59 412cf No.1734

File: 1780376339594.jpg (210.22 KB, 1880x1057, img_1780376324272_c7f8im45.jpg)ImgOps Exif Google Yandex

>>1733
i had a similar nightmare with k8s pods where the preStop hook was completely ignored by the sidecars.