⛰️OpsIncidents02-11-2024 - Git And Ci Outage

WIP DOCUMENT

todo:

  • does turtle reboot tonight? (in logs it will be 8:15UTC on feb 12 and should boot into kernel 6.1.0-18-amd64)

    • it did! and it had a different outage as a result but the autossh issue was not part of it

  • hurricane electric - did they have a dns outage this morning (feb 10/11)

    • they did not

  • autossh retry logic & document in comments in systemd unit file

    • i forgot to explain in the unit file comments i think, adding to todo list

TIMELINE

  • 2024/02/11 03:30:27EST: tree reboots - autoupdate

  • 2024/02/11 03:30:28EST: autossh fails with

    • ssh: Could not resolve hostname refrigerator.hup.is: Temporary failure in name resolution

    • note: there is no other retry until reboot at 9:05

  • 2024/02/11 03:35EST: updown sends email announcing that git.bunk.computer and ci.bunk.computer are unavailable

  • 2024/02/11 09:00EST: maren wakes up and sees the emails and is then very awake

  • 2024/02/11 09:01EST-09:38EST: maren ssh's into all the servers and reboots everything a lot

    • at some point, and with no modifications to running configs, autossh connection starts working again

    • at some point, maren changed autossh to target IP of refrigerator instead of resolving refrigerator.hup.is

    • maren sees in the logs that the actual failure was Temporary failure in name resolution

  • 2024/02/11 09:05:37EST: maren reboots tree (why?)

  • 2024/02/11 09:39EST: outage resolved, maren posted in discord to say so

  • 2024/02/11 LATER: we realized it's possible that this was caused by the autossh service trying to do a dns lookup before systemd-resolved is up, so we investigated that

  • 2024/02/11 FINAL: dns lookup/resolved conflict idea was correct, fixed in unit file


troubleshooting systemd unit

network-online.target vs network.target: https://systemd.io/NETWORK_ONLINE/

systemd-networkd-wait-online.service (one of the services that can back network-online.target) is disabled by default on our servers (default debian bookworm):

root@turtle:~# systemctl is-enabled systemd-networkd-wait-online.service
disabled

stack overflow suggests targeting nss-lookup.target directly

further reading on network-online.target indicates that waiting for it in a unit file can severely delay the service starting. it's recommended by systemd itself to avoid wanting that target unless you really need it and truly understand its behavior. systemd instead just recommends making your thing actually resilient to the unreliable nature of networks

IN CONCLUSION: instead of deeply understanding what systemd's default definition of "the network is up" is, we want to make our service retry for a while using this strategy: https://stackoverflow.com/a/39284869

  • DONE
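
for reference, a minimal sketch of what that retry strategy can look like in a unit file. this is not a copy of our actual unit: the description, the ssh user, the forwarded ports, and the specific timings are made-up placeholders, and only the Restart=/RestartSec=/StartLimit* directives are the point. with these example numbers systemd restarts autossh about every 10 seconds and gives up after 30 starts, so roughly five minutes of retrying after boot.

```ini
[Unit]
Description=autossh tunnel to refrigerator (hypothetical example)
After=network.target
# allow at most 30 starts within any 10 minute window; once that is
# exceeded, systemd stops retrying and marks the unit as failed
StartLimitIntervalSec=600
StartLimitBurst=30

[Service]
# placeholder command: user, ports, and ssh options are assumptions
ExecStart=/usr/bin/autossh -M 0 -N -o ServerAliveInterval=30 -o ExitOnForwardFailure=yes -R 2222:localhost:22 tunnel@refrigerator.hup.is
# as far as we understand, autossh gives up and exits if the very first
# ssh attempt fails (e.g. dns is not ready yet), so let systemd restart
# it instead of leaving it dead until the next reboot
Restart=on-failure
# wait 10 seconds between attempts
RestartSec=10

[Install]
WantedBy=multi-user.target
```

after changing a unit file, `systemctl daemon-reload` picks up the edit, and `journalctl -u <unit name>` should show each restart attempt.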