⛰️ Ops / Incidents / 02-12-2024 - Turtle Outage

WIP DOCUMENT

turtle lost its static IP address (allocated by the AT&T router) the day after the autossh outage

confusingly, the autossh service did not retry indefinitely to reestablish the connection. logs:

Feb 11 20:05:06 turtle autossh[528837]: ssh child pid is 528840
Feb 12 08:15:00 turtle autossh[528837]: received signal to exit (15)
Feb 12 08:15:00 turtle systemd[1]: Stopping autossh.service - Keeps a tunnel to refrigerator open on localhost 10000...
Feb 12 08:15:00 turtle systemd[1]: autossh.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 08:15:00 turtle systemd[1]: autossh.service: Failed with result 'exit-code'.
Feb 12 08:15:00 turtle systemd[1]: Stopped autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 08:15:00 turtle systemd[1]: autossh.service: Consumed 47.174s CPU time.
-- Boot 5fd0d9a015e5443db344eca75673cd3e --
Feb 12 08:15:29 turtle systemd[1]: Started autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 08:15:29 turtle autossh[517]: port set to 0, monitoring disabled
Feb 12 08:15:29 turtle autossh[517]: starting ssh (count 1)
Feb 12 08:15:30 turtle systemd[1]: autossh.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 08:15:30 turtle autossh[531]: ssh: Could not resolve hostname refrigerator.hup.is: Name or service not known
Feb 12 08:15:29 turtle autossh[517]: ssh child pid is 531
Feb 12 08:15:30 turtle systemd[1]: autossh.service: Failed with result 'exit-code'.
Feb 12 08:15:30 turtle autossh[517]: ssh exited prematurely with status 255; autossh exiting
Feb 12 08:15:31 turtle systemd[1]: autossh.service: Scheduled restart job, restart counter is at 1.
Feb 12 08:15:31 turtle systemd[1]: Stopped autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 08:15:31 turtle systemd[1]: Started autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 08:15:31 turtle autossh[583]: port set to 0, monitoring disabled
Feb 12 08:15:31 turtle autossh[583]: starting ssh (count 1)
Feb 12 08:15:31 turtle autossh[583]: ssh child pid is 587
Feb 12 08:15:31 turtle autossh[587]: ssh: Could not resolve hostname refrigerator.hup.is: Name or service not known
Feb 12 08:15:31 turtle autossh[583]: ssh exited prematurely with status 255; autossh exiting
Feb 12 08:15:31 turtle systemd[1]: autossh.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 08:15:31 turtle systemd[1]: autossh.service: Failed with result 'exit-code'.
Feb 12 08:15:32 turtle systemd[1]: autossh.service: Scheduled restart job, restart counter is at 2.
Feb 12 08:15:32 turtle systemd[1]: Stopped autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 08:15:32 turtle systemd[1]: Started autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 08:15:32 turtle autossh[729]: port set to 0, monitoring disabled
Feb 12 08:15:32 turtle autossh[729]: starting ssh (count 1)
Feb 12 08:15:32 turtle autossh[729]: ssh child pid is 732
Feb 12 08:15:32 turtle autossh[732]: ssh: Could not resolve hostname refrigerator.hup.is: Name or service not known
Feb 12 08:15:32 turtle systemd[1]: autossh.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 08:15:32 turtle autossh[729]: ssh exited prematurely with status 255; autossh exiting
Feb 12 08:15:32 turtle systemd[1]: autossh.service: Failed with result 'exit-code'.
Feb 12 08:15:33 turtle systemd[1]: autossh.service: Scheduled restart job, restart counter is at 3.
Feb 12 08:15:33 turtle systemd[1]: Stopped autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 08:15:33 turtle systemd[1]: Started autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 08:15:33 turtle autossh[733]: port set to 0, monitoring disabled
Feb 12 08:15:33 turtle autossh[733]: starting ssh (count 1)
Feb 12 08:15:33 turtle autossh[733]: ssh child pid is 736
Feb 12 08:15:33 turtle autossh[736]: ssh: Could not resolve hostname refrigerator.hup.is: Name or service not known
Feb 12 08:15:33 turtle autossh[733]: ssh exited prematurely with status 255; autossh exiting
Feb 12 08:15:33 turtle systemd[1]: autossh.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 08:15:33 turtle systemd[1]: autossh.service: Failed with result 'exit-code'.
Feb 12 08:15:34 turtle systemd[1]: autossh.service: Scheduled restart job, restart counter is at 4.
Feb 12 08:15:34 turtle systemd[1]: Stopped autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 08:15:34 turtle systemd[1]: Started autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 08:15:34 turtle autossh[753]: port set to 0, monitoring disabled
Feb 12 08:15:34 turtle autossh[753]: starting ssh (count 1)
Feb 12 08:15:34 turtle autossh[753]: ssh child pid is 757
Feb 12 13:03:44 turtle autossh[753]: received signal to exit (15)
Feb 12 13:03:44 turtle systemd[1]: Stopping autossh.service - Keeps a tunnel to refrigerator open on localhost 10000...
Feb 12 13:03:44 turtle systemd[1]: autossh.service: Main process exited, code=exited, status=1/FAILURE
Feb 12 13:03:44 turtle systemd[1]: autossh.service: Failed with result 'exit-code'.
Feb 12 13:03:44 turtle systemd[1]: Stopped autossh.service - Keeps a tunnel to refrigerator open on localhost 10000.
Feb 12 13:03:44 turtle systemd[1]: autossh.service: Consumed 16.849s CPU time.
-- Boot a0388e42e7144ec9befe6c1c637f8b13 --

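OH! looking at the logs, it did manage to get a connection going. the first four starts after the 08:15 boot died on DNS resolution, and the start after restart 4 (the fifth attempt, ssh child pid 757) never logged a failure and ran until the 13:03:44 shutdown. that's why it stopped retrying. the outage was not because of autossh at all (we knew that); it was purely because turtle lost its allocated IP address on the router.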
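to double-check that reading without eyeballing the journal again, here is a minimal sketch that counts autossh starts and premature exits. the sample lines are copied from the excerpt above; the `summarize` helper and the regex are just illustrative, not anything that runs on turtle.

```python
import re

# Lines copied from the journal excerpt above (one start/exit pair per attempt).
SAMPLE = """\
Feb 12 08:15:29 turtle autossh[517]: starting ssh (count 1)
Feb 12 08:15:30 turtle autossh[517]: ssh exited prematurely with status 255; autossh exiting
Feb 12 08:15:31 turtle autossh[583]: starting ssh (count 1)
Feb 12 08:15:31 turtle autossh[583]: ssh exited prematurely with status 255; autossh exiting
Feb 12 08:15:32 turtle autossh[729]: starting ssh (count 1)
Feb 12 08:15:32 turtle autossh[729]: ssh exited prematurely with status 255; autossh exiting
Feb 12 08:15:33 turtle autossh[733]: starting ssh (count 1)
Feb 12 08:15:33 turtle autossh[733]: ssh exited prematurely with status 255; autossh exiting
Feb 12 08:15:34 turtle autossh[753]: starting ssh (count 1)
Feb 12 13:03:44 turtle autossh[753]: received signal to exit (15)
"""

# Captures the PID inside autossh[...] and the message after it.
LINE = re.compile(r"autossh\[(\d+)\]: (.*)$")


def summarize(journal_text):
    """Return (attempts, failed): autossh PIDs that started ssh, and those whose ssh died early."""
    attempts, failed = [], []
    for line in journal_text.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        pid, message = int(match.group(1)), match.group(2)
        if message.startswith("starting ssh"):
            attempts.append(pid)
        elif message.startswith("ssh exited prematurely"):
            failed.append(pid)
    return attempts, failed


attempts, failed = summarize(SAMPLE)
survivors = [pid for pid in attempts if pid not in failed]
print(f"{len(attempts)} starts, {len(failed)} premature exits, survivors: {survivors}")
```

on the sample this reports five starts and four premature exits, leaving autossh pid 753 as the one that stayed up. note that "stayed up" only means ssh did not exit; with monitoring disabled (port 0) the logs alone can't prove the tunnel was actually passing traffic.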

i don't have time to write more this morning because i have work. here are reminder notes for later:

  • i was messing with IPs over the weekend for the farm testing, and did notice that turtle wasn't showing up in the IP allocation UI anymore. i figured this was borky AT&T software; clearly it wasn't. no idea how it got into this state

  • to mitigate, i just went into the UI and reassigned the IP. currently you have to be physically at the internet location to do this (possibly we can change that)

  • does this usher in the world where we buy our own networking gear? i would like to delay if possible but this outage is hard to argue with
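one cheap thing to try before buying gear: have turtle notice when its own LAN address stops matching the reservation. a rough sketch follows, assuming the failure shows up as turtle holding a different address than the reserved one (these notes don't confirm that). `EXPECTED_IP` is a placeholder since the real reserved address isn't recorded here, and the function names are made up for illustration.

```python
import ipaddress
import socket

# Placeholder: the real reserved address is not recorded in these notes.
EXPECTED_IP = "192.168.1.50"


def current_lan_ip(probe_host="192.0.2.1", probe_port=9):
    """Return the source address the kernel would use for outbound traffic, or None.

    connect() on a UDP socket sends no packets; it only selects a route,
    so the probe host (a reserved documentation address) is never contacted.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.connect((probe_host, probe_port))
            return sock.getsockname()[0]
    except OSError:
        # No usable route, e.g. the network is down entirely.
        return None


def ip_drifted(expected, actual):
    """True when the two addresses differ after normalizing both."""
    return ipaddress.ip_address(expected) != ipaddress.ip_address(actual)


if __name__ == "__main__":
    actual = current_lan_ip()
    if actual is None:
        print("no route, could not determine LAN address")
    elif ip_drifted(EXPECTED_IP, actual):
        print(f"LAN address is {actual}, expected {EXPECTED_IP}")
    else:
        print("LAN address matches the reservation")
```

this only detects the drift; the fix would still be the manual reassignment in the router UI unless remote access to it gets sorted out.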