Here is some more information about how the anka plugin and Anka are behaving. We clone the VM, run a command (sleep), and then I cancel the job from Buildkite.
The pre-exit hook picks up the cancellation and issues the anka suspend command. The logs show that it gets through most of the suspend steps and then stops abruptly:
2019-03-14 10:26:01,601 [anka_manager.py:56]: vm 0fbc1d26-4665-11e9-93b4-00e04c6834c6 is running (PID 82905)
2019-03-14 10:26:01,602 [image.py:52]: /Users/nathan.pierce/Library/Application Support/Veertu/Anka/state_lib/17c642c5-4665-11e9-90b4-00e04c6834c6.ank
2019-03-14 10:26:01,602 [image.py:177]: disk image create command: /Library/Application Support/Veertu/Anka/bin/anka_image create /Users/nathan.pierce/Library/Application Support/Veertu/Anka/state_lib/17c642c5-4665-11e9-90b4-00e04c6834c6.ank 17314086912
2019-03-14 10:26:01,613 [image.py:52]: /Users/nathan.pierce/Library/Application Support/Veertu/Anka/state_lib/17c642c5-4665-11e9-90b4-00e04c6834c6.ank
2019-03-14 10:26:01,614 [anka_vm_process_manager.py:39]: suspending vm
2019-03-14 10:26:01,614 [socket_communicator.py:49]: {"file": "/Users/nathan.pierce/Library/Application Support/Veertu/Anka/state_lib/17c642c5-4665-11e9-90b4-00e04c6834c6.ank"}
2019-03-14 10:26:01,615 [socket_communicator.py:55]: send 123 bytes
2019-03-14 10:26:01,616 [socket_communicator.py:72]: {u'status': u'OK', u'message': u'suspended'}
2019-03-14 10:26:01,616 [anka_vm_process_manager.py:41]: {u'status': u'OK', u'message': u'suspended'}
2019-03-14 10:26:01,617 [anka_manager.py:197]: com.veertu.ankahv.0fbc1d26-4665-11e9-93b4-00e04c6834c6: waiting for task termination
The Anka logs above show that it is waiting for the TERM signal that was sent to finish killing off all active processes on the host machine. This can sometimes take around 20 seconds. However, within 5 seconds Buildkite reports "Exited with status -1 (agent lost)" instead.
If I test this locally, without Buildkite, the suspend works just fine:
2019-03-14 17:30:20,516 [anka_manager.py:227]: com.veertu.ankahv.f55f5f00-3d05-11e9-bc0b-9801a79c2f33: waiting for task termination
2019-03-14 17:30:20,622 [anka_config_types.py:537]: creating state file image name 48205cdcbee24d75964de35163be8286.ank
2019-03-14 17:30:20,622 [anka_config_types.py:542]: image path: /Users/boris/Library/Application Support/Veertu/Anka/state_lib/48205cdcbee24d75964de35163be8286.ank
2019-03-14 17:30:20,622 [vm.py:697]: f55f5f00-3d05-11e9-bc0b-9801a79c2f33: writing to disk
2019-03-14 17:30:20,623 [process_lock.py:33]: lock_wait /Users/boris/Library/Application Support/Veertu/Anka/vm_lib/.locks/list_lock: locking
2019-03-14 17:30:20,623 [process_lock.py:40]: lock_wait /Users/boris/Library/Application Support/Veertu/Anka/vm_lib/.locks/list_lock: locked
2019-03-14 17:30:20,634 [process_lock.py:43]: free_lock /Users/boris/Library/Application Support/Veertu/Anka/vm_lib/.locks/list_lock
2019-03-14 17:30:20,635 [vm.py:219]: deleting communication socket
2019-03-14 17:30:20,636 [process_lock.py:43]: free_lock /Users/boris/Library/Application Support/Veertu/Anka/vm_lib/.locks/f55f5f00-3d05-11e9-bc0b-9801a79c2f33
It looks as if Buildkite decides prematurely that the suspend command has finished. Can you help me figure out what could be causing this?
Update: I used a trap cleanup TERM instead of the pre-exit hook, and the same thing happened.
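For anyone trying to reproduce this, a trap-based cleanup of the kind mentioned in the update can be sketched roughly as below. This is a simplified illustration under assumptions, not the actual hook: suspend_vm is a hypothetical placeholder that only sleeps where the real anka suspend request would go, the script sends TERM to itself to simulate the cancel, and bash is assumed.

```shell
#!/bin/bash
# Hypothetical sketch: suspend_vm and CLEANUP_DONE are placeholder names,
# not part of the anka plugin or of Buildkite.

CLEANUP_DONE=0

# Stand-in for the slow suspend. The real hook would issue the anka suspend
# request here, which can take a while to finish.
suspend_vm() {
  sleep 1
}

# Handler run when the shell receives TERM.
cleanup() {
  suspend_vm
  CLEANUP_DONE=1
}

trap cleanup TERM

# Run the long job in the background and wait on it. Bash defers trap
# handlers while a foreground command is running, but the wait builtin
# returns as soon as a trapped signal arrives.
sleep 30 &
job_pid=$!

# Simulate the cancel: send TERM to this shell after one second.
( sleep 1; kill -TERM $$ ) &

wait "$job_pid"

# Stop the simulated job now that the handler has run.
kill "$job_pid" 2>/dev/null

echo "cleanup finished: $CLEANUP_DONE"
```

Run on its own, the handler completes before the script continues. If whatever launched the script stops waiting before cleanup returns, the suspend would be cut off partway, which would be consistent with the truncated log above, though the logs alone do not confirm that.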