Problems with Anka plugin and pre-exit


#1

Curious if you can help me understand what’s happening:

I’m using the anka buildkite plugin and my build is doing the following:

Buildkite environment variables
Setting up plugins
Preparing working directory
Setting up vendored plugins
Running plugin anka command hook
:anka: Pulling high-sierra from Anka Registry
:anka: Cloning high-sierra to high-sierra-41af84bd-ee74-42f0-a2cb-5c867d6128f8
Executing cd /data/job && ./scripts/eosio_build.sh -y in high-sierra-41af84bd-ee74-42f0-a2cb-5c867d6128f8
# Received cancellation signal, interrupting
# ↳ Command completed in 34.806948251s
🚨 Error: The command exited with status -1
Running plugin anka pre-exit hook
:anka: Suspending VM
$ anka suspend high-sierra-41af84bd-ee74-42f0-a2cb-5c867d6128f8
Exited with status -1 (agent lost)

But it doesn’t return anything and simply stops the VM locally (vs suspending it).
I can run the anka suspend command manually for a cloned VM without issue.
I can run a simple task like ls -laht && sleep 100000 and get the same problem.

Here is the pre-exit hook:

#!/bin/bash
set -ueo pipefail

DIR="$(cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd)"

. "$DIR/../lib/env.bash" # Collect job_image_name

echo "--- :anka: Suspending VM"
COMMAND="anka suspend ${job_image_name}"
if [[ "${debug_mode:-off}" =~ (on) ]]; then echo "$ ${COMMAND}" >&2; fi
eval "${COMMAND}"
echo "$job_image_name has been suspended"

Is there something special I have to do in order to support this?


#2

Hey @nathan.pierce! Is the plugin you are using https://github.com/chef/anka-buildkite-plugin? That’s not an official one, might be worth submitting an issue there?


#3

Correct, that’s the plugin. We’re using pre-exit to handle cleanup of failures. The suspend command runs just fine if we do a simple ls command and let it finish without cancelling it. It’s only when we cancel it that this weird behavior happens.

Can you clarify for me what the big differences are between a cancellation and a failure/success in buildkite? The anka suspend logs are saying that the run commands are not fully cancelled and that’s why the suspend is failing. What sort of signal is used on cancel?


#4

Anka support says that they route certain signals into the guest VM to stop what is running. Here are the signals they accept: signals = [1, 2, 3, 10, 12, 15]

Is there a way to see what signal a cancellation is sending?


#5

On cancel the most recent agent version will SIGTERM the bootstrap process, which will catch that and SIGTERM each subprocesses.

What agent version are you running?


#6

Currently on buildkite-agent version 3.9.1, build 2856

Thanks.


#7

Ok, so here is some more information about how the anka plugin and anka are working. We clone the VM, run a command (sleep) and then I submit the cancel from buildkite.

The pre-exit hook sees that and submits the anka suspend command request. The logs show it gets through most of the suspend steps but then abruptly stops:

2019-03-14 10:26:01,601 [anka_manager.py:56]: vm 0fbc1d26-4665-11e9-93b4-00e04c6834c6 is running (PID 82905)
2019-03-14 10:26:01,602 [image.py:52]: /Users/nathan.pierce/Library/Application Support/Veertu/Anka/state_lib/17c642c5-4665-11e9-90b4-00e04c6834c6.ank
2019-03-14 10:26:01,602 [image.py:177]: disk image create command: /Library/Application Support/Veertu/Anka/bin/anka_image create /Users/nathan.pierce/Library/Application Support/Veertu/Anka/state_lib/17c642c5-4665-11e9-90b4-00e04c6834c6.ank 17314086912
2019-03-14 10:26:01,613 [image.py:52]: /Users/nathan.pierce/Library/Application Support/Veertu/Anka/state_lib/17c642c5-4665-11e9-90b4-00e04c6834c6.ank
2019-03-14 10:26:01,614 [anka_vm_process_manager.py:39]: suspending vm
2019-03-14 10:26:01,614 [socket_communicator.py:49]: {"file": "/Users/nathan.pierce/Library/Application Support/Veertu/Anka/state_lib/17c642c5-4665-11e9-90b4-00e04c6834c6.ank"}
2019-03-14 10:26:01,615 [socket_communicator.py:55]: send 123 bytes
2019-03-14 10:26:01,616 [socket_communicator.py:72]: {u'status': u'OK', u'message': u'suspended'}
2019-03-14 10:26:01,616 [anka_vm_process_manager.py:41]: {u'status': u'OK', u'message': u'suspended'}
2019-03-14 10:26:01,617 [anka_manager.py:197]: com.veertu.ankahv.0fbc1d26-4665-11e9-93b4-00e04c6834c6: waiting for task termination

The above anka logs show it is waiting for the TERM that was sent to finish killing off all active processes on the host machine. It can take ~20 seconds sometimes. However, within 5 seconds buildkite throws Exited with status -1 (agent lost) instead.

If I test this locally, without buildkite, I see the suspend working just fine:

2019-03-14 17:30:20,516 [anka_manager.py:227]: com.veertu.ankahv.f55f5f00-3d05-11e9-bc0b-9801a79c2f33: waiting for task termination
2019-03-14 17:30:20,622 [anka_config_types.py:537]: creating state file image name 48205cdcbee24d75964de35163be8286.ank
2019-03-14 17:30:20,622 [anka_config_types.py:542]: image path: /Users/boris/Library/Application Support/Veertu/Anka/state_lib/48205cdcbee24d75964de35163be8286.ank
2019-03-14 17:30:20,622 [vm.py:697]: f55f5f00-3d05-11e9-bc0b-9801a79c2f33: writing to disk
2019-03-14 17:30:20,623 [process_lock.py:33]: lock_wait /Users/boris/Library/Application Support/Veertu/Anka/vm_lib/.locks/list_lock: locking
2019-03-14 17:30:20,623 [process_lock.py:40]: lock_wait /Users/boris/Library/Application Support/Veertu/Anka/vm_lib/.locks/list_lock: locked
2019-03-14 17:30:20,634 [process_lock.py:43]: free_lock /Users/boris/Library/Application Support/Veertu/Anka/vm_lib/.locks/list_lock
2019-03-14 17:30:20,635 [vm.py:219]: deleting communication socket
2019-03-14 17:30:20,636 [process_lock.py:43]: free_lock /Users/boris/Library/Application Support/Veertu/Anka/vm_lib/.locks/f55f5f00-3d05-11e9-bc0b-9801a79c2f33

It seems as if buildkite is premature with thinking the suspend command is done. Can you help me figure out what could be causing this?

Update: I used a trap cleanup TERM instead of the pre-exit and the same thing happened.


#8

What is your cancel-grace-period? What happens if you increase it?


#9

That was it. I was unaware that was an option and it works great when set to one minute.