Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't use canceled context to send KILL signal to healthcheck process#43739

Merged
merged 1 commit into from
Aug 23, 2022

Conversation

ndeloof
Copy link
Contributor

@ndeloof ndeloof commented Jun 23, 2022

- What I did
send KILL signal to healthcheck process when the context is canceled (timeout)

- How I did it
Use a background context to send KILL signal

- How to verify it
see reproduction scenario on #43737

- Description for the changelog
fix healthcheck process termination on timeout

- A picture of a cute animal (not mandatory but encouraged)
image

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@thaJeztah
Copy link
Member

do we want a test for this? Or too complicated to set one up?

@thaJeztah thaJeztah added this to the 22.06.0 milestone Jun 23, 2022
@thaJeztah
Copy link
Member

thaJeztah commented Jun 23, 2022

I don't think the failure was related, so I kicked CI again, but posting here in case there's a flaky test at hand that we're not tracking yet;

=== RUN   TestDockerSwarmSuite/TestAPISwarmServicesMultipleAgents
    docker_api_swarm_service_test.go:100: [d7f46e9e54ee1] joining swarm manager [d451e1c460ace]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2478
    docker_api_swarm_service_test.go:101: [d285f6cd0aebd] joining swarm manager [d451e1c460ace]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2479
    docker_api_swarm_service_test.go:122: timeout hit after 30s: output: "4adc199e5240\n13ecfed699f5\n", output: "5547cb8d033b\n97e5710958e7\nb9aa8d4c5a86\n3440eeecbb60\n8fabde17fa91\n"
    poll.go:125: assertion failed: error is not nil: exit status 1
    check_test.go:611: [d7f46e9e54ee1] daemon is not started
    --- FAIL: TestDockerSwarmSuite/TestAPISwarmServicesMultipleAgents (60.05s)

edit: already known as flaky; #23626

cpuguy83
cpuguy83 previously requested changes Jun 23, 2022
integration/container/health_test.go Outdated Show resolved Hide resolved
@tianon tianon requested a review from corhere June 25, 2022 00:03
@ndeloof ndeloof force-pushed the healthcheck_timeout branch 9 times, most recently from 0100d64 to 25a6430 Compare June 27, 2022 15:12
integration/container/testdata/health/go.mod Outdated Show resolved Hide resolved
integration/container/testdata/health/main.go Outdated Show resolved Hide resolved
integration/container/health_test.go Outdated Show resolved Hide resolved
@thaJeztah
Copy link
Member

Something failing;

=== RUN   TestHealthCheckProcessKilled
    health_test.go:111: timeout hit after 10s: waiting for a new container healthcheck
--- FAIL: TestHealthCheckProcessKilled (11.40s)

daemon/health.go Outdated Show resolved Hide resolved
@ndeloof ndeloof force-pushed the healthcheck_timeout branch 2 times, most recently from f937722 to 250d005 Compare July 4, 2022 13:19
corhere
corhere previously requested changes Jul 4, 2022
daemon/exec.go Outdated Show resolved Hide resolved
daemon/exec.go Outdated Show resolved Hide resolved
daemon/exec.go Outdated Show resolved Hide resolved
@ndeloof ndeloof force-pushed the healthcheck_timeout branch 2 times, most recently from 5d3979e to 291dcc4 Compare July 5, 2022 13:07
@thaJeztah
Copy link
Member

This is weird; it's as if Windows is running an older commit before the quote was fixed;

=== FAIL: github.com/docker/docker/integration/container TestHealthCheckProcessKilled (unknown)
    health_test.go:111: polling check failed: expected "Health check exceeded timeout (50ms)", got "60': line 1: syntax error: unterminated quoted string\n"
coverage: [no statements]
panic: test timed out after 1h0m0s

At least, I don't see anything wrong looking at this 🤔;

Test:     []string{"CMD-SHELL", `sh -c 'sleep 60'`},

@thaJeztah
Copy link
Member

I guess CMD-SHELL already runs with a shell, so we can probably omit the sh -c 🤔

@thaJeztah
Copy link
Member

thaJeztah commented Aug 19, 2022

failure on Windows

github.com/docker/docker/integration/container TestHealthCheckProcessKilled (16.06s)
    main_test.go:32: assertion failed: error is not nil: Error response from daemon: Could not kill running container 8147d85d217100444d4d50c351608743d6448fd8f9b200cc1a9491a4d83a1b32, cannot remove - tried to kill container, but did not receive an exit event: failed to remove 8147d85d217100444d4d50c351608743d6448fd8f9b200cc1a9491a4d83a1b32

Comment on lines 113 to 114
err := apiClient.ContainerKill(ctx, cID, "SIGKILL")
assert.NilError(t, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try if this helps with the cleanup step on Windows 🤞

@thaJeztah
Copy link
Member

#43739 (comment)

Let's try if this helps with the cleanup step on Windows 🤞

Unfortunately; that didn't help, only shifting / repeating the problem;

=== FAIL: github.com/docker/docker/integration/container TestHealthCheckProcessKilled (27.95s)
    health_test.go:114: assertion failed: error is not nil: Error response from daemon: Cannot kill container: d5e1f84f4dbb83aac2623d972592089575ed8a3917e9d8ddaa487b423938c461: tried to kill container, but did not receive an exit event
    main_test.go:32: assertion failed: error is not nil: Error response from daemon: Could not kill running container d5e1f84f4dbb83aac2623d972592089575ed8a3917e9d8ddaa487b423938c461, cannot remove - tried to kill container, but did not receive an exit event: failed to remove d5e1f84f4dbb83aac2623d972592089575ed8a3917e9d8ddaa487b423938c461

So it looks like on Windows, the health check causes an issue with stopping the container (?). Of course the question is; was that already the case (but not tested before?), or is there an issue in the implementation here? (Perhaps an issue with signal handling / incorrect signals for Windows, idk)


cID := container.Run(ctx, t, apiClient, func(c *container.TestContainerConfig) {
c.Config.Healthcheck = &containertypes.HealthConfig{
Test: []string{"CMD", "sleep", "60"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other attempt; trying CMD ["sleep", "60"] instead of starting sleep in a shell.

@thaJeztah
Copy link
Member

thaJeztah commented Aug 20, 2022

Getting more interesting now; running the command without shell looks to have fixed some instances of Windows. Possibly because Windows may be running a different version of BusyBox (ISTR older versions of BusyBox didn't propagate signals, and newer versions do).

However, on Windows 2019 and Windows 2022 running with containerd, TestHealthCheckProcessKilled times out; wondering if this could be related to something like containerd/containerd#5067

=== Failed
=== FAIL: github.com/docker/docker/integration/container TestHealthCheckProcessKilled (unknown)
coverage: [no statements]
panic: test timed out after 1h0m0s

goroutine 199 [running]:
testing.(*M).startAlarm.func1()
	C:/hostedtoolcache/windows/go/1.18.5/x64/src/testing/testing.go:2029 +0x8e
created by time.goFunc
	C:/hostedtoolcache/windows/go/1.18.5/x64/src/time/sleep.go:176 +0x32
	...


cID := container.Run(ctx, t, apiClient, func(c *container.TestContainerConfig) {
c.Config.Healthcheck = &containertypes.HealthConfig{
Test: []string{"CMD-SHELL", `sh -c "sleep 60"`},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grmbl. back to the original issue on Windows on this one;

health_test.go:111: polling check failed: expected "Health check exceeded timeout (50ms)", got "60\": line 0: syntax error: unterminated quoted string\n"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relevant CI logs: builtin, containerd-shim-runhcs-v1

In both cases, the CommandLine string ultimately passed into the Win32API CreateProcess is

cmd /S /C "sh -c \"sleep 60\""

After lots of research and experimentation on a Windows box... my head hurts. Here's what I know so far:

  • A Windows process' command and arguments are passed in as a single string, which is the string specified in the CommandLine passed into CreateProcess.
  • Each process parses its own command line. Not all processes use the same algorithm.
    • As a corollary,cmd.exe does not tokenize its subprocess' command lines for them.
  • cmd.exe /S /C follows some arcane rules, but the gist is that everything following the /C is taken literally as a command line, with one exception: if the first non-whitespace character following the /C is a double-quote character, the command line is modified by dropping that character and the last double-quote character in the remaining command-line.

Putting that all together:
cmd /S /C "sh -c \"sleep 60\"" is equivalent to interactively running

C:\> sh -c \"sleep 60\"

Executing that command results in CreateProcess() being called with CommandLine set to sh -c \"sleep 60\".

The Windows C Runtime parses the CommandLine sh -c \"sleep 60\" into the argv

argv[0] = sh
argv[1] = -c
argv[2] = \"sleep
argv[3] = 60\"

and busybox ash gets very confused.

The ultimate issue is that cmdProbe is incorrectly escaping the shell command line in a manner which the Windows C runtime would parse as a single argument, expecting cmd.exe /S /C to act like sh -c and consume the next argv as its argument. But cmd.exe does not unescape the argument.

containerd/containerd#5067 is very related to the above, but I don't think it has to do with the timeouts you saw. (My guess is that the timeouts have something to do with sleep.exe not existing.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall I tested various combinations; any of them would fail on at least one combination 😞

I tried running sleep as main process itself;

[]string{"CMD", "sleep",  "60"},

Which would be the equivalent of the default command we set for "sleep" containers;

func sleepCommandForDaemonPlatform() []string {
if testEnv.OSType == "windows" {
return []string{"sleep", "240"}
}

Tried running it as child process of sh;

[]string{"CMD", "sh", "-c", "sleep 60"},

Tried passing it as single string, but I think Linux wasn't a fan of that

[]string{"CMD-SHELL", `sh -c "sleep 60`},

And the original one (which (like above), and if I'm not mistaken, would run sh as child process of cmd, so that didn't feel right);

[]string{"CMD-SHELL", "sh", "-c", "sleep 60"},

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I wonder is if the Busybox image should also be updated to set the SHELL to /bin/sh; currently it doesn't, which means that (in Dockerfiles), the default shell would still be cmd or powershell (depending on the base image)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also wondering if exec.Config needs to have a CommandLine property added; it just occurred to me that we have that for the container's main process, but don't have it for exec 🤔

Args []string `json:"args,omitempty"`
// CommandLine specifies the full command line for the application to execute on Windows.
CommandLine string `json:"commandLine,omitempty" platform:"windows"`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I wonder is if the Busybox image should also be updated to set the SHELL to /bin/sh

Also wondering if exec.Config needs to have a CommandLine property added

Both excellent ideas I wholeheartedly support!

@corhere corhere force-pushed the healthcheck_timeout branch 2 times, most recently from 64c7eab to eb75fb8 Compare August 23, 2022 04:11
@thaJeztah
Copy link
Member

windows 2022 / containerd;

=== FAIL: github.com/docker/docker/integration/container TestHealthCheckProcessKilled (15.43s)
    main_test.go:32: assertion failed: error is not nil: Error response from daemon: Could not kill running container d4a736c61017160c76188113089560575c359d42db1ff5b6e2e18838002a985d, cannot remove - tried to kill container, but did not receive an exit event: failed to remove d4a736c61017160c76188113089560575c359d42db1ff5b6e2e18838002a985d

Terminating the exec process when the context is canceled has been
broken since Docker v17.11 so nobody has been able to depend upon that
behaviour in five years of releases. We are thus free from backwards-
compatibility constraints.

Co-authored-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Co-authored-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Signed-off-by: Cory Snider <csnider@mirantis.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@@ -93,6 +94,42 @@ while true; do sleep 1; done
poll.WaitOn(t, pollForHealthStatus(ctxPoll, client, id, "healthy"), poll.WithDelay(100*time.Millisecond))
}

// TestHealthCheckProcessKilled verifies that health-checks exec get killed on time-out.
func TestHealthCheckProcessKilled(t *testing.T) {
skip.If(t, testEnv.RuntimeIsWindowsContainerd(), "FIXME: Broken on Windows + containerd combination")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a skip for now.

@thaJeztah
Copy link
Member

Some unrelated failures; not sure I've seen this one before, but doesn't look related at least (kicked CI once more)

=== FAIL: github.com/docker/docker/integration-cli TestDockerCLIPsSuite/TestPsListContainersFilterID (19.48s)
    check_test.go:402: assertion failed: error is not nil: Error response from daemon: Could not kill running container a617f7b8cf70de3b92c7e740bcd129fba04c00d86457b93fb434379dc9b7cae5, cannot remove - tried to kill container, but did not receive an exit event: failed to remove a617f7b8cf70de3b92c7e740bcd129fba04c00d86457b93fb434379dc9b7cae5
    --- FAIL: TestDockerCLIPsSuite/TestPsListContainersFilterID (19.48s)

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's greeeen!

LGTM

@thaJeztah
Copy link
Member

Alright, let's get this one in

@thaJeztah thaJeztah dismissed cpuguy83’s stale review August 23, 2022 20:00

Looks like this was addressed, so dismissing this one.

@thaJeztah thaJeztah merged commit 6b7974c into moby:master Aug 23, 2022
@thaJeztah thaJeztah modified the milestones: 22.06.0, v-next Aug 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dockerd doesn't kill healthcheck processes after timeout
7 participants