windows: docker system prune not reclaiming expected space #31253

Closed
pennywisdom opened this issue Feb 22, 2017 · 28 comments · Fixed by #36728
Assignees
Labels
kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. platform/windows version/1.13

Comments

@pennywisdom

Description

I am running a CI process that builds images on a Windows Server 2016 host virtual machine. A scheduled job runs every 4 hours to clear up space, but it does not seem to be reclaiming the space that Docker reports as reclaimable, and the amount reclaimed has been decreasing over time.

Currently I have a 60 GB virtual machine running Server Core, and when I run docker system prune -fa I get the following output:

TYPE            TOTAL   ACTIVE   SIZE   RECLAIMABLE
Images          0       0        0 B    0 B
Containers      0       0        0 B    0 B
Local Volumes   0       0        0 B    0 B

However, running a scan of C:\ProgramData\docker shows many gigabytes of files there, especially under windowsfilter. With a scan in progress, and Docker reporting zero images, I currently have:

C:\ProgramData\docker - 20.5Gb - 188398 files - 66838 directories
C:\ProgramData\docker\windowsfilter - 20.4Gb - 145451 files - 66752 directories

If I run docker images and docker ps -a, both are empty.

There seems to be a gradual decline in the space that is reclaimed, as if dangling images are not being detected and cleaned up.

One thing to note is that I am deleting all images to free up space; on the next builds, windowsservercore or nanoserver are pulled in as part of building the images, I am not explicitly pulling windowsservercore or nanoserver.
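A rough PowerShell sketch of this kind of comparison (assuming the default data root C:\ProgramData\docker):

# Compare what Docker reports with what is actually on disk (the scan can take a while).
docker system df

$layerFiles = Get-ChildItem 'C:\ProgramData\docker\windowsfilter' -Recurse -File -ErrorAction SilentlyContinue
'{0:N1} GB in {1} files' -f (($layerFiles | Measure-Object -Property Length -Sum).Sum / 1GB), $layerFiles.Count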

Output of docker version:

docker version
Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 08:47:51 2017
 OS/Arch:      windows/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.24)
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 08:47:51 2017
 OS/Arch:      windows/amd64
 Experimental: false

Output of docker info:

docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.13.1
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: l2bridge l2tunnel nat null overlay transparent
Swarm: inactive
Default Isolation: process
Kernel Version: 10.0 14393 (14393.693.amd64fre.rs1_release.161220-1747)
Operating System: Windows Server 2016 Standard
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 3.997 GiB
Name: winch03
ID: 6QNU:V7HZ:D7DG:RTJH:SIQ2:BKRE:WBDD:UHBB:RIVA:3RXB:PI4R:TAX2
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: -1
 Goroutines: 22
 System Time: 2017-02-22T12:04:10.5920341Z
 EventsListeners: 0
Username: xxxx
Registry: https://index.docker.io/v1/
Labels:
 hosttype=windows
 dc=ashford
 hostrole=dev
Experimental: false
Insecure Registries:
 192.168.xxx.xx:5000
 localhost:5000
 127.0.0.0/8
Live Restore Enabled: false

Running Windows Server 2016 on Hyper-V, fully patched with the latest Windows Updates.

Expected Outcome
As I am deleting all images, I would expect the amount of space reclaimed to be consistent rather than shrinking over time to the point where very little is reclaimed. It seems like dangling or orphaned images remain on disk but are not being detected.

@pennywisdom pennywisdom changed the title windows image prune not reclaiming space windows: docker system prune not reclaiming space Feb 22, 2017
@pennywisdom pennywisdom changed the title windows: docker system prune not reclaiming space windows: docker system prune not reclaiming expected space Feb 22, 2017
@pennywisdom
Author

A further discovery is that I have found over 4Gb of old docker-builder* files and directories under C:\Windows\Temp.

These will be related to building the images, but I would hope they would be cleaned up as well.

@lowenna lowenna self-assigned this Mar 10, 2017
@ekitagawa

ekitagawa commented Jun 5, 2017

I am seeing similar issues too. It seems some exited containers are skipped by docker system prune or docker ps -a.

Here is how I verified the issue.

  1. stop all running containers
  2. make sure docker ps -a returns none
  3. run docker system prune and confirm with “y”
  4. go to C:\ProgramData\docker\windowsfilter

On my test image (which I've been using for only several weeks) the directory has 11 leftover folders. One of them contains ...\Files\inetpub, so this seems to be left over from an exited IIS container.
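A minimal PowerShell sketch of that check (assuming the default data root C:\ProgramData\docker):

docker ps -q | ForEach-Object { docker stop $_ }                         # 1. stop all running containers
docker ps -a                                                             # 2. verify nothing relevant is listed
docker system prune -f                                                   # 3. prune, skipping the interactive "y"
(Get-ChildItem 'C:\ProgramData\docker\windowsfilter' -Directory).Count   # 4. count leftover layer folders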

Let me know if you need any other information.

Docker version 17.03.1-ee-3, build 3fcee33

@thaJeztah
Member

ping @johnstep @jstarks PTAL

@georgyturevich

georgyturevich commented Jun 8, 2017

Hello there,

We also see similar behavior.

I found that the "windowsfilter" directory contains a lot of garbage on our production server.

I made a copy of this server, stopped and removed all containers, and performed "docker system prune -a". After that, this directory still contains 4204 sub-directories totalling more than 800 GB. See the following output:

PS C:\Users\Administrator> docker images -a 
REPOSITORY TAG IMAGE ID CREATED SIZE 
PS C:\Users\Administrator> docker ps -a 
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 
PS C:\Users\Administrator> docker images 
REPOSITORY TAG IMAGE ID CREATED SIZE 
PS C:\Users\Administrator> docker info 
Containers: 0 
Running: 0 
Paused: 0 
Stopped: 0 
Images: 0 
Server Version: 17.03.1-ee-3 
Storage Driver: windowsfilter 
Windows: 
Logging Driver: json-file 
Plugins: 
Volume: local 
Network: l2bridge l2tunnel nat null overlay transparent 
Swarm: inactive 
Default Isolation: process 
Kernel Version: 10.0 14393 (14393.693.amd64fre.rs1_release.161220-1747) 
Operating System: Windows Server 2016 Datacenter 
OSType: windows 
Architecture: x86_64 
CPUs: 64 
Total Memory: 1.906 TiB 
Name: WIN-B7...skiped 
ID: 62EZ:...skipped
Docker Root Dir: E:\docker_storage_1_13 
Debug Mode (client): false 
Debug Mode (server): true 
File Descriptors: -1 
Goroutines: 23 
System Time: 2017-06-04T17:51:45.8250234Z 
EventsListeners: 0 
Username: aureadockerservice 
Registry: https://index.docker.io/v1/ 
Experimental: false 
Insecure Registries: 
127.0.0.0/8 
Live Restore Enabled: false 
PS C:\Users\Administrator> (dir E:\docker_storage_1_13\windowsfilter\).count 
4204 
PS C:\Users\Administrator> 

I assumed it could be the result of running docker rm -f ..., as sometimes plain docker rm fails with an Access denied error and our developers have to use the force flag to delete the container.

But it looks like a common issue, as Eiichi and Alex were able to reproduce it.

@friism
Contributor

friism commented Jun 8, 2017

ping @PatrickLang

@thaJeztah thaJeztah added the kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. label Jun 8, 2017
@jstarks
Contributor

jstarks commented Jun 8, 2017

@darrenstahlmsft, is this a known issue?

@darstahl
Contributor

darstahl commented Jun 8, 2017

I've taken a look at this previously, but have not found the root cause. The container and image configs (c:\programdata\docker\containers and c:\programdata\docker\image) are still on disk even though they are not visible from the API, which is why prune thinks there is nothing available to clean up.

I'm not able to find a reliable way to cause the leaks, but it seems that some failure path doesn't remove the configs from disk correctly (and they are never reloaded into the daemon, even after a restart).

docker rm -f would cause this (and would be expected, I think) as the -f flag removes tracking from Docker regardless of whether it can clean up the backing image.

This is on my backlog of things to look at when I have time.
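As a rough illustration of that mismatch, a hypothetical PowerShell check (assuming the default root C:\ProgramData\docker; not an official tool) could compare the container config directories on disk with what the API still reports:

# Full container IDs the daemon still knows about.
$known = docker ps -aq --no-trunc

# Config directories on disk whose IDs the daemon no longer reports.
Get-ChildItem 'C:\ProgramData\docker\containers' -Directory |
    Where-Object { $known -notcontains $_.Name } |
    Select-Object -ExpandProperty Name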

@thaJeztah
Member

@darrenstahlmsft would this PR (part of 17.06) help with this: #31012? (Basically, if deleting fails, do not remove the container, so that it is still visible in docker ps -a.)

@ekitagawa

While prevention is important, some guidance on manual clean up would be helpful too. We have customers running Windows containers in production already.

For example, is there any way to take a look at the folder and determine if it's safe to delete? I could write a KB article as a temp solution.

@pennywisdom
Author

@ekitagawa this tool is good -> https://github.com/jhowardmsft/docker-ci-zap
But use it with caution: it's completely destructive, though it does clean up all the space as far as I have found. It's not an ideal fit for production environments (unless you have good orchestration in place that lets you migrate all containers to a new server while you clean up), but it's perfectly fine for CI servers etc. when used correctly.
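A rough sketch of how it is typically run (stopping the Docker service first is an assumption here; the -folder flag is the one used later in this thread, and this wipes the entire data root):

# WARNING: destructive - removes ALL images, containers, and layer data under the data root.
Stop-Service docker
.\docker-ci-zap.exe -folder 'C:\ProgramData\docker'
Start-Service docker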

@friism
Contributor

friism commented Jul 11, 2017

@darrenstahlmsft have you found out more about this? Also, do you think #31012 might help address this?

@ekitagawa

@jhowardmsft or @darrenstahlmsft Any idea when you can come back to this?

@lowenna
Member

lowenna commented Jul 11, 2017

Once LCOW is done.....

@ekitagawa

OK, I will ping you again then.

@darstahl
Contributor

darstahl commented Jul 11, 2017

#31012 certainly should help in a lot of cases and make tracking this down easier, but as John said, both of our available cycles are going to be totally consumed by containerd and LCOW work. I'll continue to watch out for this as I work on other things, but don't expect to be able to dedicate the necessary time in the near future. I suspect some ref-counting issues in the graphdriver are at fault for parts of this, which others are welcome to look into in the meantime.

@friism
Contributor

friism commented Jul 27, 2017

@pennywisdom if it's an option for you, can you try 17.06.1-ee release clients to see if #31012 helps?

They're available here:

Here's approximately how to update:

$ProgressPreference = "SilentlyContinue"
Invoke-WebRequest -Uri "<url>" -OutFile "docker.zip"

Stop-Service docker
Remove-Item -Force -Recurse $env:ProgramFiles\docker
Expand-Archive -Path "docker.zip" -DestinationPath $env:ProgramFiles -Force
Remove-Item docker.zip

Start-Service docker

@pennywisdom
Author

Hi @friism, I just got around to testing docker-17.06.1-ce-rc4.zip and the results are a lot better. There are a couple of bits of feedback:

  1. The stats seem a little off, but this may be because orphaned artifacts are not being counted. E.g. when I ran docker system df before pruning, the stats showed 17.09 GB reclaimable, but after I ran docker system prune -fa, Get-Volume showed closer to 35 GB of space had been cleaned up. Not a huge problem, but slightly confusing.
  2. After running docker system prune -fa I ran docker-ci-zap.exe -folder C:\ProgramData\docker, which found and recovered ~2 GB more space. However, this deletes more than just the image data, so that could account for some of it. The reason I ran it afterwards is that it was previously the only reliable way to reclaim free space, and I wanted to see whether it was still recovering significantly more space than the system prune.

@friism
Contributor

friism commented Aug 14, 2017

Thanks for the update @pennywisdom!

@georgyturevich

@friism @darrenstahlmsft Hello Michael, Darren,
After installing 17.10 and performing docker system prune -af --volumes inside a clone of the production server, I still have 3015 sub-directories in the windowsfilter directory with about 925 GB of data.

Could we put together some separate PowerShell/Go automation for cleaning up these directories? I can test it on cloned test servers before applying it in production.

@darstahl
Contributor

I just submitted a PR that I think solves this issue. If anyone who sees this regularly could verify, that would be great. I've not leaked a single layer locally since applying my fix.

@thaJeztah
Member

Link to PR: #36728

@vasicvuk

vasicvuk commented Jun 6, 2018

The issue still exists on 18.03.0-ce on Windows 10.

@thaJeztah
Member

@vasicvuk could you open a new issue with details, and steps to reproduce/verify the bug?

@thaJeztah
Member

@vasicvuk actually, make sure you're on 18.03.1-ce, because the fix was included in the 18.03.1 patch release; see docker-archive/docker-ce#508

@Cloudmersive

This issue should be reopened, @thaJeztah, as it was never fixed, and it is a very serious issue.

@Cloudmersive

Continues to occur in 19.03.1 and is very easy to reproduce

@AndyHughes

Yes, I've got the issue too. I had 100 GB tied up in build cache and ran docker system prune; it now shows 0 GB in build cache, but no hard drive space has been freed up. My drive is 'full'. I agree with Cloudmersive, this is a very serious issue.

@mythz

mythz commented Dec 28, 2020

I've resolved this issue by manually compacting the WSL2 ext4.vhdx file:
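The exact commands are not included above; a commonly used sequence for compacting the Docker Desktop WSL2 data disk (the path below is the usual default and may differ per install) is roughly:

wsl --shutdown
# From an elevated PowerShell session with the Hyper-V module installed:
Optimize-VHD -Path "$env:LOCALAPPDATA\Docker\wsl\data\ext4.vhdx" -Mode Full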
