Tags: linux, azure, azure-web-app-service, azure-container-service, azure-web-app-for-containers

Unable to delete files in shared filesystem


During a deployment of a Linux App for Containers today, the app started failing and never came up. Investigating the logs in Kudu, I could see the application was failing to run because during the installation of dependencies, the program would crash trying to delete a file.

Deleting the files manually fails as well:

/home/site/wwwroot>ls -la libs/lxml
total 6868
drwxrwxrwx 2 nobody nogroup    4096 Oct 28 01:13 .
drwxrwxrwx 2 nobody nogroup   16384 Oct 28 01:23 ..
-rwxrwxrwx 1 nobody nogroup  304689 Oct 27 20:09 _elementpath.cpython-36m-x86_64-linux-gnu.so
-rwxrwxrwx 1 nobody nogroup 6704624 Oct 27 20:09 etree.cpython-36m-x86_64-linux-gnu.so
/home/site/wwwroot>rm -Rf libs
rm: cannot remove 'libs/lxml': Directory not empty
rm: cannot remove 'libs/newrelic/core': Directory not empty
rm: cannot remove 'libs/newrelic/packages/wrapt': Directory not empty

/home/site/wwwroot>rm -R libs
rm: cannot remove 'libs/lxml/etree.cpython-36m-x86_64-linux-gnu.so': No such file or directory
rm: cannot remove 'libs/lxml/_elementpath.cpython-36m-x86_64-linux-gnu.so': No such file or directory
rm: cannot remove 'libs/newrelic/core/_thread_utilization.cpython-36m-x86_64-linux-gnu.so': No such file or directory
rm: cannot remove 'libs/newrelic/packages/wrapt/_wrappers.cpython-36m-x86_64-linux-gnu.so': No such file or directory

I've stopped the app, but the files still cannot be deleted.

Short of deleting and recreating the app, what options do I have to get the app running again?

Edit: I tried using rm -rf instead as suggested, but since -r and -R are the same option, there's no difference:

/home/site/wwwroot>ls -la libs
total 16
drwxrwxrwx 2 nobody nogroup 16384 Oct 28 01:23 .
drwxrwxrwx 2 nobody nogroup     0 Sep 10 03:51 ..
drwxrwxrwx 2 nobody nogroup     0 Oct 28 01:13 lxml
drwxrwxrwx 2 nobody nogroup     0 Oct 28 01:13 newrelic
/home/site/wwwroot>rm -rf libs
rm: cannot remove 'libs/lxml': Directory not empty
rm: cannot remove 'libs/newrelic/core': Directory not empty
rm: cannot remove 'libs/newrelic/packages/wrapt': Directory not empty

/home/site/wwwroot>rm -rf libs
rm: cannot remove 'libs/lxml': Directory not empty
rm: cannot remove 'libs/newrelic/core': Directory not empty
rm: cannot remove 'libs/newrelic/packages/wrapt': Directory not empty

I can't use the SSH option because I'm using python:3 as the container (no Azure customization).

At one point I did try (on this app) a container customized for Azure, the source for which is here. All that container does is add a step that starts an SSH service during app startup, so it seems unlikely to be implicated in the current failure.

Edit: I've updated the app to use the jaraco/python-azure container (and fixed a bug in that container). I was able to SSH to the app container for a short time and tried installing lsof, but the SSH connection dropped before that command completed. I suspect the docker container exited because of its inability to delete the files.

I've since been unable to reconnect via SSH, as I'm getting internal server errors from the webssh endpoint:

internal server error in webssh

I tried using a different Startup File for the container, init_container.sh bash -c "sleep 300", so that it would stay up for 5 minutes while I connected over SSH. Even then I couldn't SSH to it and received only 503 errors from the webssh endpoint, although the diagnostic console shows the docker image being started with the appropriate commands.

I also tried updating the Startup File to init_container.sh rm -rf /home/site/wwwroot/libs/*, but using the Diagnostic Console, I see the same error is occurring in the app container:

2017-10-31 02:36:40.629 INFO - Issuing docker pull: imagename =jaraco/python-azure:latest
2017-10-31 02:36:40.668 INFO - Issuing docker pull: imagename =jaraco/python-azure:latest 
2017-10-31 02:36:40.709 INFO - Issuing docker pull jaraco/python-azure:latest 
2017-10-31 02:36:41.835 INFO - docker pull returned STDOUT>> latest: Pulling from jaraco/python-azure
Digest: sha256:589b1150b8b5893662a9dc7d0919e577cb2a95fcb0524fd1fffd7e5d8122b261
Status: Image is up to date for jaraco/python-azure:latest 
2017-10-31 02:36:41.855 INFO - Starting container for site 
2017-10-31 02:36:41.856 INFO - docker run -d -p 28374:80 --name APPNAME-dev_0 -e PORT=80 -e WEBSITE_SITE_NAME=APPNAME-dev -e WEBSITE_AUTH_ENABLED=False -e WEBSITE_ROLE_INSTANCE_ID=0 -e WEBSITE_INSTANCE_ID=110c23d861dcaa09836ed00f278d29dc4b913a207c2d9dd4ed54366e3c2f6a3a -e HTTP_LOGGING_ENABLED=1 jaraco/python-azure:latest init_container.sh rm -rf /home/site/wwwroot/libs/*

2017-10-31 02:36:47.946 INFO - Container logs 
2017-10-31T02:36:42.675769119Z Starting OpenBSD Secure Shell server: sshd. 
2017-10-31T02:36:44.736417871Z rm: cannot remove ‘/home/site/wwwroot/libs/lxml’: Directory not empty
2017-10-31T02:36:45.596986651Z rm: cannot remove ‘/home/site/wwwroot/libs/newrelic/core’: Directory not empty
2017-10-31T02:36:45.649171980Z rm: cannot remove ‘/home/site/wwwroot/libs/newrelic/packages/wrapt’: Directory not empty
2017-10-31 02:36:47.947 ERROR - Container APPNAME-dev_0 for site APPNAME-dev has exited, failing site start

I'm losing hope. Any other options?

Edit: Changing the App Service Plan from S1 to S2, making a request to the service (to trigger a move), and then switching the app back to S1 cleared up the problem, but only temporarily. When traffic to the service resumed later in the week, it worked for a short while and then started failing again with Service Unavailable. Inspecting the logs, the same error was back. During startup, the application attempts to delete those files, but because they are apparently in use, the deletion and the subsequent startup steps fail. Worse, changing the App Service Plan, which seemed to correct the issue last week, is not a sufficient workaround this time. Moreover, resizing the App Service Plan has unintended side effects even when it works, such as taking other apps in that service plan offline.

I suspect that some implementation detail about the shared file system (mounted at /home) causes open files to be locked and thus unable to be deleted by the deployment process or another instance startup or manually.
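Since lsof could not be installed, one way to test this suspicion from inside the container is to read /proc directly. The sketch below is only an illustration under assumptions: the function name `open_files_under` is hypothetical, it needs a Linux /proc, and it sees only processes in the current container. If the handle is held by another instance or by the storage host, nothing will show up here.

```python
import os


def open_files_under(directory):
    """Return {pid: set(paths)} for files below `directory` that a process
    has open as a file descriptor or memory-mapped (e.g. a loaded .so)."""
    if not os.path.isdir("/proc"):
        return {}  # no procfs available (not Linux)
    prefix = os.path.realpath(directory) + os.sep
    found = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        paths = set()
        # Open file descriptors are symlinks in /proc/<pid>/fd.
        fd_dir = os.path.join("/proc", pid, "fd")
        try:
            for fd in os.listdir(fd_dir):
                try:
                    paths.add(os.readlink(os.path.join(fd_dir, fd)))
                except OSError:
                    continue  # descriptor closed while we were looking
        except OSError:
            pass  # process exited or permission denied
        # Shared libraries are memory-mapped; they are listed in /proc/<pid>/maps.
        try:
            with open(os.path.join("/proc", pid, "maps")) as maps:
                for line in maps:
                    fields = line.split(None, 5)
                    if len(fields) == 6:
                        paths.add(fields[5].strip())
        except OSError:
            pass
        matches = {p for p in paths if p.startswith(prefix)}
        if matches:
            found[int(pid)] = matches
    return found


if __name__ == "__main__":
    # Path taken from the question; adjust as needed.
    for pid, paths in open_files_under("/home/site/wwwroot/libs").items():
        print(pid, sorted(paths))
```

An empty result does not rule out a lock; it only shows that no process visible in this container holds the files.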

I'm pretty sure my only option is not to use the shared file system for any files that might be held open by the app (such as shared libraries).
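As a rough sketch of that idea, the startup code could copy the dependencies from the shared file system to container-local storage and import them from there, so that any shared libraries the app holds open live outside /home. The helper name `use_local_copy` and the choice of a temporary directory are assumptions for illustration, not something Azure prescribes.

```python
import shutil
import sys
import tempfile
from pathlib import Path


def use_local_copy(shared_libs, local_root=None):
    """Copy `shared_libs` to container-local storage and import from the copy.

    `local_root` is a hypothetical location; a fresh temporary directory
    is used by default.
    """
    local_root = Path(local_root or tempfile.mkdtemp(prefix="libs-"))
    local_libs = local_root / "libs"
    # Start from a clean copy so stale files from an earlier run do not linger.
    shutil.rmtree(str(local_libs), ignore_errors=True)
    shutil.copytree(str(shared_libs), str(local_libs))
    # Put the local copy first so imports resolve there, not on the shared mount.
    sys.path.insert(0, str(local_libs))
    return local_libs


if __name__ == "__main__":
    # Path taken from the question; adjust as needed.
    print(use_local_copy("/home/site/wwwroot/libs"))
```

Installing the dependencies straight into a local directory at startup would avoid the shared file system altogether, at the cost of a slower start.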

Edit: In an attempt to minimally replicate the issue, I've created this web app and deployed it here. It is currently running fine. I expect after leaving it idle for some time, it will be flushed and a subsequent request will trigger it to run again and it will fail. I'll report back if it's effective or not.

Edit: I've been unsuccessful in replicating the issue in a new webapp. I've tried leaving the app idle for 24 hours to see if that would trigger the issue. I've also tried explicitly downgrading the 'newrelic' dependency (which contains one of the .so shared libraries), and starting and stopping the webapp to trigger the 'run' script again. But no matter what I do, the app starts up fine. I'm now thinking I should just wipe and rebuild my failing production app and see if the problem goes away.


Solution

  • This appears to be a design limitation of Azure Web Apps. Any file in the shared file system that the application holds open (even just for reading) cannot be written to or deleted. The only option is to re-engineer the app to store such files somewhere other than the shared file system.

    I suspect this issue is exacerbated by the shared file system being hosted on Windows. On a Unix system, a file can typically be removed even while another process has it open. So for users of Web Apps for Containers, it's an extra surprise that such files cannot be deleted, and thus they simply linger without an error.
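    For comparison, the usual Unix behaviour can be checked with a few lines of Python. This is a minimal sketch meant to be run on a local POSIX file system; the file name is made up, and on the shared /home mount the removal would be expected to misbehave as shown in the question.

```python
import os
import tempfile


def unlink_while_open():
    """Delete a file while a handle to it is still open."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "example.so")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"payload")

    reader = open(path, "rb")  # simulate the app holding the file open
    try:
        os.remove(path)  # succeeds on a local POSIX file system
        gone = not os.path.exists(path)
        data = reader.read()  # the open handle can still read the contents
    finally:
        reader.close()
    os.rmdir(workdir)  # the directory is empty, so this succeeds too
    return gone, data


if __name__ == "__main__":
    print(unlink_while_open())
```

    On a local Linux file system the directory entry disappears immediately while the open handle keeps working, which is why a deployment script written for Unix does not expect the delete to fail.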