I'm trying to download some web pages with pywebcopy. I use this library because it clones the page exactly as it is, but it also tries to download every file the page references. As a result it sometimes gets stuck on one file and, I guess, goes into an infinite loop (I never waited more than 10 minutes). By that point it has already downloaded what I actually want, which is the complete web page. So I want to stop the process once the file I need has been downloaded and move on to the next web page in a loop.
I would do it with a while loop, but the folder structure is deeply nested, and since the folders don't exist before the library downloads them, I couldn't make the check work with os.path.
The folder structure is like this:
main_folder
├───subfolder1
│   └───some_folder1
│       └───some_folder2
│               some_image.png
│
└───subfolder2
    └───sub_subfolder1
        └───sub_subfolder2
            └───sub_subfolder3
                └───sub_subfolder4
                    └───sub_subfolder5
                        │   index.html
                        │   some.pwc
                        │
                        └───amp
                                the_file_I_want.pwc
The file I need is always in the amp folder, so basically I have to find that folder and check whether the file is there. However, the names of sub_subfolder3, sub_subfolder4 and sub_subfolder5 change for each web page, so I have to search with a wildcard, something like "main_folder/subfolder2/**/amp/*.pwc". But the folders don't exist before the download starts.
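Just to be clear about the check I have in mind: once the folders exist, I assume it would look roughly like this (I believe glob needs recursive=True for '**' to span several directory levels):

import glob
import os

pattern = 'main_folder/subfolder2/**/amp/*.pwc'

# recursive=True lets '**' match any number of nested sub_subfolders;
# while the folders don't exist yet, this simply returns an empty list.
matches = glob.glob(pattern, recursive=True)
file_is_there = any(os.path.isfile(m) for m in matches)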
What I want to do is something like this:
from pywebcopy import save_webpage
import glob
...
pattern = 'main_folder/subfolder2/**/amp/*.pwc'
while glob.glob(pattern).is_file() = False:
    save_webpage(url, download_folder, **kwargs)
It's invalid syntax, but this is exactly what I want. I've searched but couldn't come up with any solution. Any help would be highly appreciated.
Try this (note the not, since you want to keep downloading while the file is missing, and recursive=True, which glob needs for '**' to match the nested folders):
while not any(os.path.isfile(p) for p in glob.iglob(pattern, recursive=True)):
    save_webpage(url, download_folder, **kwargs)
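That only re-runs the download, though; if save_webpage itself hangs, the loop never gets to check the condition again. One way around that is to run the download in a separate process and terminate it as soon as the file appears. A rough sketch, where url, download_folder and kwargs are placeholders and save_webpage is called the same way as in your snippet:

import glob
import os
import time
from multiprocessing import Process

from pywebcopy import save_webpage

pattern = 'main_folder/subfolder2/**/amp/*.pwc'

def wanted_file_exists():
    # recursive=True so '**' matches the variable sub_subfolder levels
    return any(os.path.isfile(p) for p in glob.iglob(pattern, recursive=True))

if __name__ == '__main__':
    url = 'https://example.com/some-page'   # placeholder
    download_folder = 'main_folder'         # placeholder
    kwargs = {}                             # whatever pywebcopy options you normally pass

    # Run the download in its own process so it can be stopped even if it never returns.
    proc = Process(target=save_webpage, args=(url, download_folder), kwargs=kwargs)
    proc.start()
    while proc.is_alive() and not wanted_file_exists():
        time.sleep(1)
    proc.terminate()   # stop the download once the wanted .pwc file is on disk
    proc.join()

For several pages you would wrap the download-and-check part in a for loop over your URLs.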