
For loop not appending array with regex


The following code fetches version numbers from a URL and then, for each version number, fetches that version's page and fills an array with file names matching a specific pattern. The resulting array should contain the file names for every version, but it appears to contain only early versions (2.6). Using print statements, I can see that the code does fetch the sha256sums.asc files for all versions. I'm guessing I don't yet understand something about populating Python arrays, and that my code isn't letting the patches_full_versions array include everything from versions 2.6 to 6.9, as expected.

If the array were somehow getting reset, I'd expect it to contain only the latest version number, but the opposite is happening: it contains only the earliest. It's as if it simply stopped there, even though the code continues to fetch info for later versions.

import re

import requests

patches_versions = []
patches_full_versions = []

RT_PATCHES_BASE_URL = "https://cdn.kernel.org/pub/linux/kernel/projects/rt/"

cleaner = re.compile("<.*?>") # For removing HTML tags later.

# Create array of patches versions:
patches_page_content = requests.get(RT_PATCHES_BASE_URL)
patches_page_content.raise_for_status()
stripped_content = re.sub(cleaner, "", patches_page_content.text)
for line in stripped_content.splitlines():
    x = False
    x = re.findall(r"[0-9]", line)
    if not line == "":
        if x:
            patches_versions.append(line.split("/")[0])

patch_name_pattern = re.compile(r'patch-.*?\.tar\.xz')
for x in patches_versions:
    patch_version_page_content = requests.get(f"{RT_PATCHES_BASE_URL}{x}/sha256sums.asc")
    patch_version_page_content.raise_for_status()
    for match in re.findall(patch_name_pattern, patch_version_page_content.text):
        patches_full_versions.append(match)

Solution

  • First, to answer your question

    "but it appears to only contain early versions (2.6)"

    For example, four files are listed at this URL: https://cdn.kernel.org/pub/linux/kernel/projects/rt/5.16/sha256sums.asc

    patch-5.16.2-rt19.patch.gz
    patch-5.16.2-rt19.patch.xz
    patches-5.16.2-rt19.tar.gz
    patches-5.16.2-rt19.tar.xz

    Your regex is:

    patch_name_pattern = re.compile(r'patch-.*?\.tar\.xz')
    

    which cannot match any of these:

    patch-5.16.2-rt19.patch.gz
    patch-5.16.2-rt19.patch.xz
    patches-5.16.2-rt19.tar.gz
    patches-5.16.2-rt19.tar.xz

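    So nothing on this page is matched. The pattern needs a name that starts with patch- and ends with .tar.xz on the same line (. does not match a newline by default), but the patch- files end in .patch.gz or .patch.xz, and the .tar.xz archive starts with patches-, not patch-.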

    Here is an improved version of the code:

    import re

    import requests

    patches_versions = []
    patches_full_versions = []

    RT_PATCHES_BASE_URL = "https://cdn.kernel.org/pub/linux/kernel/projects/rt/"

    cleaner = re.compile("<.*?>")  # For removing HTML tags later.

    # Create array of patches versions:
    patches_page_content = requests.get(RT_PATCHES_BASE_URL)
    patches_page_content.raise_for_status()
    stripped_content = re.sub(cleaner, "", patches_page_content.text)
    for line in stripped_content.splitlines():
        # Keep only lines that contain a digit (the version entries).
        if re.search(r"\d", line):
            patches_versions.append(line.split("/")[0])

    # Match any file name starting with "patch" (covers "patch-" and "patches-").
    patch_name_pattern = re.compile(r'patch[\w\-\.]+')
    for x in patches_versions:
        try:
            patch_version_page_content = requests.get(f"{RT_PATCHES_BASE_URL}{x}/sha256sums.asc")
            patch_version_page_content.raise_for_status()
            for match in re.findall(patch_name_pattern, patch_version_page_content.text):
                patches_full_versions.append(match)
        except requests.RequestException as e:
            # Report the failed request and continue with the next version.
            print(repr(e))
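
To see the difference without hitting the network, the two patterns can be compared against the four file names quoted above. This is only a minimal sketch: the sample string holds just those names (the real sha256sums.asc also carries checksums and a signature), and tar_xz_pattern is a hypothetical stricter alternative, assuming only the .tar.xz archives are actually wanted.

```python
import re

# The four file names quoted above from the 5.16 sha256sums.asc page.
# Assumption: checksums and the signature block are left out for brevity.
sample = """\
patch-5.16.2-rt19.patch.gz
patch-5.16.2-rt19.patch.xz
patches-5.16.2-rt19.tar.gz
patches-5.16.2-rt19.tar.xz
"""

old_pattern = re.compile(r"patch-.*?\.tar\.xz")  # pattern from the question
new_pattern = re.compile(r"patch[\w\-\.]+")      # pattern from the answer

# Hypothetical stricter pattern (not from the original post): only the
# "patches-" archives that end in ".tar.xz".
tar_xz_pattern = re.compile(r"patches-[\w.\-]+?\.tar\.xz")

old_matches = old_pattern.findall(sample)
new_matches = new_pattern.findall(sample)
tar_xz_matches = tar_xz_pattern.findall(sample)

print(old_matches)     # nothing starts with "patch-" and ends in ".tar.xz"
print(new_matches)     # every name starting with "patch"
print(tar_xz_matches)  # only the .tar.xz archive
```

The answer's pattern collects every file name that starts with patch, so the result includes the .gz variants as well. If that is more than needed, a narrower pattern along the lines of tar_xz_pattern may be a better fit.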