
How do I extract a specific name from the filename of a Word document in a for loop (in Python)?


Below is the for loop that iterates over all the Word document files. As you can see, I already print the filename to check its value.

import os

for filename in os.listdir(root_dir):
    source_directory = os.path.join(root_dir, filename)
    # The output of filename is shown in the next section.
    print(filename)
    arr = mynotes_extractor.get_mynotes(source_directory)
    list2str = str(arr)
    c = cleanString(newstring=list2str)
    new_arr = [c]
    # Append to the output file; the with block closes it after each write.
    with open(output, 'a', encoding='utf-8') as text_file:
        for item in new_arr:
            text_file.write("%s\n" % item)

Below is the output from printing filename:

12345_Cat_A_My Notes.docx
6789_Cat_B_My Notes.docx
54321_Cat_A_My Notes.docx
12234_Cat_C_My Notes.docx
86075_Cat_D_My Notes.docx
34324_Cat_E_My Notes.docx

I would like to extract only one part of each Word document's filename, "My Notes", inside the for loop shown above.

For instance:
         Filename before extraction: 34324_Cat_E_My Notes.docx
         Result after extraction: My Notes

Solution

  • This can be written tidily in one line, although that can be confusing when you are starting out.

    filename.split('.')[0].split('_')[-1]
    

    output: 'My Notes'

    Detailed explanation below:

    filename = '12345_Cat_A_My Notes.docx'
    

    .split('.') splits the string at every period

    >>>['12345_Cat_A_My Notes', 'docx']
    

    [0] takes the first element of the list

    >>>'12345_Cat_A_My Notes'
    

    .split('_') splits this string at each underscore, returning

    >>>['12345', 'Cat', 'A', 'My Notes']
    

    [-1] finally takes the last item in the list, returning

    >>>'My Notes'
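
The one-liner above assumes the filename contains exactly one period. A slightly more defensive variant can be sketched as follows. It is only a sketch: extract_name is a hypothetical helper name that does not appear in the original answer, and it assumes the wanted name is always the text after the last underscore. It uses os.path.splitext to drop only the final extension and str.rsplit to split once at the last underscore.

```python
import os


def extract_name(filename):
    """Return the text after the last underscore, without the file extension.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # os.path.splitext strips only the final extension, so any other
    # periods in the name are left alone.
    stem, _extension = os.path.splitext(filename)
    # rsplit with maxsplit=1 splits once, from the right, at the last underscore.
    return stem.rsplit('_', 1)[-1]


# Sample filenames taken from the question's printed output.
filenames = [
    '12345_Cat_A_My Notes.docx',
    '6789_Cat_B_My Notes.docx',
    '34324_Cat_E_My Notes.docx',
]

for filename in filenames:
    print(extract_name(filename))
```

Inside the loop from the question, the same call would be extract_name(filename). If a filename has no underscore at all, the whole name without its extension is returned.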