python text strip

strip text to create list and compare 2 similar lists


I need to extract text from file names that look like this: 'foo_bar_1_10.asc.gz'. For each of these files there is a corresponding entry in a text list that looks like this: '1 10'. This list is what I want to re-create, because I need to compare all of my files against a master list to find missing files. So ultimately I need a method to compare the two lists (a diff?). Any help would be great.

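import os

newtxt = []
# Raw strings keep the backslashes literal ('\f' would otherwise be a form feed).
oldtxt = r'\foobar\master_list.txt'
wd = r'\foobar'

for file in os.listdir(wd):
    pieces = file.split('.')              # 'foo_bar_1_10.asc.gz' -> ['foo_bar_1_10', 'asc', 'gz']
    subpieces = pieces[0].split('_')      # -> ['foo', 'bar', '1', '10']
    numbers = ' '.join(subpieces[-2:])    # -> '1 10'
    newtxt.append(numbers)
    print(numbers)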

Update: I now have 2 lists with line numbers, produced by a function named nl (similar to nl in Unix). The output looks something like 1: 1 10 and 2: 1 12. I need to find the values in oldtxt that are missing from newtxt. I've tried this:

s = set(nl(newtxt))
diff = [x for x in nl(oldtxt) if x not in s]
print(diff)

What this returns is some text characters and not what I expected. Any help?


Solution

  • It sounds like you're struggling with the string parsing part. First split up the file name into pieces by calling the string .split method, splitting by a period:

    >>> file = 'foo_bar_1_10.asc.gz'
    >>> pieces = file.split('.')
    >>> pieces
    ['foo_bar_1_10', 'asc', 'gz']
    

    Then split that up into subpieces based on the _ character:

    >>> subpieces = pieces[0].split('_')
    >>> subpieces
    ['foo', 'bar', '1', '10']
    

    You can then join the last two pieces back together, separated by a space, like this:

    >>> numbers = ' '.join(subpieces[-2:])
    >>> numbers
    '1 10'
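
The three steps above, plus the comparison the question asks about, could be sketched roughly as follows. This is only a minimal sketch under assumptions: the helper names `key_from_filename` and `missing_entries` are hypothetical, the sample file names and master entries are made up for illustration, and the master list is assumed to hold one entry such as '1 10' per line.

```python
def key_from_filename(filename):
    """Return the last two underscore-separated fields, e.g. '1 10'."""
    stem = filename.split('.')[0]            # drop '.asc.gz'
    return ' '.join(stem.split('_')[-2:])    # keep the trailing two fields


def missing_entries(master_entries, filenames):
    """Return master entries that have no matching file, in master-list order."""
    found = set(key_from_filename(name) for name in filenames)
    return [entry for entry in master_entries if entry not in found]


# Hypothetical sample data standing in for os.listdir(wd) and the master list.
filenames = ['foo_bar_1_10.asc.gz', 'foo_bar_1_12.asc.gz']
master_entries = ['1 10', '1 11', '1 12']
print(missing_entries(master_entries, filenames))  # ['1 11']
```

In practice `master_entries` would probably come from opening the master list file and stripping each line. Iterating over the path string itself yields its individual characters, which may be why the attempt in the question returned text characters.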