
In Python, how do I remove unwanted items from a list when I don't know where they will appear or what the specific string will be?


The input below is made up, but its structure is identical to the data I'm working with. I need to drop 'some stuff I dont want', but I don't know at which positions it will occur in the data. I also need to put the remaining data into sublists of 7 items. The data is pulled from the text layout of a PDF, character by character, and put into the 'Input' list. I would like the code to look at the first item in the list and check whether it is an integer with fewer than 3 digits. If True, it should put that item and the next 6 into a sublist. If False, it should ignore the item and check the next one. It should keep doing this until there is no data left to check and put into a sublist.

Input:

['1','1','2', '11" Some Words symbols and numbers mixed 3-4-2#', '4', '3.00', '43.00 NC', '1','1','2', '11" Some Words symbols and numbers mixed 3-4-2#', '3.00', '3','43.00 NC', 'some stuff I dont want', '1','1','2', '11" Some Words symbols and numbers mixed 3-4-2#', '3.00', '3', '43.00 NC']

The output should look like this:

[['1','1','2', '11" Some Words symbols and numbers mixed 3-4-2#', '4', '3.00', '43.00 NC'], ['1','1','2', '11" Some Words symbols and numbers mixed 3-4-2#', '3.00', '3', '43.00 NC'], ['1','1','2', '11" Some Words symbols and numbers mixed 3-4-2#', '3.00', '3', '43.00 NC']]

I tried using a for loop and a while loop, but I can't seem to get the syntax right to put only the data I want into the sublists while leaving out the data I don't want. Is there a way to do this that I am missing?


Solution

  • Something like this might get you started:

    data = ['1','1','2', '11" Some Words symbols and numbers mixed 3-4-2#', '4', '3.00', '43.00 NC', '1','1','2', '11" Some Words symbols and numbers mixed 3-4-2#', '3.00', '3','43.00 NC', 'some stuff I dont want', '1','1','2', '11" Some Words symbols and numbers mixed 3-4-2#', '3.00', '3', '43.00 NC']
    
    all_sublists = []
    i = 0
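    while i < len(data):
        try:
            # A record starts at an integer below 100 (fewer than 3 digits)
            if int(data[i]) < 100:
                all_sublists.append(data[i:i+7])
                i += 7
            else:
                # Integer with 3+ digits: skip it, otherwise the loop never advances
                i += 1
        except ValueError:
            # Not an integer at all, e.g. the unwanted text
            i += 1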
    
    all_sublists
    

    returns

    [['1',
      '1',
      '2',
      '11" Some Words symbols and numbers mixed 3-4-2#',
      '4',
      '3.00',
      '43.00 NC'],
     ['1',
      '1',
      '2',
      '11" Some Words symbols and numbers mixed 3-4-2#',
      '3.00',
      '3',
      '43.00 NC'],
     ['1',
      '1',
      '2',
      '11" Some Words symbols and numbers mixed 3-4-2#',
      '3.00',
      '3',
      '43.00 NC']]
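
The same idea can also be wrapped in a reusable function. The following is only a sketch under a few assumptions that are not in the question: the name `group_records` and its parameters `size` and `max_digits` are made up for illustration, a record start is taken to be a string of one or two plain digits (so negative numbers are not accepted), and a trailing record with fewer than 7 items is dropped rather than returned short.

```python
def group_records(items, size=7, max_digits=2):
    """Split items into sublists of `size`, each starting at a short integer string.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    records = []
    i = 0
    while i < len(items):
        item = items[i]
        # Assumption: a record start is plain ASCII digits, at most `max_digits` long.
        starts_record = item.isascii() and item.isdigit() and len(item) <= max_digits
        # Only accept the record if a full `size` items remain in the list.
        if starts_record and i + size <= len(items):
            records.append(items[i:i + size])
            i += size
        else:
            # Unwanted text, a long number, or an incomplete tail: skip one item.
            i += 1
    return records


sample = ['junk', '1', '1', '2', 'some words', '4', '3.00', '43.00 NC', 'more junk']
print(group_records(sample))
```

Like the loop above, this cannot tell an unwanted item apart from a record start if the unwanted item itself happens to be a one- or two-digit integer, so it may be worth checking the results against the PDF when the data allows that.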