python dictionary text-files glob string-concatenation

Build dict mapping from multiple text files


I have multiple *.txt files, each containing IDs and values, and I want to build a single dictionary from them. Some IDs are repeated across files, and for those IDs I want to concatenate the values. Here is an example with two files (I actually have many files, so I think I need glob.glob). Note that all the values in a given file have the same length, so when an ID is missing from a file I can add '-' repeated len(value) times.

File 1

ID01
Hi 
ID02 
my 
ID03 
ni

File 2

ID02 
name
ID04 
meet 
ID05 
your

Desired output (note that when an ID does not appear in a file, I want to concatenate 'Na' or '-' in its place, repeated to the same len(value) as that file's values):

ID01 
Hi----
ID02 
myname
ID03 
ni----
ID04 
--meet
ID05 
--your

I just want to store the output in a dictionary. Additionally, I guess that if I print each file name as the file is opened, I would know the order in which the files are read, right?

This is what I have: (I cannot concatenate my values so far)

output={}   
list = []   
for file in glob.glob('*.txt'):        
    FI = open(file,'r') 
    for line in FI.readlines():
        if (line[0]=='I'):      #I am interested in storing only the ones that start with I, for a future analysis. I know this can be done separating key and value with '\t'. Also, I am sure the next lines (values) does not start with 'I'
            ID = line.rstrip()
            output[ID] = ''
            if ID not in list:
                list.append(ID)     
        else:
            output[ID] = output[ID] + line.rstrip()

    if seqs_name in list:
        seqs[seqs_name] += seqs[seqs_name]

    print (file)
    FI.close()


print ('This is your final list: ')
print (list) #so far, I am getting the right final list, with no repetitive ID 
print (output) #PROBLEM: the repetitive ID, is being concatenated twice the 'value' in the last file read.

Also, how do I add the '-' when an ID is not repeated? I would greatly appreciate your help.

To sum up: I cannot concatenate values when a key is repeated in another file. And if a key is not repeated, I want to add '-', so that I can later print the file names and know in which file a given ID has no value.


Solution

  • A couple of issues with your existing code:

    1. line[0] == 'I': line[0] returns a single character, so this only tests the first letter (and a comparison against a longer string such as 'ID' would always be false). Use str.startswith(xxx) instead, to check if a string begins with xxx.

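    2. You are not retrieving the text after the ID properly. The easiest way to do this is by calling next(f) on the file object, which returns the line that follows the current one. Also, output[ID] = '' resets the entry every time the ID is seen, so a value stored from an earlier file is overwritten.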

    3. You don't need a second list. Also, don't name your variable list as it shadows the builtin.


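    import collections
    import glob

    # Missing keys start out as '', so values can be appended directly.
    output = collections.defaultdict(str)
    for file in glob.glob('*.txt'):
        with open(file, 'r') as f:
            for line in f:
                if line.startswith('ID'):
                    try:
                        # The value sits on the line right after the ID.
                        text = next(f)
                        # The trailing ' ' separates values from different files;
                        # drop it to join them directly.
                        output[line.strip()] += text.strip() + ' '
                    except StopIteration:
                        # The file ended right after an ID line; nothing to add.
                        pass

    print(output)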
    

    The try-except is there for a file that ends with an ID line and no value after it: in that case next(f) raises StopIteration, which is caught and ignored.
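
The question also asks how to pad with '-' when an ID is missing from a file. One possible approach is sketched below; it is not part of the original answer. It assumes, as the question states, that all values within one file have the same length, and it reads the files in sorted name order so that the position of each file's value is predictable (glob.glob on its own returns matches in arbitrary order). The names read_pairs and build_padded_dict are made up for this illustration.

```python
import glob


def read_pairs(path):
    """Return {ID: value} for one file of alternating ID / value lines."""
    pairs = {}
    with open(path) as f:
        for line in f:
            if line.startswith("ID"):
                # The default "" covers a file that ends right after an ID line.
                pairs[line.strip()] = next(f, "").strip()
    return pairs


def build_padded_dict(pattern="*.txt", fill="-"):
    """Concatenate values per ID across files, padding where an ID is absent."""
    # Sorting fixes the file order, and therefore the order of the pieces.
    paths = sorted(glob.glob(pattern))
    per_file = [read_pairs(p) for p in paths]

    # Every ID seen in any file, in sorted order.
    all_ids = sorted(set().union(*per_file))
    output = {key: "" for key in all_ids}

    for pairs in per_file:
        # Assumption from the question: values in one file share a length,
        # so the longest value gives the padding width for that file.
        width = max((len(v) for v in pairs.values()), default=0)
        for key in all_ids:
            output[key] += pairs.get(key, fill * width)
    return output
```

Calling build_padded_dict() in the directory holding the two example files should map ID01 to 'Hi----' and ID04 to '--meet'. Because each file contributes exactly one fixed-width piece per ID, the position of a run of '-' tells you which file lacked that ID.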