Search code examples
pythonpandasparsingexport-to-csv

Python code to convert a bad txt file to csv file


I am trying to convert a bad text file with space delimited values to a clean csv file. Please guide me.

Below is my Data. Data hasn't been properly matched to the output columns in csv.

HP TRA ID        CL ID              IN/EId      No    Loop  Element Name                    Freq  STATUS         Error Severity  Error ID         Message                                                                                                                                                                                                                                                                                                       Report Source

13ZI       20712800032                                                             1     Denied         Error           HP_DOSOlderTh  Date of service is older than 12 months                                                                                                                                                                                                                                                                       HP           
13ZI       20712800032                 1                                           1     Rejected      Error           CA16            Rejected at  level. DupKeyID:0 is a Rejected of DupKeyID:0 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.                                                                                                                        HP           
13ZI       20712800032                 2                                           1     Rejected      Error           CA16            Rejected at  level. DupKeyID:1 is a Rejected of DupKeyID:1 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.                                                                                                                        HP           
13ZI       20712800032                 3                                           1     Rejected      Error           CA16            Rejected at  level. DupKeyID:2 is a Rejected of DupKeyID:2 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.                                                                                                                        HP           
1P8TY0J25       20712805263                                                             1     Denied         Error           HP_DOSOlderTh  Date of service is older than 12 months 

I have tried the below code but no luck.

df = pd.read_csv('file1.txt', sep='\t', index_col=False, dtype='object') 
df.to_csv(r'Report.csv', index = None) 

Also below line. Data hasn't been properly matched to the columns in csv

df = pd.read_csv("file1.txt", sep=r"\s{2,}", engine="python")
df.to_csv(r'Report.csv', index = None) 

I am expecting the output like this

output


Solution

  • Yes, that's a bad txt file. I wouldn't expect any built-in function to be able to handle it. You might be able to write some custom code to convert it to a proper CSV file. E.g., here's some code that works (very specifically) on the sample input, and you might be able to modify to work on the real file. But it's also possible that the file is basically ambiguous, and you'll have to hand-tweak any conversion result using your knowledge of what the data means.

    import sys, re, pprint
    
    table = []
    for (i, line) in enumerate(open('input.txt').readlines()):
        if line[-1] == '\n': line = line[:-1]
    
        if i == 0:
            # header
            # In the header, we can assume that no 'cell' would be empty,
            # so we can just split on runs of 2 or more spaces.
            row = re.split(r' {2,}', line)
            table.append(row)
            continue
    
        if i == 1:
            assert line == ''
            # blank line between header and data
            continue
    
        # For all other lines, we have to look at runs of 2+ spaces
        # and decide what they 'mean'.
    
        def replfunc(mo):
            L = len(mo.group(0))
    
            # Some 'Message' values say:
            # "Rejected at  level. DupKeyID..."
            # i.e., there's a run of 2 spaces *within* a cell value.
            # Deal with this particular case.
            if L == 2:
                (start, end) = mo.span()
                if (
                    line[:start].endswith('Rejected at')
                    and
                    line[end:].startswith('level.')
                ):
                    # Replace it with a single space.
                    return ' '
    
            # Otherwise, this run of spaces is equivalent to
            # one or more field-separators.
            # We'll replace it with tabs and then split on tabs.
    
            if L < 2:
                assert 0
            elif 2 <= L <= 12:
                return '\t'
            elif L == 17:
                return '\t\t'
            elif L == 43:
                return '\t\t\t'
            elif L == 61:
                return '\t\t\t\t\t'
            elif L == 120:
                return '\t'
            elif L == 263:
                return '\t'
            else:
                return f'<{L}>'
    
        tabbed_line = re.sub(r'\s{2,}', replfunc, line)
        row = tabbed_line.split('\t')
        table.append(row)
    
    # ---------------------------
    
    # The rest is just to display the resulting table nicely.
    
    max_n_fields = max(
        len(row)
        for row in table
    )
    
    field_widths = []
    for j in range(max_n_fields):
        field_width = max(
            len(row[j])
            for row in table
            if j < len(row)
        )
        field_widths.append(field_width)
    
    for (i, row) in enumerate(table):
        for (field, field_width) in zip(row, field_widths):
            print(field.ljust(field_width), end='|')
        print()
    

    Here's the output:

    HP TRA ID|CL ID      |IN/EId|No|Loop|Element Name|Freq|STATUS  |Error Severity|Error ID     |Message                                                                                                                                          |Report Source|
    13ZI     |20712800032|      |  |    |            |1   |Denied  |Error         |HP_DOSOlderTh|Date of service is older than 12 months                                                                                                          |HP           ||
    13ZI     |20712800032|      |1 |    |            |1   |Rejected|Error         |CA16         |Rejected at level. DupKeyID:0 is a Rejected of DupKeyID:0 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP           ||
    13ZI     |20712800032|      |2 |    |            |1   |Rejected|Error         |CA16         |Rejected at level. DupKeyID:1 is a Rejected of DupKeyID:1 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP           ||
    13ZI     |20712800032|      |3 |    |            |1   |Rejected|Error         |CA16         |Rejected at level. DupKeyID:2 is a Rejected of DupKeyID:2 from EncounterID:15C7XE9GV00 Claim ID:P_20712800032ALPHA_1649845496_19961109508100_716.|HP           ||
    1P8TY0J25|20712805263|      |  |    |            |1   |Denied  |Error         |HP_DOSOlderTh|Date of service is older than 12 months                                                                                                          |