Search code examples
pythoncsvpandasfixed-length-record

Pre-process data file before pandas read_csv


I work with data output from SAP, but it is neither CSV, as it does not quote strings containing its delimiter, nor fixed width as it has multi-byte chars. It is kind of a "fixed width" character-wise.

To get it into pandas I currently read the file, get the delimiters position, slice every line around the delimiters and then save it to a proper CSV which I can read without trouble.

I see that pandas read_csv can get a file buffer. How would I pass my stream straight to it, without saving a csv file? Should I make a generator? Can I get csv.writer.writerow output without giving it a file handle?

Here is my code:

import pandas as pd

caminho= r'C:\Users\user\Documents\SAP\Tests\\'
arquivo = "ExpComp_01.txt"
tipo_dado = {"KEY_GUID":"object", "DEL_IND":"object", "HDR_GUID":"object", , "PRICE":"object", "LEADTIME":"int16", "MANUFACTURER":"object", "LOAD_TIME":"object", "APPR_TIME":"object", "SEND_TIME":"object", "DESCRIPTION":"object"} 

def desmembra(linha, limites):
    # This functions receives each delimiter's index and cuts around it
    posicao=limites[0]    
    for limite in limites[1:]:
        yield linha[posicao+1:limite]
        posicao=limite

def pre_processa(arquivo):
    import csv
    import os
    # Translates SAP output in standard CSV
    with open(arquivo,"r", encoding="mbcs") as entrada, open(arquivo[:-3] +
    "csv", "w", newline="", encoding="mbcs") as saida:
        escreve=csv.writer(saida,csv.QUOTE_MINIMAL, delimiter=";").writerow
        for line in entrada:
            # Find heading
            if line[0]=="|":
                delimitadores = [x for x, v in enumerate(line) if v == '|']
                if line[-2] != "|": 
                    delimitadores.append(None)
                cabecalho_teste=line[:50]
                escreve([campo.strip() for campo in desmembra(line,delimitadores)])
                break
        for line in entrada:
            if line[0]=="|" and line[:50]!=cabecalho_teste:
                escreve([campo.strip() for campo in desmembra(line, delimitadores)])

pre_processa(caminho+arquivo)       
dados = pd.read_csv(caminho + arquivo[:-3] + "csv", sep=";",
                    header=0, encoding="mbcs", dtype=tipo_dado)

Also, if you could share best practices: I have odd datetime strings as this 20.120.813.132.432 which I can successfully convert using

dados["SEND_TIME"]=pd.to_datetime(dados["SEND_TIME"], format="%Y%m%d%H%M%S")
dados["SEND_TIME"].replace(regex=False,inplace=True,to_replace=r'.',value=r'')

I can't write a parser for it because I have dates stored in different string formats. Would it be faster to specify a converter to do it during import or have pandas do it column-wise in the end? I have a similar issue with a code 99999999 that I have to add dots to 99.999.999. I do not know if I should write a converter or wait until after the import to do a df.replace

EDIT -- Sample data:


|                        KEY_GUID|DEL_IND|                        HDR_GUID|Prod_CD |DESCRIPTION                      |      PRICE|LEADTIME|MANUFACTURER|          LOAD_TIME|APPR_TIME     |          SEND_TIME|
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|000427507E64FB29E2006281548EB186|       |4C1AD7E25DC50D61E10000000A19FF83|75123636|Vneráéíoaeot.sadot.m             |     29,55 |30      |            |20.120.813.132.432 |20120813132929|20.120.505.010.157 |
|000527507E64FB29E2006281548EB186|       |4C1AD7E25DC50D61E10000000A19FF83|75123643|Tnerasodaeot|sadot.m             |    122,91 |30      |            |20.120.813.132.432 |20120813132929|20.120.505.010.141 |
|0005DB50112F9E69E10000000A1D2028|       |384BB350BF56315DE20062700D627978|75123676|Dnerasodáeot.sadot.m             |252.446,99 |3       |POLAND      |20.121.226.175.640 |20121226183608|20.121.222.000.015 |
|000627507E64FB29E2006281548EB186|       |4C1AD7E25DC50D61E10000000A19FF83|75123652|Pner|sodaeot.sadot.m             |    657,49 |30      |            |20.120.813.132.432 |20120813132929|20.120.505.010.128 |
|000727507E64FB29E2006281548EB186|       |4C1AD7E25DC50D61E10000000A19FF83|        |Rnerasodaeot.sadot.m             |    523,63 |30      |            |20.120.813.132.432 |20120813132929|20.120.707.010.119 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|                        KEY_GUID|DEL_IND|                        HDR_GUID|Prod_CD |DESCRIPTION                      |      PRICE|LEADTIME|MANUFACTURER|          LOAD_TIME|APPR_TIME     |          SEND_TIME|
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------   |000827507E64FB29E2006281548EB186|       |4C1AD7E25DC50D61E10000000A19FF83|75123603|Inerasodéeot.sadot.m             |  2.073,63 |30      |            |20.120.813.132.432 |20120813132929|20.120.505.010.127 |
|000927507E64FB29E2006281548EB186|       |4C1AD7E25DC50D61E10000000A19FF83|75123662|Ane|asodaeot.sadot.m             |      0,22 |30      |            |20.120.813.132.432 |20120813132929|20.120.505.010.135 |
|000A27507E64FB29E2006281548EB186|       |4C1AD7E25DC50D61E10000000A19FF83|75123626|Pneraíodaeot.sadot.m             |    300,75 |30      |            |20.120.813.132.432 |20120813132929|20.120.505.010.140 |
|000B27507E64FB29E2006281548EB186|       |4C1AD7E25DC50D61E10000000A19FF83|        |Aneraéodaeot.sadot.m             |      1,19 |30      |            |20.120.813.132.432 |20120813132929|20.120.505.010.131 |
|000C27507E64FB29E2006281548EB186|       |4C1AD7E25DC50D61E10000000A19FF83|75123613|Cnerasodaeot.sadot.m             |     30,90 |30      |            |20.120.813.132.432 |20120813132929|20.120.505.010.144 |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I would be dealing with other tables with other fields. All in this general form. I can only trust the separators in the heading. Also I may have repeated headings in the data. It looks like a matricial printout.


Solution

  • If you want to build a DataFrame without first writing to a CSV, then you don't need pd.read_csv. While it is possible to use io.BytesIO or cString.StringIO to write to an in-memory file-like object, it doesn't make sense to convert an iterable of values (like desmembra(line, delimitadores)) to a single string just to re-parse it with pd.read_csv.

    Instead, it is more direct to use pd.DataFrame, since pd.DataFrame can accept an iterator of row data.

    Operating on values one-by-one using plain Python is usually not the fastest way to go. Generally, using Pandas functions on whole columns is faster. Therefore, I would parse arquivo into a DataFrame of strings first, and then use Pandas functions to post-process the columns into the correct dtype and values.


    import pandas as pd
    import os
    import csv
    import io
    
    caminho = r'C:\Users\u5en\Documents\SAP\Testes\\'
    arquivo = os.path.join(caminho, "ExpComp_01.txt")
    arquivo_csv = os.path.splitext(arquivo)[0] + '.csv'
    
    def desmembra(linha, limites):
        # This functions receives each delimiter's index and cuts around it
        return [linha[limites[i]+1:limites[i+1]].strip()
                for i in range(len(limites[:-1]))]
    
    def pre_processa(arquivo, enc):
        # Translates SAP output into an iterator of lists of strings
        with io.open(arquivo, "r", encoding=enc) as entrada:
            for line in entrada:
                # Find heading
                if line[0] == "|":
                    delimitadores = [x for x, v in enumerate(line) if v == '|']
                    if line[-2] != "|": 
                        delimitadores.append(None)
                    cabecalho_teste = line[:50]
                    yield desmembra(line, delimitadores)
                    break
            for line in entrada:
                if line[0] == "|" and line[:50] != cabecalho_teste:
                    yield desmembra(line, delimitadores)                
    
    def post_process(dados):
        dados['LEADTIME'] = dados['LEADTIME'].astype('int16')
        for col in ('SEND_TIME', 'LOAD_TIME', 'PRICE'):
            dados[col] = dados[col].str.replace(r'.', '')
        for col in ('SEND_TIME', 'LOAD_TIME', 'APPR_TIME'):
            dados[col] = pd.to_datetime(dados[col], format="%Y%m%d%H%M%S")
        return dados
    
    enc = 'mbcs'  
    saida = pre_processa(arquivo, enc)
    header = next(saida)
    dados = pd.DataFrame(saida, columns=header)
    dados = post_process(dados)
    print(dados)
    

    yields

                               KEY_GUID DEL_IND                          HDR_GUID  \
    0  000427507E64FB29E2006281548EB186          4C1AD7E25DC50D61E10000000A19FF83   
    1  000527507E64FB29E2006281548EB186          4C1AD7E25DC50D61E10000000A19FF83   
    2  0005DB50112F9E69E10000000A1D2028          384BB350BF56315DE20062700D627978   
    3  000627507E64FB29E2006281548EB186          4C1AD7E25DC50D61E10000000A19FF83   
    4  000727507E64FB29E2006281548EB186          4C1AD7E25DC50D61E10000000A19FF83   
    5  000927507E64FB29E2006281548EB186          4C1AD7E25DC50D61E10000000A19FF83   
    6  000A27507E64FB29E2006281548EB186          4C1AD7E25DC50D61E10000000A19FF83   
    7  000B27507E64FB29E2006281548EB186          4C1AD7E25DC50D61E10000000A19FF83   
    8  000C27507E64FB29E2006281548EB186          4C1AD7E25DC50D61E10000000A19FF83   
    
        Prod_CD           DESCRIPTION      PRICE  LEADTIME MANUFACTURER  \
    0  75123636  Vneráéíoaeot.sadot.m      29,55        30                
    1  75123643  Tnerasodaeot|sadot.m     122,91        30                
    2  75123676  Dnerasodáeot.sadot.m  252446,99         3       POLAND   
    3  75123652  Pner|sodaeot.sadot.m     657,49        30                
    4            Rnerasodaeot.sadot.m     523,63        30                
    5  75123662  Ane|asodaeot.sadot.m       0,22        30                
    6  75123626  Pneraíodaeot.sadot.m     300,75        30                
    7            Aneraéodaeot.sadot.m       1,19        30                
    8  75123613  Cnerasodaeot.sadot.m      30,90        30                
    
                LOAD_TIME           APPR_TIME           SEND_TIME  
    0 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:57  
    1 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:41  
    2 2012-12-26 17:56:40 2012-12-26 18:36:08 2012-12-22 00:00:15  
    3 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:28  
    4 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-07-07 01:01:19  
    5 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:35  
    6 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:40  
    7 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:31  
    8 2012-08-13 13:24:32 2012-08-13 13:29:29 2012-05-05 01:01:44