Search code examples
pythonpandascsvdatasetlarge-data

Large chunk of data, trying to replace delimiter character in one column but not all delimiter characters


I'm trying to analyse a dataset (200mb, ~3mil lines) with Pandas, but this dataset contains one column filled by random users where all characters are allowed. This creates the problem that the delimiter character (\t) is also often found in the last column. In GVIM I tried searching and replacing all '\t's with ':'s, but this also replaces the actual delimiters (which are not shown in GVIM). Does anyone have tips as to how to specifically replace a character in a certain column in a large dataset?

E.g. the file I have uses tab (\t) as delimiter for its columns. However, there's a row with the following info:

137179101   genericname 431 499 19568   0   26 Apr 2010 01:45:05 GMT    Oakland,\   CA

Which messes up my Pandas notebook, because it expects 8 fields but gets 9 due to the tab between \ and CA. The 8th column is the only place where its actually possible to fill in your own tabs. So how do I change the "\t"s in ONLY the eighth column?

Thanks in advance!

Note that I have zero experience with Python scripting which I presume could solve it, but a faster approach than learning Python would be nice.


Solution

  • I think you can first create DataFrame by read_csv with sep='₩' and then use str.split with parameter n=7, because you need split by first 7 separators only:

    import pandas as pd
    import io
    
    temp=u"""137179101   genericname 431 499 19568   0   26 Apr 2010 01:45:05 GMT    Oakland,\   CA
    137179101   genericname 431 499 19568   0   26 Apr 2010 01:45:05 GMT    Oakland,\   CA
    137179101   genericname 431 499 19568   0   26 Apr 2010 01:45:05 GMT    Oakland,\   CA"""
    #after testing replace io.StringIO(temp) to Location
    df  = pd.read_csv(io.StringIO(temp), sep="₩", header=None, names=['data'], encoding='Latin-1')
    
    with pd.option_context('display.max_colwidth', -1):
        print df 
                                                                                         data
    0  137179101   genericname 431 499 19568   0   26 Apr 2010 01:45:05 GMT    Oakland,\   CA
    1  137179101   genericname 431 499 19568   0   26 Apr 2010 01:45:05 GMT    Oakland,\   CA
    2  137179101   genericname 431 499 19568   0   26 Apr 2010 01:45:05 GMT    Oakland,\   CA   
    
    print df.data.str.split('\t',n=7).apply(pd.Series)
               0            1    2    3      4  5                         6  \
    0  137179101  genericname  431  499  19568  0  26 Apr 2010 01:45:05 GMT   
    1  137179101  genericname  431  499  19568  0  26 Apr 2010 01:45:05 GMT   
    2  137179101  genericname  431  499  19568  0  26 Apr 2010 01:45:05 GMT   
    3  137179101  genericname  431  499  19568  0  26 Apr 2010 01:45:05 GMT   
    
                   7  
    0    Oakland,\CA  
    1    Oakland,\CA  
    2  Oakland,\\tCA  
    3        Oakland