I'm trying to analyse a dataset (200mb, ~3mil lines) with Pandas, but this dataset contains one column filled by random users where all characters are allowed. This creates the problem that the delimiter character (\t) is also often found in the last column. In GVIM I tried searching and replacing all '\t's with ':'s, but this also replaces the actual delimiters (which are not shown in GVIM). Does anyone have tips as to how to specifically replace a character in a certain column in a large dataset?
E.g. the file I have uses tab (\t) as delimiter for its columns. However, there's a row with the following info:
137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT Oakland,\ CA
Which messes up my Pandas notebook, because it expects 8 fields but gets 9 due to the tab between \ and CA. The 8th column is the only place where its actually possible to fill in your own tabs. So how do I change the "\t"s in ONLY the eighth column?
Thanks in advance!
Note that I have zero experience with Python scripting which I presume could solve it, but a faster approach than learning Python would be nice.
I think you can first create DataFrame by read_csv
with sep='₩'
and then use str.split
with parameter n=7
, because you need split by first 7
separators only:
import pandas as pd
import io
temp=u"""137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT Oakland,\ CA
137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT Oakland,\ CA
137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT Oakland,\ CA"""
#after testing replace io.StringIO(temp) to Location
df = pd.read_csv(io.StringIO(temp), sep="₩", header=None, names=['data'], encoding='Latin-1')
with pd.option_context('display.max_colwidth', -1):
print df
data
0 137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT Oakland,\ CA
1 137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT Oakland,\ CA
2 137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT Oakland,\ CA
print df.data.str.split('\t',n=7).apply(pd.Series)
0 1 2 3 4 5 6 \
0 137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT
1 137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT
2 137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT
3 137179101 genericname 431 499 19568 0 26 Apr 2010 01:45:05 GMT
7
0 Oakland,\CA
1 Oakland,\CA
2 Oakland,\\tCA
3 Oakland