I am trying to load the text file into Python, but due to the spaces in the string with multiple words, it is considering each word as a separate column. What's wrong, and how can I fix this?
Data:
Name 2000–12 2012–13 2013–14 2012 2012 2012 2012 2012 2012 2012
Costa Rica 4.7 3.4 4.3 15.9 15.1 –5.3 –3.5 .. 4.5 49.4
Côte d’Ivoire 1.2 8.7 8.2 .. .. 2.0 –3.1 .. 1.3 39.0
Croatia 2.1 .. .. 18.9 9.3 –0.3 –4.7 .. 3.4 80.7
Cuba 5.8 .. .. .. .. .. .. .. .. ..
Curaçao .. .. .. .. .. .. .. .. .. ..
Cyprusb 2.6c .. .. 8.8c 3.9c –6.9 –6.3 113.3 2.4 ..
Czech Republic 3.3 .. .. 21.0 5.1 –2.4 –4.4 38.3 3.3 77.3
Denmark 0.6 .. .. 23.6 15.7 5.9 –2.0 50.6 2.4 74.6
Djibouti 3.5 .. .. .. .. .. .. .. 3.7 ..
Dominica 3.2 1.1 1.7 10.8 .. –11.5 –11.9 .. 1.4 97.4
Dominican Republic 5.6 2.5 3.9 9.2 .. –6.8 –2.9 .. 3.7 34.3
Ecuador 4.4 4.0 4.1 26.9 6.1 –0.2 .. .. 5.1 31.6
Egypt, Arab Rep. 4.9 1.8 2.3 13.0 0.0 –2.7 –10.6 .. 7.1 74.1
What I have tried so far:
q3=pd.read_csv("E:\DRISTIA\Question2.txt",skiprows=2,encoding='unicode_escape',header=0,engine='python',skipinitialspace=True,delim_whitespace=True)
print(q3)
Name 2000–12 2012–13 2013–14 2012 2012.1 2012.2 2012.3 2012.4 \
Costa Rica 4.7 3.4 4.3 15.9 15.1 –5.3 –3.5 ..
Côte d’Ivoire 1.2 8.7 8.2 .. .. 2.0 –3.1 ..
Croatia 2.1 .. .. 18.9 9.3 –0.3 –4.7 .. 3.4
Cuba 5.8 .. .. .. .. .. .. .. ..
Curaçao .. .. .. .. .. .. .. .. ..
Cyprusb 2.6c .. .. 8.8c 3.9c –6.9 –6.3 113.3 2.4
Czech Republic 3.3 .. .. 21.0 5.1 –2.4 –4.4 38.3
Denmark 0.6 .. .. 23.6 15.7 5.9 –2.0 50.6 2.4
Djibouti 3.5 .. .. .. .. .. .. .. 3.7
Dominica 3.2 1.1 1.7 10.8 .. –11.5 –11.9 .. 1.4
Dominican Republic 5.6 2.5 3.9 9.2 .. –6.8 –2.9 ..
Notice how e.g. the second word in "Costa Rica" and "Côte d'Ivoire" etc are pushed into the first data column. I would like for those to stay in the Name column.
Required output:
Name 2000–12 2012–13 2013–14 2012 2012 2012 2012 2012 2012 2012
Costa Rica 4.7 3.4 4.3 15.9 15.1 –5.3 –3.5 .. 4.5 49.4
i.e. all the strings of Country name should fall under Name column. All the other columns are regular and do not contain any whitespace.
The Pandas read_csv
function accepts a custom delimiter sep
which can be a regular expression. Now the task is to articulate a regex which only matches the last n-1 spaces in a line, where n is the number of columns in the file.
q3=pd.read_csv(
"Question2.txt", engine='python', skiprows=2, encoding='unicode_escape',
sep=r'\s+(?!\S+(?:\s+\S+){10})')
print(q3)
The regular expression matches whitespace (\s+
) but only if it is not followed (?!...)
by ten or more whitespace-separated columns.
The sample data you provided doesn't seem to exactly match what your code expects, but with a couple of empty lines added at the top, I get
Name 2000–12 2012–13 ... 2012.4 2012.5 2012.6
0 Costa Rica 4.7 3.4 ... .. 4.5 49.4
1 Côte d’Ivoire 1.2 8.7 ... .. 1.3 39.0
2 Croatia 2.1 .. ... .. 3.4 80.7
3 Cuba 5.8 .. ... .. .. ..
4 Curaçao .. .. ... .. .. ..
5 Cyprusb 2.6c .. ... 113.3 2.4 ..
6 Czech Republic 3.3 .. ... 38.3 3.3 77.3
7 Denmark 0.6 .. ... 50.6 2.4 74.6
8 Djibouti 3.5 .. ... .. 3.7 ..
9 Dominica 3.2 1.1 ... .. 1.4 97.4
10 Dominican Republic 5.6 2.5 ... .. 3.7 34.3
11 Ecuador 4.4 4.0 ... .. 5.1 31.6
12 Egypt, Arab Rep. 4.9 1.8 ... .. 7.1 74.1
[13 rows x 11 columns]
(Notice the Unicode mojibake, because of the slightly weird encoding
keyword argument.)
Probably as your very first task save the result in a less inane format, probably as proper CSV (comma-delimited, with quoting around any field which contains literal commas; but Pandas to_csv()
takes care of all of this for you).
As an aside, hardcoding the file name with a full path is probably going to make your script less useful. Maybe take out the path, like I have done above, and run the script in the directory where you have the input file, or (less usefully, but sometimes more practically) put in a relative path to a subdirectory, and run it from the corresponding parent directory.