I need to scrape a data table that is aligned with spaces. This is not an HTML table, but I'm having a hard time getting it right. The table looks like:
2017-10-28 @Westmont 100 Cal Lutheran 76
2017-10-30 @Arizona Chr 94 E New Mexico 87
2017-10-31 @Walsh 91 Mt Union 80
2017-10-31 @Card Stritch 71 Maranatha Bap 42
2017-11-01 @WV Tech 82 Glenville St 80
...
2018-03-31 Villanova 95 Kansas 79 P NCAA Tournament San Antonio, TX
2018-03-31 Michigan 69 Loyola-Chicago 57 P NCAA Tournament San Antonio, TX
2018-04-02 Villanova 79 Michigan 62 P NCAA I Championship San Antonio, TX
Because it is plain text, I pasted it into a text document and used read.table, but I was losing almost half the lines, and I have no idea why. I figured out how to extract the data I wanted from the lines that made it in, so I'm looking for either of two solutions:
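Here is a Python script that should do the job. Basically, you can use your favorite programming language with a simple regex: split each line wherever there are two or more whitespace characters in a row, or a single whitespace character followed by @.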
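import csv
import re

# Split on runs of two or more whitespace characters, or on a space followed by "@".
regex = re.compile(r"\s\s+|\s@")

with open('data.txt', 'r') as inputFile:
    with open('cleanedUp.csv', 'w', newline='') as outputFile:
        # csv.writer quotes fields that contain commas, e.g. "San Antonio, TX".
        writer = csv.writer(outputFile)
        for line in inputFile:
            writer.writerow(regex.split(line.strip()))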
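To see what the split actually produces, here is a minimal sketch that runs the same pattern on two of the sample rows. The column spacing in `sample_lines` is an assumption (the exact alignment of the original file is not shown above), and `split_row` is a hypothetical helper name used only for illustration.

```python
import re

# Same pattern as in the script: two or more whitespace characters, or a
# single whitespace character followed by the "@" marker.
FIELD_SEP = re.compile(r"\s\s+|\s@")

def split_row(line):
    """Split one space-aligned row into its fields (hypothetical helper)."""
    return FIELD_SEP.split(line.strip())

# Assumed spacing: every column is separated by at least two spaces,
# except the date and an "@"-prefixed team, which are one space apart.
sample_lines = [
    "2017-10-28 @Westmont        100  Cal Lutheran     76",
    "2018-03-31  Villanova        95  Kansas           79  P  NCAA Tournament  San Antonio, TX",
]

rows = [split_row(line) for line in sample_lines]
for row in rows:
    print(row)
```

Two things to keep in mind with this approach: the split consumes the `@` marker, so the output no longer records which rows had it, and any two columns separated by only a single space in the real file will stay merged in one field.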