r, web-scraping, datatable

How to Scrape Text (not HTML) Table


I need to scrape a data table whose columns are aligned with spaces. It is not an HTML table, and I'm having a hard time getting it right. The table looks like:

2017-10-28 @Westmont                100  Cal Lutheran             76           
2017-10-30 @Arizona Chr              94  E New Mexico             87           
2017-10-31 @Walsh                    91  Mt Union                 80           
2017-10-31 @Card Stritch             71  Maranatha Bap            42           
2017-11-01 @WV Tech                  82  Glenville St             80           
...
2018-03-31  Villanova                95  Kansas                   79 P        NCAA Tournament San Antonio, TX
2018-03-31  Michigan                 69  Loyola-Chicago           57 P        NCAA Tournament San Antonio, TX
2018-04-02  Villanova                79  Michigan                 62 P        NCAA I Championship San Antonio, TX

Because it is plain text, I pasted it into a text document and used read.table, but I was losing almost half the lines, and I have no idea why. I figured out how to extract the data I wanted from the lines that made it in, so I'm looking for either of two solutions:

  • An easy way to scrape a table that looks like this (link to actual data), and get it into a dataframe (or csv).
  • Failing that, a way to read in all of the lines of my data, or an explanation of why I'm losing so many of them (I'm getting only 8,861 of 16,445 lines).

Solution

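  • Here is a Python script that should do the job. It uses a regular expression to split each line on runs of two or more whitespace characters (and on a space followed by "@"), then writes the resulting fields out as CSV. The same approach works in any programming language with regex support.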

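    import csv
    import re
    
    # Split on runs of 2+ whitespace characters, or on a space followed by "@"
    # (so the "@" marker itself is dropped from the output).
    regex = re.compile(r"\s\s+|\s@")
    
    with open('data.txt', 'r') as inputFile:
        with open('cleanedUp.csv', 'w', newline='') as outputFile:
            # csv.writer quotes fields containing commas, e.g. "San Antonio, TX"
            writer = csv.writer(outputFile)
            for line in inputFile:
                cleanedUp = regex.split(line.strip())
                writer.writerow(cleanedUp)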
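
    If you want named columns and would like to keep the "@" marker instead of discarding it, a stricter per-line pattern is one option. The following is only a sketch based on the sample rows in the question: the field names (`at1`, `team1`, `flag`, `note`, and so on) are hypothetical labels, since the source does not name the columns, and the pattern assumes that each team name is followed by whitespace and an integer score.

```python
import re

# Hypothetical field names; the source does not label the columns.
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<at1>@?)(?P<team1>.+?)\s+(?P<score1>\d+)\s+"
    r"(?P<at2>@?)(?P<team2>.+?)\s+(?P<score2>\d+)"
    r"(?:\s(?P<flag>\S+))?"        # optional short code such as "P"
    r"(?:\s{2,}(?P<note>.*))?$"    # optional free-text note at the end
)


def parse_line(line):
    """Return a dict of fields for one game line, or None if it does not match."""
    match = LINE_RE.match(line.rstrip())
    if match is None:
        return None
    row = match.groupdict()
    # Convert the "@" markers to booleans and the scores to integers.
    row["at1"] = row["at1"] == "@"
    row["at2"] = row["at2"] == "@"
    row["score1"] = int(row["score1"])
    row["score2"] = int(row["score2"])
    return row


# Sample rows copied from the question (spacing is not significant here).
sample = [
    "2017-10-28 @Westmont                100  Cal Lutheran             76",
    "2018-03-31  Villanova                95  Kansas                   79 P        NCAA Tournament San Antonio, TX",
    "...",
]

rows = [parse_line(line) for line in sample]
unmatched = [line for line, row in zip(sample, rows) if row is None]
```

    Lines that do not fit the pattern come back as `None`, so they can be counted and inspected (as in `unmatched` above) rather than silently disappearing.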