Tags: python, csv, export-to-excel, text-mining, xlrd

Huge text file to small Excel files


I have a huge text file (4 GB), where each "line" is of the syntax: [number] [number]_[number] [Text].
For example

123 12_14 Text 1
1234 13_456 Text 2
33 12_12 Text 3
24 678_10 Text 4

My goal is to have this data saved as an Excel file, where each "line" in the text file
becomes a row in the Excel file. Following the example above:

[A1] 123
[B1] 12_14
[C1] Text 1
[A2] 1234
[B2] 13_456
[C2] Text 2
[A3] 33
[B3] 12_12
[C3] Text 3
[A4] 24
[B4] 678_10
[C4] Text 4

My plan is to iterate over the text "lines", as advised here, split each "line",
and save the pieces to cells in an Excel file.
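
For instance, splitting one such line might look like this (a rough sketch; it assumes the first two fields never contain spaces and only the [Text] part does):

with open("posts.txt", encoding="utf-8") as f:   # "posts.txt" is a placeholder name
    for line in f:
        line = line.rstrip("\n")
        if not line:
            continue
        # split into at most three parts: number, number_number, free text
        post_id, pair, text = line.split(" ", 2)
        # post_id -> "123", pair -> "12_14", text -> "Text 1"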

Because of the file size, I thought of creating many small Excel files which together would hold all of the data from the text file.

Then I need to analyze the small Excel files, mainly to find terms that were mentioned in the [Text] cells and count the number of appearances, related to the [number] cells (representing a post and the ID of a post).

Finally, I need to sum all this data in an Excel file.
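
As an illustration of the counting step, a rough sketch could look like this (the term list is made up):

from collections import Counter

terms = ["term1", "term2"]                       # made-up terms to look for
counts = Counter()                               # (post_id, term) -> appearances

with open("posts.txt", encoding="utf-8") as f:   # placeholder file name
    for line in f:
        post_id, pair, text = line.rstrip("\n").split(" ", 2)
        for term in terms:
            counts[(post_id, term)] += text.count(term)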

I'm considering the best way to create and analyze the Excel files.
As mentioned here, the main libraries are xlrd and csv.
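
If I go the many-small-files route, writing the rows out in chunks with the csv module could look roughly like this (the chunk size and file names are arbitrary):

import csv

CHUNK = 100_000                                  # arbitrary number of rows per output file

def write_chunk(rows, index):
    # each output file gets columns A, B, C, like the Excel layout above
    with open(f"chunk_{index}.csv", "w", newline="", encoding="utf-8") as out:
        csv.writer(out).writerows(rows)

rows, index = [], 0
with open("posts.txt", encoding="utf-8") as f:   # placeholder input name
    for line in f:
        rows.append(line.rstrip("\n").split(" ", 2))
        if len(rows) == CHUNK:
            write_chunk(rows, index)
            rows, index = [], index + 1
if rows:
    write_chunk(rows, index)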


Solution

  • "I'm pretty sure I don't have other options than small excel files, but what will be the another approach?"

    Your huge text file is a type of database, although an inconvenient one. A bunch of small Excel files is another, even less convenient, representation of the same database. I assume you are looking to make a bunch of small files because Excel has an upper limit on how many rows a sheet can contain (65,536 or 1,048,576, depending on the version of Excel). However, as has been noted, Excel files are truly horrible database stores.

    Since you are already using Python, use the sqlite3 module: it's already built in, it's a real database, and it can handle more than a million rows. And it's fast.
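
    A minimal sketch of loading the parsed lines into SQLite (the database, table, and column names here are just placeholders):

    import sqlite3

    conn = sqlite3.connect("posts.db")          # placeholder database file
    conn.execute("create table if not exists posts"
                 " (id text, pair text, body text)")

    def rows(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    # id, number_number, free text
                    yield line.rstrip("\n").split(" ", 2)

    # executemany consumes the generator lazily, so the 4 GB file
    # is streamed rather than read into memory at once
    conn.executemany("insert into posts values (?, ?, ?)", rows("posts.txt"))
    conn.commit()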

    But I wanted to get an idea of how fast it would be with data on the scale you propose, so I created a 30-million-row database of roughly the same complexity as your dataset. The schema is simple:

    create table words
        (id integer primary key autoincrement not null,
         first text, second text, third text);
    

    and populated it with random trigrams drawn from /usr/dict/words (I have a module for generating test data like this), which makes entries that look like:

    sqlite> select * from words limit 5;
    1|salvation|prorates|bird
    2|fore|embellishment|empathized
    3|scalier|trinity|graze
    4|exes|archways|interrelationships
    5|voguish|collating|partying
    

    but a simple query for a row I knew was near the end took longer than I'd hoped:

    select * from words where first == "dole" and second == "licked";
    29599672|dole|licked|hates
    

    took about 7 seconds on a pretty average 3-year-old desktop, so I added a couple of indexes:

    create index first on words (first);
    create index second on words (second);
    

    which did double the size of the database file from 1.1 GB to 2.3 GB, but brought the simple query time down to a rather reasonable 0.006 seconds. I don't think you'll do as well with Excel.

    So parse your data however you must, but then put it in a real database.
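
    And once the parsed rows are in SQLite, the term counting described in the question becomes a query plus a short loop rather than a pile of Excel files. A rough sketch, reusing the placeholder posts table from the loading sketch above (the term is made up):

    import sqlite3

    conn = sqlite3.connect("posts.db")           # placeholder database file
    term = "Text"                                # made-up term to count
    for post_id, body in conn.execute(
            "select id, body from posts where body like ?", (f"%{term}%",)):
        print(post_id, body.count(term))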