Tags: python, pandas, tokenize, word2vec

Training Word2Vec Model from sourced data - Issue Tokenizing data


I have recently sourced and curated a lot of reddit data from Google BigQuery.

The dataset looks like this:

Data Preview

Before passing this data to word2vec to build a vocabulary and train the model, I need to properly tokenize the 'body_cleaned' column.

I have attempted the tokenization with both manually created functions and NLTK's word_tokenize, but for now I'll keep it focused on using word_tokenize.
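
For context, here is a minimal sketch of the input format I am aiming for, assuming the gensim implementation of Word2Vec; the example documents and model parameters below are just placeholders:

from nltk.tokenize import word_tokenize   # requires the NLTK 'punkt' tokenizer models
from gensim.models import Word2Vec

# word_tokenize turns each document into a list of string tokens, and
# Word2Vec expects an iterable of such token lists.
docs = ["the cat sat on the mat", "reddit comments are noisy text"]   # placeholder documents
tokenized = [word_tokenize(doc) for doc in docs]
# tokenized -> [['the', 'cat', 'sat', 'on', 'the', 'mat'], ...]

model = Word2Vec(sentences=tokenized, vector_size=100, min_count=1, workers=4)   # placeholder parameters (gensim 4.x)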

Because my dataset is rather large (close to 12 million rows), it is impossible for me to open and process the dataset in one go. pandas tries to load everything into RAM and, as you can imagine, it crashes, even on a system with 24GB of RAM.

I am facing the following issue:

  • When I tokenize the dataset (using NLTK's word_tokenize) as a whole, it tokenizes correctly, and word2vec accepts that input and learns/outputs actual words in its vocabulary.
  • When I tokenize the dataset by first batching the dataframe and iterating through it, the resulting token column is not what word2vec prefers; although word2vec trains on the gathered data for over 4 hours, the vocabulary it learns consists of single characters in several encodings, as well as emojis - not words (a minimal illustration of this symptom follows this list).
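
To make the second point concrete, here is a minimal, hypothetical reproduction of that kind of broken vocabulary, assuming gensim's Word2Vec; this is only an illustration of the symptom, not my actual training code:

from gensim.models import Word2Vec

# If Word2Vec receives plain strings instead of lists of tokens, iterating
# over each "sentence" yields individual characters, so the learned
# vocabulary ends up being single characters rather than whole words.
bad_input = ["this is a sentence", "another sentence"]   # strings, not token lists
model = Word2Vec(sentences=bad_input, vector_size=50, min_count=1)
print(sorted(model.wv.index_to_key))   # mostly single characters: ' ', 'a', 'e', ...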

To troubleshoot this, I created a tiny subset of my data and tried to perform the tokenization on that data in two different ways:

  • Knowing that my computer can handle performing the action on this small subset, I simply did:
reddit_subset = reddit_data[:50]

reddit_subset['tokens'] = reddit_subset['body_cleaned'].apply(lambda x: word_tokenize(x))

This produces the following result: Tokenized Data Preview

This in fact works with word2vec and produces a model one can work with. Great so far.
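
For completeness, this is roughly how such a tokens column can be handed over for training, assuming gensim's Word2Vec (a sketch; the parameters are placeholders, not my exact settings):

from gensim.models import Word2Vec

# Each entry of 'tokens' is already a Python list of word strings, which is
# exactly the iterable-of-token-lists format Word2Vec expects.
sentences = reddit_subset['tokens'].tolist()
model = Word2Vec(sentences=sentences, vector_size=100, min_count=1, workers=4)   # placeholder parameters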

Because of my inability to operate on such a large dataset in one go, I had to get creative with how I handle it. My solution was to batch the dataset and work on it in small iterations, using pandas' own chunksize argument to read_csv.

I wrote the following function to achieve that:

import time

import pandas as pd
from nltk.tokenize import word_tokenize


def reddit_data_cleaning(filepath, batchsize=20000):
    if batchsize:
        # Read the CSV lazily, in chunks of `batchsize` rows, instead of all at once
        df = pd.read_csv(filepath, encoding='utf-8', error_bad_lines=False, chunksize=batchsize, iterator=True, lineterminator='\n')
    print("Beginning the data cleaning process!")
    start_time = time.time()
    flag = 1
    chunk_num = 1
    for chunk in df:
        # Tokenize each comment in the current chunk
        chunk[u'tokens'] = chunk[u'body_cleaned'].apply(lambda x: word_tokenize(x))
        chunk_num += 1
        if flag == 1:
            chunk = chunk.dropna(how='any')
            chunk = chunk[chunk['body_cleaned'] != 'deleted']
            chunk = chunk[chunk['body_cleaned'] != 'removed']
            print("Beginning writing a new file")
            # First chunk: create the output file and write the header row
            chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='w+', index=None, header=True)
            flag = 0
        else:
            chunk = chunk.dropna(how='any')
            chunk = chunk[chunk['body_cleaned'] != 'deleted']
            chunk = chunk[chunk['body_cleaned'] != 'removed']
            print("Adding a chunk into an already existing file")
            # Subsequent chunks: append without repeating the header
            chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='a', index=None, header=None)
    end_time = time.time()
    print("Processing has been completed in: ", (end_time - start_time), " seconds.")

Although this piece of code lets me actually work through this huge dataset in chunks, where I would otherwise crash from memory failures, it produces a result which doesn't fit my word2vec requirements, and that leaves me quite baffled.

I used the above function to perform the same operation on the data subset, to compare how the results differ between the two approaches, and got the following:

Comparison of Methods

The desired result is in the 'new_tokens' column, while the function that chunks the dataframe produces the result in the 'tokens' column.

Can anyone help me understand why the same tokenization function produces a wholly different result depending on how I iterate over the dataframe?

I appreciate it if you read through the whole issue and stuck with it!


Solution

  • After taking gojomo's advice, I simplified my approach to reading the CSV file and writing to a text file.

    My initial approach using pandas had yielded some pretty bad processing times for a file with around 12 million rows, as well as memory issues due to pandas reading all of the data into memory before writing it out to a file.

    What I also realized was that I had a major flaw in my previous code. I was printing some output (as a sanity check), and because I printed output too often, I overflowed Jupyter and crashed the notebook, not allowing the underlying and most important task to complete.

    I got rid of that, simplified reading with the csv module and writing into a txt file, and I processed the reddit database of ~12 million rows in less than 10 seconds.

    Maybe not the finest piece of code, but I was scrambling to solve an issue that had been a roadblock for me for a couple of days (and not realizing that part of my problem was my sanity checks crashing Jupyter made it all the more frustrating).

    def generate_corpus_txt(csv_filepath, output_filepath):
        import csv
        import time
        start_time = time.time()
        with open(csv_filepath, encoding='utf-8') as csvfile:
            datareader = csv.reader(csvfile)
            count = 0
            # Read (and skip) the header line so it doesn't end up in the corpus
            header = next(csvfile)
            print(time.asctime(time.localtime()), " ---- Beginning Processing")
            with open(output_filepath, 'w+') as output:
                if header is not None:
                    # Iterate over each remaining row; csv.reader yields each
                    # row as a list of column values
                    for row in datareader:
                        processed_row = str(' '.join(row)) + '\n'
                        output.write(processed_row)
                        count += 1
                        # Log progress every 1,000,000 rows
                        if count == 1000000:
                            print(time.asctime(time.localtime()), " ---- Processed 1,000,000 Rows of data.")
                            count = 0
        print('Processing took:', int((time.time() - start_time) / 60), ' minutes')
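
    With the corpus now in a plain text file (one whitespace-separated line per row), it can be streamed into word2vec without loading everything into memory. A minimal sketch of that last step, assuming gensim and with illustrative file names and placeholder parameters:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    generate_corpus_txt('reddit_data.csv', 'reddit_corpus.txt')   # illustrative file names

    # LineSentence streams the corpus file line by line, splitting each line on
    # whitespace, so the ~12 million rows never have to fit in RAM at once.
    corpus = LineSentence('reddit_corpus.txt')
    model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=5, workers=4)   # placeholder parameters
    model.save('reddit_word2vec.model')   # illustrative filename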