I am trying to train a sequence-to-sequence model for machine translation. I use a publicly available .txt dataset with two columns of English-to-German phrases (one pair per line, with a tab separating the languages): http://www.manythings.org/anki/deu-eng.zip This works well. However, I run into a problem when trying to use my own dataset.
My own DataFrame looks like this:
Column 1 Column 2
0 English a German a
1 English b German b
2 English c German c
3 English d German d
4 ... ...
To use it in the same script, I am saving this DataFrame to a .txt file as follows (aiming to again get one pair per line, with a tab separating the languages):
df.to_csv("dataset.txt", index=False, sep='\t')
The problem occurs in the code for cleaning the data:
import re
import string
from pickle import dump
from unicodedata import normalize
from numpy import array

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]

# clean a list of lines
def clean_pairs(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        # print(clean_pair)
        cleaned.append(clean_pair)
    # print(cleaned)
    print(array(cleaned))
    return array(cleaned)  # something goes wrong here
# save a list of clean sentences to file
def save_clean_data(sentences, filename):
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load dataset
filename = 'data/dataset.txt'
doc = load_doc(filename)
# split into english-german pairs
pairs = to_pairs(doc)
# clean sentences
clean_pairs = clean_pairs(pairs)
# save clean pairs to file
save_clean_data(clean_pairs, 'english-german.pkl')
# spot check
for i in range(100):
    print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))
The last line throws the following error:
IndexError Traceback (most recent call last)
<ipython-input-2-052d883ebd4c> in <module>()
72 # spot check
73 for i in range(100):
---> 74 print('[%s] => [%s]' % (clean_pairs[i,0], clean_pairs[i,1]))
75
76 # load a clean dataset
IndexError: too many indices for array
One strange thing is that the output of the following line is different for the standard dataset vs my own dataset:
# Standard dataset:
return array(cleaned)
[['hi' 'hallo']
['hi' 'gru gott']
['run' 'lauf']]
# My own dataset:
return array(cleaned)
[list(['hi' 'hallo'])
list(['hi' 'gru gott'])
list(['run' 'lauf'])]
Can anyone explain what the problem is and how to solve this?
clean_pairs is a list of lists. The core Python language does not formally have a concept of multi-dimensional arrays, so the syntax you're using, clean_pairs[i,0], does not work. It should be clean_pairs[i][0].
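For example, with a plain nested list (the two pairs here are made up):

pairs = [['hi', 'hallo'], ['run', 'lauf']]
print(pairs[0][0])   # hi
# pairs[0, 0]        # TypeError: list indices must be integers or slices, not tuple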
You probably got the idea from Pandas, which uses a more sophisticated n-dimensional array data structure that supports that style of indexing.
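For contrast, here is a minimal sketch (again with made-up pairs) of when arr[i, 0] does and does not work with NumPy:

import numpy as np

# rows of equal length -> a true 2-D array, so arr[i, 0] works
arr = np.array([['hi', 'hallo'], ['run', 'lauf']])
print(arr.shape)   # (2, 2)
print(arr[0, 1])   # hallo

# rows of unequal length -> a 1-D object array holding lists, which is what
# the list([...]) entries in your output suggest happened; arr2[i, 0] then
# raises "IndexError: too many indices for array"
arr2 = np.array([['hi', 'hallo'], ['hi', 'gru', 'gott']], dtype=object)
print(arr2.shape)  # (2,)
print(arr2[0][0])  # hi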
I'm confused by your code, though. It looks like you're saving a DataFrame to a TSV file (tab-separated) and then manually parsing the TSV and performing text transformations on it. There are multiple things wrong with this approach.
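If you already have the data in a DataFrame, you could do the cleaning there directly and skip the text-file round trip entirely. Here is a rough sketch, assuming df still holds your two columns of sentences; the helper name clean_line is mine, not something from your code:

import re
import string
from unicodedata import normalize

re_print = re.compile('[^%s]' % re.escape(string.printable))
table = str.maketrans('', '', string.punctuation)

def clean_line(line):
    # the same normalization steps as your clean_pairs, applied to one string
    line = normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')
    words = [re_print.sub('', w.lower().translate(table)) for w in line.split()]
    return ' '.join(w for w in words if w.isalpha())

# apply the cleaning to every cell, then take a true 2-D array
# clean_array = df.applymap(clean_line).to_numpy()
# clean_array[i, 0] and clean_array[i, 1] then behave as you expect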
You've also got some other problems, at least in the code you posted. For example, your to_pairs function (which, again, is something you should be leaving to a library, if you need it at all) doesn't return anything.
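For reference, the missing piece there is just the return:

def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in lines]
    return pairs   # without this, to_pairs returns None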