python, python-2.7, dataset, poker

Creating a large dataset in python line-by-line


For my graduate thesis I need to create a dataset of poker actions to test models with. I wrote a function that reads a text file with information about the hand and returns a list, which I append to a pandas data frame.

I have about 1500 files, and each of them contains 1500~3000 hands that need to be passed to this function, so my main script looks something like this:

import os
os.chdir("C:/Users/jctda/OneDrive/Documentos/TCC/Programa")

import pandas as pd
from datagen import DataGenerator, EmptyLine
from poker.room.pokerstars import PokerStarsHandHistory
from functions import FindFold, GetFiles, GetShowers
#IMPORT DATAGEN HERE

database = pd.DataFrame()

files = GetFiles('hand_texts')
for hand_text in files:
    text=open('hand_texts/' + hand_text)
    b=text.read()
    hands=b.split("\n\n\n\n\n")
    text.close()

    for i in range(1,len(hands)):

        try:

            hh = PokerStarsHandHistory(unicode(hands[i]))
            hh.parse()
            fold = FindFold(hh)

            if fold == 'showdown':
                for shower in GetShowers(hh):
                    database = database.append(DataGenerator(hh,shower,hand_text,i))
                    print('Success in parsing iteration ' + str(i) + ' from file' + hand_text)

        except:

            print('PARSER ERROR ON ITERATION [[' + str(i) + ']] FROM FILE [[' + hand_text + ']]')
            database = database.append(EmptyLine(hand_text,i))




database.to_csv('database2.csv') 

The problem is that after a few hours of running it becomes very slow. The first file takes about 20 seconds, but each file takes longer than the last, and after 8 hours of running each one takes more than an hour. I just started learning Python for this project, so I'm probably making a big mistake somewhere that causes it to take much longer than needed, but I can't find it.

Another thing that's been bugging me is that it consumes less than 1GB of RAM while running on a machine with 16GB. I thought about trying to force it to use more memory, but apparently there isn't a memory limit in Python, so I guess it's just bad code.

Can someone help me figure out what to do?


Solution

  • As described here, do not append to a DataFrame inside a loop, as it is very inefficient: every `append` returns a new DataFrame and copies everything accumulated so far. Instead, accumulate rows in a plain Python list and build the DataFrame once at the end, like this:

    files = GetFiles('hand_texts')
    
    database = []
    for hand_text in files:
        # as a sidenote, with contexts are helpful for these:
        with open('hand_texts/' + hand_text) as text:
            b=text.read()
    
        hands=b.split("\n\n\n\n\n")
    
        for i in range(1,len(hands)):
            try:
                hh = PokerStarsHandHistory(unicode(hands[i]))
                hh.parse()
                fold = FindFold(hh)
    
                if fold == 'showdown':
                    for shower in GetShowers(hh): 
                        database.append(DataGenerator(hh,shower,hand_text,i))
                        print('Success in parsing iteration ' + str(i) + ' from file ' + hand_text)
    
            except:
                print('PARSER ERROR ON ITERATION [[' + str(i) + ']] FROM FILE [[' + hand_text + ']]')
                database.append(EmptyLine(hand_text,i))
    
    pd.DataFrame(database).to_csv('database2.csv')
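  • As a sidenote, the slowdown described in the question is easy to reproduce in isolation: because each `DataFrame.append` copies everything accumulated so far, total work grows quadratically with the number of rows. Here is a minimal pure-Python sketch of the same effect, with plain lists standing in for the DataFrame (the function names are just for illustration):

    ```python
    import timeit

    def rebuild_each_time(n):
        # Mimics DataFrame.append in a loop: every step copies the
        # entire result built so far, so total work is O(n^2).
        rows = []
        for i in range(n):
            rows = rows + [i]  # full copy on every iteration
        return rows

    def accumulate_then_build(n):
        # Mimics appending to a plain list and converting once at the
        # end: list.append is amortized O(1), so total work is O(n).
        rows = []
        for i in range(n):
            rows.append(i)
        return rows

    slow = timeit.timeit(lambda: rebuild_each_time(5000), number=3)
    fast = timeit.timeit(lambda: accumulate_then_build(5000), number=3)
    print('copying version slower:', slow > fast)
    ```

    The gap widens as n grows, which matches the behavior in the question: the first files finish quickly, and each subsequent one takes longer than the last.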