Complex Data Extraction in Python


I need some help starting a program. I play a couple of online poker tournaments every week. The site I use records hand histories and saves them to my hard drive as .txt files. Unfortunately, the data is in a somewhat rough format. I want to create a program that takes each hand and tells me how much I won or lost. I've pasted a sample hand below, and I want to extract the following information from each hand.

  1. The blinds and antes. If you scroll down through the sample, you can see "Player Player8 has small blind (250)" followed by "Player Player1 has big blind (500)". The antes are noted just above that for each player, e.g. "Player Hero ante (50)". So in this case, small blind = 250, big blind = 500, ante = 50.

  2. My stack size. I've denoted my player as "Hero". My stack size is on line 6 where it says "Seat 3: Hero (17595)". My stack size is 17595 in this instance.

  3. My hand. In this example, it is denoted by "Player Hero received card: [10c]; Player Hero received card: [7h]." So my hand is "10c7h"

  4. Number of Players. In the sample, there are 8 players.

  5. My position. This one might be tricky. I've decided to start with the Big Blind and assign it a value of 0. Small blind = 1, button = 2, etc. This goes against "poker logic" to some extent, but from a programming standpoint, makes more sense to me because there will always be a Big Blind, whereas some of the other positions will depend on how many players are at the table.

  6. Profit / Loss. This is near the bottom of the text under the "Summary" label. "Player Hero does not show cards.Bets: 50. Collects: 0. Loses: 50." In this instance, my profit was -50 (i.e. loss of 50), which means I paid the 50 ante and folded my hand.

Here is how the .txt file looks. Note that this is one hand. In the actual .txt files, this hand would be followed by dozens or hundreds of other hands. The first line of a hand always starts with "Game started" and the last line always starts with "Game ended".

Game started at: 2018/1/9 10:14:10
Game ID: 1094127759 250/500 $5,000 GTD, Table 4 (Hold'em)
Seat 7 is the button
Seat 1: Player1 (9650).
Seat 2: Player2 (19433).
Seat 3: Hero (17595).
Seat 4: Player4 (8900).
Seat 5: Player5 (12670).
Seat 6: Player6 (11187).
Seat 7: Player7 (11300).
Seat 8: Player8 (17603).
Player Player8 ante (50)
Player Player1 ante (50)
Player Player2 ante (50)
Player Hero ante (50)
Player Player4 ante (50)
Player Player5 ante (50)
Player Player6 ante (50)
Player Player7 ante (50)
Player Player8 has small blind (250)
Player Player1 has big blind (500)
Player Player8 received a card.
Player Player8 received a card.
Player Player1 received a card.
Player Player1 received a card.
Player Player2 received a card.
Player Player2 received a card.
Player Hero received card: [10c]
Player Hero received card: [7h]
Player Player4 received a card.
Player Player4 received a card.
Player Player5 received a card.
Player Player5 received a card.
Player Player6 received a card.
Player Player6 received a card.
Player Player7 received a card.
Player Player7 received a card.
Player Player2 folds
Player Hero folds
Player Player4 raises (1000)
Player Player5 folds
Player Player6 folds
Player Player7 folds
Player Player8 folds
Player Player1 folds
Uncalled bet (500) returned to Player4
Player Player4 mucks cards
------ Summary ------
Pot: 1650
Player Player1 does not show cards.Bets: 550. Collects: 0. Loses: 550.
Player Player2 does not show cards.Bets: 50. Collects: 0. Loses: 50.
Player Hero does not show cards.Bets: 50. Collects: 0. Loses: 50.
*Player Player4 mucks (does not show cards). Bets: 550. Collects: 1650. Wins: 1100.
Player Player5 does not show cards.Bets: 50. Collects: 0. Loses: 50.
Player Player6 does not show cards.Bets: 50. Collects: 0. Loses: 50.
Player Player7 does not show cards.Bets: 50. Collects: 0. Loses: 50.
Player Player8 does not show cards.Bets: 300. Collects: 0. Loses: 300.
Game ended at: 2018/1/9 10:14:52

Any help is appreciated. Even just some ideas on how I might go about doing this or what sort of things I should be learning. In my head, the output should look something like this:

HandNumber = 000001
BigBlind = 500
Ante = 50
Players = 8
StackSize = 17595
Hand = 10c7h
Position = 6    # small blind = 1; add 5 since I'm 5 positions removed
Profit = -50

My experience level: I've been taking online courses on Python development, data science, and SQL for about 6 months. I have some familiarity with classes, but not a ton of experience creating my own. I've designed a few programs that help with data extraction from financial statements using regular expressions.


Solution

  • This would be easiest to solve by using a regex to split the file into individual games, and then more regexes to extract the information from each game. I would make a class to hold all this information. You can then store the results in a database or as JSON.

    import datetime
    import re
    from io import StringIO

    # Lazy quantifiers and [^\n]* keep each match inside a single hand,
    # even when the file contains many hands.
    GAME_PATTERN = re.compile(
        r'^Game started at: (?P<game_start>[^\n]*)\n'
        r'(?P<game>.*?)\n'
        r'^------ Summary ------\n'
        r'(?P<summary>.*?)\n'
        r'^Game ended at: (?P<game_end>[^\n]*)$',
        flags=re.MULTILINE | re.DOTALL,
    )
    DATE_FORMAT = "%Y/%m/%d %H:%M:%S"

    def split_file(file_handle):
        """Yield one regex match object per hand in the file."""
        text = file_handle.read()
        yield from GAME_PATTERN.finditer(text)

    class Pokergame:
        def __init__(self, game_info, playername='Hero'):
            self.game_start = datetime.datetime.strptime(game_info['game_start'], DATE_FORMAT)
            self.game_end = datetime.datetime.strptime(game_info['game_end'], DATE_FORMAT)
            self.game_info = _parse_game(game_info['game'], playername)
            self.summary = _parse_summary(game_info['summary'], playername)

    def _parse_game(game_str, playername):
        name = re.escape(playername)  # the name may contain regex metacharacters
        seat, stack = None, None      # stay None if the player is not seated in this hand
        seat_match = re.search(rf'Seat (\d+): {name} \((\d+)\)\.', game_str)
        if seat_match:
            seat, stack = seat_match.groups()
        pattern_cards = rf'Player {name} received card: \[(?P<card>\w+)\]'
        cards = tuple(m['card'] for m in re.finditer(pattern_cards, game_str))

        result = {
            'seat': seat,
            'stack': stack,
            'cards': cards,
            'text': game_str,
        }
        return result

    def _parse_summary(summary_str, playername):
        # Placeholder: returns the raw summary text until real parsing is added
        return summary_str

    # hand_text is a string holding the sample hand history from the question
    games = []
    with StringIO(hand_text) as file_handle:
        for game_info in split_file(file_handle):
            games.append(Pokergame(game_info))
    

    I've used StringIO to simulate open(file). You will have to flesh out __init__ and the _parse_... functions some more, but this should set you on the right track.

    If you have multiple files, you can use itertools.chain to concatenate the games from all of them.
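
    A minimal sketch of the multi-file idea, assuming each file is UTF-8 text (the real encoding may differ). The split_file here is a simplified stand-in so the block runs on its own, and games_from_file / games_from_files are hypothetical helper names:

```python
import itertools
import re

# Simplified stand-in for split_file: one match per hand, no named groups.
HAND_PATTERN = re.compile(
    r"^Game started at:.*?^Game ended at:[^\n]*",
    flags=re.MULTILINE | re.DOTALL,
)

def split_file(file_handle):
    """Yield one regex match object per hand in an open file."""
    yield from HAND_PATTERN.finditer(file_handle.read())

def games_from_file(path):
    """Open one hand-history file and yield its hands."""
    with open(path, encoding="utf-8") as file_handle:  # encoding is an assumption
        yield from split_file(file_handle)

def games_from_files(paths):
    """Chain the per-file generators into one lazy stream of hands."""
    return itertools.chain.from_iterable(games_from_file(p) for p in paths)
```

    Because chain.from_iterable is lazy, each file is opened only when the previous one is exhausted. It could be called with something like games_from_files(sorted(pathlib.Path("hand_histories").glob("*.txt"))), where the folder name is only a placeholder.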

    games[0].game_info
    
    {'cards': ('10c', '7h'),
     'seat': '3',
     'stack': '17595',
     'text': "Game ID: 1094127759 250/500 $5,000 GTD, ...\nPlayer Player4 mucks cards"}
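
    The blinds, ante, player count and position from the question are not covered above. One possible sketch follows, using the question's convention of big blind = 0, small blind = 1, button = 2, and so on. It assumes every seated player is dealt in, that all antes are equal, and that player names match the seat lines exactly; heads-up play or a missing small blind may need special handling. parse_table_info is a hypothetical helper name.

```python
import re

def parse_table_info(game_str, playername="Hero"):
    """Pull blinds, ante, player count and position out of one hand's text."""
    # Seat lines look like "Seat 3: Hero (17595)."; sort them by seat number.
    seats = sorted(
        (int(num), name)
        for num, name in re.findall(
            r"^Seat (\d+): (.+?) \(\d+\)\.", game_str, flags=re.MULTILINE
        )
    )
    names = [name for _, name in seats]

    def amount(label):
        """Return (player, chips) for the first 'Player X <label> (N)' line."""
        match = re.search(
            rf"^Player (.+?) {label} \((\d+)\)", game_str, flags=re.MULTILINE
        )
        return (match[1], int(match[2])) if match else (None, 0)

    _, small_blind = amount("has small blind")
    big_blind_player, big_blind = amount("has big blind")
    _, ante = amount("ante")  # assumes all antes are equal, so the first is used

    position = None
    if big_blind_player in names and playername in names:
        # Count seats backwards from the big blind: BB = 0, SB = 1, button = 2, ...
        position = (names.index(big_blind_player) - names.index(playername)) % len(names)

    return {
        "small_blind": small_blind,
        "big_blind": big_blind,
        "ante": ante,
        "players": len(names),
        "position": position,
    }

# Trimmed excerpt of the sample hand from the question.
sample = """Seat 7 is the button
Seat 1: Player1 (9650).
Seat 2: Player2 (19433).
Seat 3: Hero (17595).
Seat 4: Player4 (8900).
Seat 5: Player5 (12670).
Seat 6: Player6 (11187).
Seat 7: Player7 (11300).
Seat 8: Player8 (17603).
Player Player8 ante (50)
Player Hero ante (50)
Player Player8 has small blind (250)
Player Player1 has big blind (500)"""

info = parse_table_info(sample)
print(info)
# {'small_blind': 250, 'big_blind': 500, 'ante': 50, 'players': 8, 'position': 6}
```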