python dataframe text graph extract

How to plot a graph from a txt file that is not properly formatted?


I need to plot a graph from data in a text file, but the file is not formatted for easy extraction. Is there a way to automate extracting the data I need into lists or a dataframe, in something like a Jupyter notebook with Python, so I can plot it easily? Here's what the file currently looks like:

    ###########
    Average Hybridsort CPU time, array size 1: 110
    Average Mergesort CPU time, array size 1: 98
    
    Average Hybridsort comparisons, array size 1: 0
    Average Mergesort comparisons, array size 1: 0
    ###########
    Average Hybridsort CPU time, array size 2: 118
    Average Mergesort CPU tim, array size 2: 156
    
    Average Hybridsort comparisons, array size 2: 0
    Average Mergesort comparisons, array size 2: 1
    ###########
    Average Hybridsort CPU time, array size 3: 133
    Average Mergesort CPU time, array size 3: 175
    
    Average Hybridsort comparisons, array size 3: 1
    Average Mergesort comparison, array size 3: 4
    ###########
    Average Hybridsort CPU time, array size 4: 121
    Average Mergesort CPU time, array size 4: 170
    
    Average Hybridsort comparisons, array size 4: 2
    Average Mergesort comparisons, array size 4: 6
    ########### (and so on...)

The file follows a clear pattern, but I'm not sure how to extract only the values I need. I want to plot two graphs from it: Hybridsort CPU time and Mergesort CPU time against array size (two lines in one graph), and Hybridsort comparisons and Mergesort comparisons against array size (also two lines in one graph). I am familiar with Jupyter Notebook and Python and would prefer to plot there, but other methods are also welcome. Thanks!


Solution

  • First, do some search-and-replace passes so that all the intended labels are named consistently (some are pluralized and some are not, and one "CPU time" is misspelled).

    Then you can extract the distinct labels using set().

    Then you can parse out the values in a loop. I've put the data into a dictionary that maps each label to a list of (array size, value) tuples; you could restructure it if another shape is easier for plotting.

    txt = """
    ###########
    Average Hybridsort CPU time, array size 1: 110
    Average Mergesort CPU time, array size 1: 98
    
    Average Hybridsort comparisons, array size 1: 0
    Average Mergesort comparisons, array size 1: 0
    ###########
    Average Hybridsort CPU time, array size 2: 118
    Average Mergesort CPU time, array size 2: 156
    
    Average Hybridsort comparisons, array size 2: 0
    Average Mergesort comparisons, array size 2: 1
    ###########
    Average Hybridsort CPU time, array size 3: 133
    Average Mergesort CPU time, array size 3: 175
    
    Average Hybridsort comparisons, array size 3: 1
    Average Mergesort comparisons, array size 3: 4
    ###########
    Average Hybridsort CPU time, array size 4: 121
    Average Mergesort CPU time, array size 4: 170
    
    Average Hybridsort comparisons, array size 4: 2
    Average Mergesort comparisons, array size 4: 6
    """
    
    data = {}
    # Collect the distinct labels (the text before the first comma).
    labels = {line.split(",")[0] for line in txt.splitlines() if line.startswith("Average")}
    print(labels)
    for label in labels:
        for line in txt.splitlines():
            if line.startswith(label):
                head, value = line.split(":")
                # The array size is the only all-digit token before the colon.
                x = int([tok for tok in head.split() if tok.isdigit()][0])
                y = int(value)
                data.setdefault(label, []).append((x, y))
    print(data)
    

    Output:

    {'Average Hybridsort comparisons', 'Average Mergesort CPU time', 'Average Mergesort comparisons', 'Average Hybridsort CPU time'}
    {'Average Hybridsort comparisons': [(1, 0), (2, 0), (3, 1), (4, 2)], 'Average Mergesort CPU time': [(1, 98), (2, 156), (3, 175), (4, 170)], 'Average Mergesort comparisons': [(1, 0), (2, 1), (3, 4), (4, 6)], 'Average Hybridsort CPU time': [(1, 110), (2, 118), (3, 133), (4, 121)]}
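As a possible alternative to cleaning the file by hand, here is a minimal sketch that tolerates the inconsistent labels with a regular expression and then draws the two requested graphs. The names `LINE_RE`, `parse_results`, and `plot_metric` are made up for illustration, the sample text is copied from the question (typos included), and matplotlib is assumed to be installed in the notebook environment.

```python
import re
from collections import defaultdict

# Matches one data line. The metric part accepts "CPU tim"/"CPU time" and
# "comparison"/"comparisons", so the misspelled lines are not dropped.
LINE_RE = re.compile(
    r"Average\s+(?P<algo>\w+)\s+(?P<metric>CPU tim\w*|comparisons?)"
    r",\s*array size\s+(?P<size>\d+):\s*(?P<value>\d+)"
)


def parse_results(text):
    """Return {metric: {algorithm: [(size, value), ...]}} from the raw text."""
    series = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # separator lines ("#####") and blank lines
        # Normalize the metric name so spelling variants share one key.
        metric = "CPU time" if match["metric"].startswith("CPU") else "comparisons"
        point = (int(match["size"]), int(match["value"]))
        series[metric][match["algo"]].append(point)
    return series


def plot_metric(series, metric):
    """Draw one line per algorithm for the given metric and return the figure."""
    # Imported here so the parsing part still runs without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for algo, points in series[metric].items():
        sizes, values = zip(*sorted(points))
        ax.plot(sizes, values, marker="o", label=algo)
    ax.set_xlabel("array size")
    ax.set_ylabel(f"average {metric}")
    ax.legend()
    return fig


sample = """\
###########
Average Hybridsort CPU time, array size 1: 110
Average Mergesort CPU time, array size 1: 98

Average Hybridsort comparisons, array size 1: 0
Average Mergesort comparisons, array size 1: 0
###########
Average Hybridsort CPU time, array size 2: 118
Average Mergesort CPU tim, array size 2: 156

Average Hybridsort comparisons, array size 2: 0
Average Mergesort comparisons, array size 2: 1
###########
Average Hybridsort CPU time, array size 3: 133
Average Mergesort CPU time, array size 3: 175

Average Hybridsort comparisons, array size 3: 1
Average Mergesort comparison, array size 3: 4
"""

series = parse_results(sample)
```

In a notebook cell, calling `plot_metric(series, "CPU time")` and `plot_metric(series, "comparisons")` should give the two graphs, each with one line for Hybridsort and one for Mergesort. For the real file, pass its contents (for example, the result of reading it with `open(...).read()`) to `parse_results` instead of `sample`.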