How to transfer data from an STDF file to a Pandas dataframe in Python


I have data flowing in from files in STDF format, the test-machine output format used in the semiconductor manufacturing industry. I need to read these files in Python and analyze machine downtime and the other details recorded in them. I searched GitHub and other platforms for solutions, but the available Python modules are buggy and too poorly documented to build on.


Solution

  • I suggest pystdf.

    From my experience, that library is essentially bug-free, although performance is somewhat slow on big files. You'll still have to understand and sort through all the record types yourself for data analysis purposes.
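
    pystdf is published on PyPI, so it should install with a plain pip command (assuming a standard Python environment):

        pip install pystdf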

    Sample use below (this snippet reads multiple STDF files into a pandas dataframe per record type).

    import os
    import pandas as pd
    from io import StringIO
    import pystdf.V4 as v4
    from pystdf.IO import Parser
    from pystdf.Writers import TextWriter
    
    
    def stdf_to_dfs(filelist):
        ''' Takes a list of STDF files and returns a dict of dataframes, one per record type,
        concatenated across files. Each row also carries its ATDF line number and source file name.'''
    
        record_dfs = {}
        for file in filelist:
            filename = os.path.basename(file)
            # parse the binary STDF and capture TextWriter's ATDF-style text output in memory
            captured_std_out = StringIO()
            with open(file, 'rb') as f:
                p = Parser(inp=f)
                p.addSink(TextWriter(captured_std_out))
                p.parse()
            atdf = captured_std_out.getvalue()
    
            # prepend line number and source file name to captured_std_out so it can be sorted later
            # line number is 2nd field... 1st field is record_type
            atdf = atdf.split('\n')
            for n, line in enumerate(atdf):
                atdf[n] = line[:4] + str(n) + '|' + filename + '|' + line[4:]
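            # each line now looks like 'XXX|<line#>|<file>|<original fields...>';
            # all STDF V4 record names are 3 characters, so line[:4] captures 'XXX|'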
    
            # read each record type into a separate dataframe
            for record_type in v4.records:
                record_name = record_type.name.split('.')[-1].upper()
                curr = [line for line in atdf if line.startswith(record_name)]
                curr = '\n'.join(curr)
                if curr:  # skip record types that never occur in these files
                    header_names = ['Record', 'LineNum', 'SourceFile'] + [field[0] for field in record_type.fieldMap]
                    if record_name not in record_dfs:
                        record_dfs[record_name] = pd.DataFrame()
                    record_dfs[record_name] = pd.concat([record_dfs[record_name], pd.read_csv(
                        StringIO(curr), header=None, names=header_names, delimiter='|')])
    
        # drop any record dataframes that ended up empty
        record_dfs = {k: v for k, v in record_dfs.items() if not v.empty}
    
        return record_dfs
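
    For example (file names here are placeholders; PTR is the STDF V4 Parametric Test Record, one of the record types you'd typically analyze):

    dfs = stdf_to_dfs(['lot1.stdf', 'lot2.stdf'])
    print(dfs.keys())          # record types found across the files
    print(dfs['PTR'].head())   # parametric test results, if any were present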