Filtering and extracting text with Python


I am fairly new to coding and recently came across a problem that I wanted to try solving with Python. Below is the text content that I want to query, extracting certain fields into a new file. The content is repetitive and can run to several thousand lines. Currently I can only parse and output the first two columns, and even those look wrong. Any guidance would be appreciated. Cheers!

Original TXT File:


Classroom arrangement : 1A-1
(Student Name: Jess, Subject: EC001, Time: 9am - 10am)
(Student Name: Whit, Subject: EC001, Time: 9am - 10am)
(Student Name: Jon, Subject: EC0011, Time: 11am - 12pm)
(Student Name: Kevin, Subject: EC011, Time: 11am - 12pm)
(Student Name: Jess, Subject: EC011, Time: 11am - 12pm)


Classroom arrangement : 1A-2
(Student Name: Jess, Subject: EC002, Time: 11am - 12pm)
(Student Name: Whit, Subject: EC002, Time: 11am - 12pm)
(Student Name: Jon, Subject: EC002, Time: 11am - 12pm)
(Student Name: Kevin, Subject: EC002, Time: 11am - 12pm)
(Student Name: Claire, Subject: EC011, Time: 2pm - 3pm)
(Student Name: Joshua, Subject: EC0011, Time: 2pm - 3pm)
(Student Name: Florence, Subject: EC011, Time: 2pm - 3pm)
(Student Name: Neil, Subject: EC011, Time: 2am - 3pm)

Intended Output:


Classroom: 1A-1, Jess, Subject: EC001 Time: 9am - 10am, Subject: EC011, Time: 11am - 12pm
Classroom: 1A-1, Whit, Subject: EC001 Time: 9am - 10am
Classroom: 1A-1, Jon, Subject: EC0011 Time: 11am - 12pm
Classroom: 1A-1, Kevin, Subject: EC011 Time: 11am - 12pm
Classroom: 1A-2, Jess, Subject: EC002 Time: 11am - 12pm
Classroom: 1A-2, Jon, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Whit, Subject: EC002 Time: 11am - 12pm
Classroom: 1A-2, Kevin, Subject: EC002, Time: 11am - 12pm
Classroom: 1A-2, Claire, Subject: EC011, Time: 2pm - 3pm
Classroom: 1A-2, Joshua, Subject: EC0011, Time: 2pm - 3pm
Classroom: 1A-2, Florence, Subject: EC011, Time: 2pm - 3pm
Classroom: 1A-2, Neil, Subject: EC011, Time: 2am - 3pm

I tried passing the output of readlines through my own functions before printing to the console, but the result looks wrong: I need the classroom (for example, 1A-1) to precede each line.

Current Output:


Class 1A-1
Jess, Subject: EC001 Time: 9am - 10am
Jess, Subject: EC011, Time: 11am - 12pm
Whit, Subject: EC001 Time: 9am - 10am
Jon, Subject: EC0011 Time: 11am - 12pm
Kevin, Subject: EC011 Time: 11am - 12pm


Solution

  • Here's one way to do it. You'll need to change the input and output file paths at the bottom of the script:

    classroom.py

    import collections
    
    
    def ingest(infilepath):
        """
        Read all the input from the input file.
        Store it in a nested dictionary so that we can write it out later.
        We'll use a collections.defaultdict to make life easier:
            {classroom name: {student name: [records...]}}
        It is keyed by student name since a student can have multiple courses in a classroom.
        """
        answer = collections.defaultdict(lambda: collections.defaultdict(list))
        with open(infilepath) as infile:
            classes = infile.read().split('\n\n')  # divide the input into blocks of classrooms
            classes = [c.strip() for c in classes if c.strip()]  # strip extra whitespace and drop empty blocks (e.g. from leading blank lines)
    
        for classblock in classes:
            name, *records = classblock.splitlines()  # student records per classroom
            name = name.split(':',1)[-1].strip()
            for record in records:
                record = record.replace("(", "").replace(")", '')  # strip out the "()". We don't need that
                kvs = record.split(',')
    
                d = dict(kv.split(":", 1) for kv in kvs)  # split on the first ":" only, in case a value contains one
                d = {k.strip():v.strip() for k,v in d.items()}
    
                answer[name][d['Student Name']].append(d)
    
        return answer
    
    
    def output(outfilepath, data):
        order = ("Subject", "Time")  # the order in which we want to write the output
        with open(outfilepath, 'w') as outfile:
            for classname, students in data.items():
                for studentname, records in students.items():
                    outfile.write(f"Classroom: {classname}, {studentname}, ")
                    out = []  # maintain the line output in a list. We'll join everything up later
                    for record in records:
                        for k in order:
                            out.append(f"{k}: {record[k]}, ")
    
                    out = ''.join(out)  # this is the file output
                    out = out.strip().rstrip(',')  # strip out the trailing ','
                    outfile.write(f'{out}\n')
    
    
    if __name__ == "__main__":
        print('starting')
    
        data = ingest('path/to/input/file')
        output('path/to/output/file', data)
    
        print('done')
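
    To make the nested collections.defaultdict used in ingest() a little more concrete, here is a minimal sketch that builds the same shape by hand. The sample rows are copied from the question's data; group_records is a hypothetical helper name used only for this illustration and is not part of the script above.

```python
import collections


def group_records(rows):
    """Group (classroom, student, subject, time) tuples the way ingest() does."""
    grouped = collections.defaultdict(lambda: collections.defaultdict(list))
    for classroom, student, subject, time in rows:
        # Missing keys are created on first access, so no "if key in dict" checks are needed.
        grouped[classroom][student].append({"Subject": subject, "Time": time})
    return grouped


rows = [
    ("1A-1", "Jess", "EC001", "9am - 10am"),
    ("1A-1", "Whit", "EC001", "9am - 10am"),
    ("1A-1", "Jess", "EC011", "11am - 12pm"),
]
grouped = group_records(rows)
print(len(grouped["1A-1"]["Jess"]))  # Jess has two entries in classroom 1A-1
```

    Because a student's records are appended to a single list, a student who appears twice in one classroom ends up on one output line, which is what the intended output asks for.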
    

    I used this input (notice the blank lines at the start of the file):

    
    
    Classroom arrangement : 1A-1
    (Student Name: Jess, Subject: EC001, Time: 9am - 10am)
    (Student Name: Whit, Subject: EC001, Time: 9am - 10am)
    (Student Name: Jon, Subject: EC0011, Time: 11am - 12pm)
    (Student Name: Kevin, Subject: EC011, Time: 11am - 12pm)
    (Student Name: Jess, Subject: EC011, Time: 11am - 12pm)
    
    
    Classroom arrangement : 1A-2
    (Student Name: Jess, Subject: EC002, Time: 11am - 12pm)
    (Student Name: Whit, Subject: EC002, Time: 11am - 12pm)
    (Student Name: Jon, Subject: EC002, Time: 11am - 12pm)
    (Student Name: Kevin, Subject: EC002, Time: 11am - 12pm)
    (Student Name: Claire, Subject: EC011, Time: 2pm - 3pm)
    (Student Name: Joshua, Subject: EC0011, Time: 2pm - 3pm)
    (Student Name: Florence, Subject: EC011, Time: 2pm - 3pm)
    (Student Name: Neil, Subject: EC011, Time: 2am - 3pm)
    

    I got this output:

    Classroom: 1A-1, Jess, Subject: EC001, Time: 9am - 10am, Subject: EC011, Time: 11am - 12pm
    Classroom: 1A-1, Whit, Subject: EC001, Time: 9am - 10am
    Classroom: 1A-1, Jon, Subject: EC0011, Time: 11am - 12pm
    Classroom: 1A-1, Kevin, Subject: EC011, Time: 11am - 12pm
    Classroom: 1A-2, Jess, Subject: EC002, Time: 11am - 12pm
    Classroom: 1A-2, Whit, Subject: EC002, Time: 11am - 12pm
    Classroom: 1A-2, Jon, Subject: EC002, Time: 11am - 12pm
    Classroom: 1A-2, Kevin, Subject: EC002, Time: 11am - 12pm
    Classroom: 1A-2, Claire, Subject: EC011, Time: 2pm - 3pm
    Classroom: 1A-2, Joshua, Subject: EC0011, Time: 2pm - 3pm
    Classroom: 1A-2, Florence, Subject: EC011, Time: 2pm - 3pm
    Classroom: 1A-2, Neil, Subject: EC011, Time: 2am - 3pm
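
    Since the question mentions files of several thousand lines, a line-by-line variant may also be worth considering. The sketch below is an alternative to the script above, not a replacement for it: it remembers the most recent classroom header and attaches it to every record that follows. The two regular expressions assume the exact line shapes shown in the sample input, and parse_lines and format_rows are hypothetical names chosen for this example.

```python
import re

# Assumed line shapes, taken from the sample input in the question.
HEADER = re.compile(r"Classroom arrangement\s*:\s*(\S+)")
RECORD = re.compile(r"\(Student Name:\s*(.+?),\s*Subject:\s*(.+?),\s*Time:\s*(.+?)\)")


def parse_lines(lines):
    """Return {(classroom, student): [(subject, time), ...]} in first-seen order."""
    grouped = {}
    classroom = None
    for line in lines:
        line = line.strip()
        header = HEADER.match(line)
        if header:
            classroom = header.group(1)  # remember it for the records that follow
            continue
        record = RECORD.match(line)
        if record and classroom is not None:
            student, subject, time = record.groups()
            grouped.setdefault((classroom, student), []).append((subject, time))
    return grouped


def format_rows(grouped):
    """Build one output line per (classroom, student) pair."""
    rows = []
    for (classroom, student), entries in grouped.items():
        fields = ", ".join(f"Subject: {s}, Time: {t}" for s, t in entries)
        rows.append(f"Classroom: {classroom}, {student}, {fields}")
    return rows


sample = """
Classroom arrangement : 1A-1
(Student Name: Jess, Subject: EC001, Time: 9am - 10am)
(Student Name: Whit, Subject: EC001, Time: 9am - 10am)
(Student Name: Jess, Subject: EC011, Time: 11am - 12pm)

Classroom arrangement : 1A-2
(Student Name: Jess, Subject: EC002, Time: 11am - 12pm)
"""
for row in format_rows(parse_lines(sample.splitlines())):
    print(row)
```

    parse_lines accepts any iterable of lines, so an open file object can be passed in directly and is then read one line at a time. Blank lines and anything that matches neither pattern are simply skipped.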
    

    Hope this helps