
I'm very new to Python and don't know how to put a dataset into two different lists


I am given a dataset (student number, first name, last name, date of birth, study program) and have to write a program that processes it and puts each record into one of two lists: valid data or corrupted data. Data values are sometimes corrupted, and the program must report them. Any invalid or empty value counts as corrupted.

  • Student number has this format: 7 digits, starting with 0, and the second digit (from the left) must be either 9 or 8. Example: 0212345 is not valid.

  • First name and last name contain only letters of the alphabet.

  • Date of birth has this format: YYYY-MM-DD, with days between 1 and 31, months between 1 and 12, and years between 1960 and 2004.

  • Study program can have one of these values: INF, TINF, CMD, AI.

I also have a CSV file with the dataset, which looks like this:

0893527,Ruggiero,Fifield,1976-08-18,DS
0944991,Vanny,Jerromes,1996-08-10,TINF
0959490,Abbe,Trees,1986-11-29,DS

This is obviously not the entire list, but the rest looks exactly the same.

I really need help with this since I'm getting nowhere. Any help and/or tips are appreciated.

This is the code I have so far:

import os
import sys

valid_lines = []
corrupt_lines = []



def validate_data(line):
    pass

def main(csv_file):
    with open(os.path.join(sys.path[0], csv_file), newline='').readlines() as csv_file:

        next(csv_file)

        for line in csv_file:
            validate_data(line.strip())
            for digits in csv_file:
               if csv_file[1] != (8,9):
                   print('')



    print('### VALID LINES ###')
    print("\n".join(valid_lines))
    print('### CORRUPT LINES ###')
    print("\n".join(corrupt_lines))


if __name__ == "__main__":    
    main('students.csv')

Solution

  • You can use the re module to validate the student number and the names, str.split to break the date into its parts, and a set to check the study program:

    import re
    import csv
    
    valid, corrupted = [], []
    
    pat_number = re.compile(r"^0[89]\d{5}$")
    pat_names = re.compile(r"^[a-zA-Z]+$")
    
    valid_programs = {"INF", "TINF", "CMD", "AI"}
    
    with open("your_data.csv", "r") as f_in:
        reader = csv.reader(f_in)
        for row in reader:
            number, first_name, last_name, date, program = row
    
            match = pat_number.search(number)
            if not match:
                print(f"{number=} invalid")
                corrupted.append(row)
                continue
    
            match = pat_names.search(first_name)
            if not match:
                print(f"{first_name=} invalid")
                corrupted.append(row)
                continue
    
            match = pat_names.search(last_name)
            if not match:
                print(f"{last_name=} invalid")
                corrupted.append(row)
                continue
    
            try:
                y, m, d = map(int, date.split("-"))
    
                if y < 1960 or y > 2004:
                    print(f"{y=} invalid")
                    corrupted.append(row)
                    continue
    
                if m < 1 or m > 12:
                    print(f"{m=} invalid")
                    corrupted.append(row)
                    continue
    
                if d < 1 or d > 31:
                    print(f"{d=} invalid")
                    corrupted.append(row)
                    continue
            except ValueError:  # non-numeric part or wrong number of parts
                print(f"{date=} invalid")
                corrupted.append(row)
                continue
    
            if program not in valid_programs:
                print(f"{program=} invalid")
                corrupted.append(row)
                continue
    
            valid.append(row)
    
    print(f"{valid=}")
    print("-" * 80)
    print(f"{corrupted=}")
    

    For the three sample rows from the question, this prints:

    program='DS' invalid
    program='DS' invalid
    valid=[['0944991', 'Vanny', 'Jerromes', '1996-08-10', 'TINF']]
    --------------------------------------------------------------------------------
    corrupted=[['0893527', 'Ruggiero', 'Fifield', '1976-08-18', 'DS'], ['0959490', 'Abbe', 'Trees', '1986-11-29', 'DS']]
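
If it helps, the same checks could also be packed into a function that reports every corrupted field of a row instead of stopping at the first one. This is only a sketch under a few assumptions: the names find_errors and split_rows are made up for this example, the date must be written with exactly four, two and two digits (the loop above would also accept 1996-8-10), and a row with the wrong number of columns is treated as corrupted. The sample rows are read from a string here so the snippet runs without a file.

```python
import csv
import io
import re

# Patterns and allowed values taken from the rules in the question.
PAT_NUMBER = re.compile(r"0[89][0-9]{5}")
PAT_NAME = re.compile(r"[A-Za-z]+")
PAT_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")
VALID_PROGRAMS = {"INF", "TINF", "CMD", "AI"}


def find_errors(row):
    """Return the names of all invalid fields in row (hypothetical helper)."""
    if len(row) != 5:
        # Assumption: a row with missing or extra columns is corrupted as a whole.
        return ["column count"]
    number, first_name, last_name, date, program = row
    errors = []
    if not PAT_NUMBER.fullmatch(number):
        errors.append("student number")
    if not PAT_NAME.fullmatch(first_name):
        errors.append("first name")
    if not PAT_NAME.fullmatch(last_name):
        errors.append("last name")
    match = PAT_DATE.fullmatch(date)
    if match:
        year, month, day = map(int, match.groups())
        # Same ranges as in the question; no calendar check (e.g. 02-31 passes).
        date_ok = 1960 <= year <= 2004 and 1 <= month <= 12 and 1 <= day <= 31
    else:
        date_ok = False
    if not date_ok:
        errors.append("date of birth")
    if program not in VALID_PROGRAMS:
        errors.append("study program")
    return errors


def split_rows(lines):
    """Split CSV lines into valid rows and (row, errors) pairs."""
    valid, corrupted = [], []
    for row in csv.reader(lines):
        errors = find_errors(row)
        if errors:
            corrupted.append((row, errors))
        else:
            valid.append(row)
    return valid, corrupted


# The three sample rows from the question, read from memory instead of a file.
sample = io.StringIO(
    "0893527,Ruggiero,Fifield,1976-08-18,DS\n"
    "0944991,Vanny,Jerromes,1996-08-10,TINF\n"
    "0959490,Abbe,Trees,1986-11-29,DS\n"
)
valid, corrupted = split_rows(sample)
print(valid)
for row, errors in corrupted:
    print(row, "->", ", ".join(errors))
```

To use it on a real file, pass the open file object to split_rows instead of the StringIO. If the file starts with a header line, skip it first with next(f_in).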