
How can I clean a year column with messy values?


I have a project I'm working on for a data analysis course, where we pick a data set and go through the steps of cleaning and exploring the data with a question to answer in mind.

I want to see how many rows fall in each year, but the Year column currently has dtype object and its values are inconsistent: full years like 1998, just the last two digits like 87, and ranges or approximations of presumed years ('early 1990's', '89 or 90', '2011- 2012', 'approx 2001').

I'm trying to determine the best way to convert all these variants to a proper format. Or would it be better to drop the values that are not definitive? I worry that dropping them would lose too much data, because the dataset is already pretty small (about 5000 rows total).

I have looked into regex, and it seems like the right path for keeping and altering the values, but I still don't understand it very well conceptually, and I worry about the efficiency of filtering for so many different value variations.

I'm still very new to Python and pandas.


Solution

  • Assuming the values in your Year column are strings, I would write a normalize function like this:

    import re
    import pandas as pd
    
    data = [
        {"year": "early 1990's"},
        {"year": "89 or 90"},
        {"year": "2011-2012"},
        {"year": "approx 2001"},
    ]
    
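    def normalize(row):
        year = row["year"]
    
        # Count the digits to decide which pattern applies
        count = len(re.findall(r"\d", year))
    
        if count == 4:
            # Exactly one four-digit year, e.g. "approx 2001"
            if m := re.search(r"\d{4}", year):
                return m.group(0)
    
        if count == 2:
            # Two-digit year, e.g. "87"; this assumes the 1900s
            if m := re.search(r"\d\d", year):
                return "19" + m.group(0)
    
        # Anything else (ranges, alternatives) is left unmatched for review
        return None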
    
    df = pd.DataFrame(data)
    df["normalized"] = df.apply(normalize, axis=1)
    print(df)
    
    =>
               year normalized
    0  early 1990's       1990
    1      89 or 90       None
    2     2011-2012       None
    3   approx 2001       2001
    

    The function returns None for any value that matches neither pattern. You can list those rows as follows:

    >>> print(df[df["normalized"].isnull()])
    ...
            year normalized
    1   89 or 90       None
    2  2011-2012       None
    

    Review the output and adjust the normalize function as needed. Repeat these steps until you are satisfied with the result.
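
As a rough sketch of what one further iteration might look like, the variant below keeps the first year mentioned in each value, so ranges and alternatives such as '89 or 90' or '2011- 2012' are no longer dropped. The names `normalize_first`, `expand`, `YEAR_TOKEN` and `PIVOT` are made up for this illustration, and both the cutoff of 30 for two-digit years and the choice to keep the first year of a range are assumptions you should check against your own data.

```python
import re

import pandas as pd

# Hypothetical cutoff: two-digit values above it are read as 19YY, the rest as 20YY.
PIVOT = 30

# A run of exactly four or exactly two digits that is not part of a longer number.
YEAR_TOKEN = re.compile(r"(?<!\d)(\d{4}|\d{2})(?!\d)")


def expand(token):
    """Turn a two- or four-digit token into a four-digit integer year."""
    if len(token) == 4:
        return int(token)
    yy = int(token)
    return (1900 if yy > PIVOT else 2000) + yy


def normalize_first(value):
    """Return the first year mentioned in value, or None if there is none."""
    m = YEAR_TOKEN.search(str(value))
    return expand(m.group(1)) if m else None


df = pd.DataFrame(
    {
        "year": [
            "1998",
            "87",
            "early 1990's",
            "89 or 90",
            "2011- 2012",
            "approx 2001",
            "unknown",
        ]
    }
)

# Nullable integer dtype keeps years as integers while allowing missing values.
df["normalized"] = df["year"].map(normalize_first).astype("Int64")

# Share of rows that still could not be parsed, to judge how much would be lost.
unmatched_share = df["normalized"].isna().mean()

# Rows per year; missing values are left out by default.
counts = df["normalized"].value_counts().sort_index()

print(df)
print(f"unmatched: {unmatched_share:.1%}")
print(counts)
```

With about 5000 rows, running one compiled pattern per value like this should not be a performance concern. Whether a range belongs to its first year, its last year, or neither depends on the question you want to answer, so it is worth stating that choice in your analysis.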