python pandas dataframe normalization

Delete numbers smaller than 3 digits in a list while the number of items stays the same


I want to normalize my list containing years. It is important that the number of items in the list stays the same, because I'm going to convert the list to a dataframe and the rows need to align with the other variables. This is the list I have. It contains many different ways of notating the year:

['1817 (1817p)', '1800-1824 (19.1q)', '1825-1849', 'ca. 1850', '1856–60', '1861-07-XX', 'copied between 1824 and 1845', 'copied d. 14tn Merz 1767', '1718']

Now, I would like to get only 1 year per item in the list. For example:

['1817', '1800', '1825', '1850', '1856', '1861', '1824', '1767', '1718']

If an item contains two years, choose the first one. (Bonus points if you can get the mean when an item contains two years.)

To get the desired result, I removed everything within brackets and replaced the hyphens and en dashes with spaces.

import re

data2 = []

for i in data:
    df8 = re.sub(r"\([^()]*\)", "", i)
    df10 = re.sub((r'\–'), " ", df8)
    df11 = re.sub((r'\-'), " ", df10)
    data2 += [df11]
print(data2)

Output 1:

['1817 ', '1800 1824 ', '1825 1849', 'ca. 1850', '1856 60', '1861 07 XX', 'copied between 1824 and 1845', 'copied d. 14tn Merz 1767', '1718']

Then I iterated through the items, but I end up with more items in the list than at the beginning.

ls = data2
ls2 = []
 
for i in ls:
    res = re.findall(r'\w+', i)
    for w in res:
        if len(w) > 3:
            ls2.append(w)
print(ls2)

Output 2:

['1817', '1800', '1824', '1825', '1849', '1850', '1856', '1861', 'copied', 'between', '1824', '1845', 'copied', '14tn', 'Merz', '1767', '1718']

Solution

  • One option is to combine the re and numpy modules:

    import re
    import numpy as np
    myList = ['1817 (1817p)', '1800-1824 (19.1q)', '1825-1849', 'ca. 1850', '1856–60', '1861-07-XX', 'copied between 1824 and 1845', 'copied d. 14tn Merz 1767', '1718']
    # A raw string keeps \d from being read as an invalid string escape
    [np.array(re.findall(r"\d{4}", x)).astype("int").mean() for x in myList]
    

    Output

    [1817.0, 1812.0, 1837.0, 1850.0, 1856.0, 1861.0, 1834.5, 1767.0, 1718.0]
    

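    This gives the mean of all four-digit numbers found in each element, so the result has exactly one value per item and the list keeps its original length. Note that it returns the mean rather than the first year, and an element without any four-digit number would produce nan.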
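
    If the first year per item is wanted instead, as in the desired output in the question, a minimal sketch using only the re module could look like the following. The helper names first_year and mean_year are hypothetical, and the sketch assumes that any run of four digits is a year. Items without a year become None, so the list length never changes.

```python
import re

# Assumption: any run of four digits counts as a year.
YEAR = re.compile(r"\d{4}")

def first_year(item):
    """Return the first four-digit year in item as a string, or None."""
    match = YEAR.search(item)
    return match.group() if match else None

def mean_year(item):
    """Return the mean of all four-digit years in item, or None."""
    years = [int(y) for y in YEAR.findall(item)]
    return sum(years) / len(years) if years else None

my_list = ['1817 (1817p)', '1800-1824 (19.1q)', '1825-1849', 'ca. 1850',
           '1856–60', '1861-07-XX', 'copied between 1824 and 1845',
           'copied d. 14tn Merz 1767', '1718']

# One result per input item, so the rows still align with the other columns.
first_years = [first_year(x) for x in my_list]
mean_years = [mean_year(x) for x in my_list]

print(first_years)
print(mean_years)
```

    Because each comprehension yields exactly one value per input item, either list can be assigned directly to a dataframe column alongside the other variables.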