I want to normalize a list of years. It is important that the number of items in the list stays the same, because I'm going to convert the list to a dataframe and the rows need to align with the other variables. This is the list I have. It contains many different ways of notating the year:
['1817 (1817p)', '1800-1824 (19.1q)', '1825-1849', 'ca. 1850', '1856–60', '1861-07-XX', 'copied between 1824 and 1845', 'copied d. 14tn Merz 1767', '1718']
Now, I would like to get only 1 year per item in the list. For example:
['1817', '1800', '1825', '1850', '1856', '1861', '1824', '1767', '1718']
If there are two years in 1 item, then choose the first year. (Bonus points if you could get the mean when an item contains 2 years.)
In order to get the desired result, I removed everything within parentheses and replaced "-" and "–" with spaces.
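import re
# data is the list shown above
data2 = []
for i in data:
    # drop anything in parentheses, e.g. "(1817p)"
    df8 = re.sub(r"\([^()]*\)", "", i)
    # replace en dashes and hyphens with spaces
    df10 = re.sub(r"–", " ", df8)
    df11 = re.sub(r"-", " ", df10)
    data2.append(df11)
print(data2)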
Output 1:
['1817 ', '1800 1824 ', '1825 1849', 'ca. 1850', '1856 60', '1861 07 XX', 'copied between 1824 and 1845', 'copied d. 14tn Merz 1767', '1718']
Then I iterated through the items, but I ended up with more items in the list than I started with.
ls = data2
ls2 = []
for i in ls:
    res = re.findall(r'\w+', i)
    for w in res:
        # keeps every token longer than 3 characters, not just years
        if len(w) > 3:
            ls2.append(w)
print(ls2)
Output 2:
['1817', '1800', '1824', '1825', '1849', '1850', '1856', '1861', 'copied', 'between', '1824', '1845', 'copied', '14tn', 'Merz', '1767', '1718']
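The list grows because the inner loop appends every token longer than three characters, so an item with two years contributes two entries and words such as 'copied' slip through as well. A minimal sketch that keeps exactly one entry per item, assuming a year is always the first run of four digits in the string. `first_year` is a hypothetical helper name, and returning None for an item without a year is an assumption made here so that the list length never changes:

```python
import re

data = ['1817 (1817p)', '1800-1824 (19.1q)', '1825-1849', 'ca. 1850',
        '1856–60', '1861-07-XX', 'copied between 1824 and 1845',
        'copied d. 14tn Merz 1767', '1718']

def first_year(text):
    """Return the first run of four digits in text, or None if there is none."""
    match = re.search(r"\d{4}", text)
    return match.group() if match else None

# One output element per input element, so rows stay aligned.
years = [first_year(item) for item in data]
print(years)
```

Because `re.search` stops at the first match, the cleaning step (removing parentheses, replacing dashes) should not be needed for this particular list.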
One option is to combine the re and numpy modules:
import re
import numpy as np
myList = ['1817 (1817p)', '1800-1824 (19.1q)', '1825-1849', 'ca. 1850', '1856–60', '1861-07-XX', 'copied between 1824 and 1845', 'copied d. 14tn Merz 1767', '1718']
[np.array(re.findall(r"\d{4}", x)).astype("int").mean() for x in myList]
[1817.0, 1812.0, 1837.0, 1850.0, 1856.0, 1861.0, 1834.5, 1767.0, 1718.0]
This gives the mean of the four-digit numbers found in each element, so the result has the same number of items as the input list.
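One caveat with the approach above: an item that contains no four-digit number produces an empty array, and its mean comes back as nan along with a NumPy warning. A small sketch without NumPy that makes this case explicit. `mean_year` is a hypothetical helper name, and using None for "no year found" is an assumption, not part of the original answer:

```python
import re

YEAR = re.compile(r"\d{4}")

def mean_year(text):
    """Return the mean of all four-digit numbers in text, or None if there are none."""
    years = [int(y) for y in YEAR.findall(text)]
    return sum(years) / len(years) if years else None

myList = ['1817 (1817p)', '1800-1824 (19.1q)', '1825-1849', 'ca. 1850',
          '1856–60', '1861-07-XX', 'copied between 1824 and 1845',
          'copied d. 14tn Merz 1767', '1718']

# Same length as myList; items without a year would become None.
means = [mean_year(x) for x in myList]
print(means)
```

Note that `\d{4}` also matches the first four digits of a longer number, so data containing things like catalogue numbers may need a stricter pattern.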