python regex pandas dataframe series

Adding a column with values based on extracted date from filename (Length of values (1) does not match length of index (50))


I have been struggling with this issue for some time and would love to hear your thoughts on how to solve it.

I have some files that I need to split. I read that glob is good practice for this. After splitting the files I can read them into a pandas DataFrame. I parse the date from the file name with a regex and want to store it in a new column. My problem is that the length of the DataFrame differs from the length of the parsed date. I tried different approaches with lambda and list comprehensions, but as I am not used to them I obviously have trouble getting the line of code right.

What I do not understand is that if I take, e.g.,

df['date'] = 1

it adds the column to the DataFrame and fills every row with 1. But it does not behave the same way when it is given a variable, which seems weird to me. I read some questions here that go in the same direction, but I was not able to adapt them to my problem.

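import glob
import re

import pandas as pd


filelist = glob.glob('./wso-meistdiskutiert/*meistdiskutiert')
type(filelist)

df = pd.DataFrame()  # collects the tables from all files
for f in filelist:
    df_tmp = pd.read_html(f, decimal='.', thousands='.')[1]
    date = re.findall(r'\d+', f)  # findall returns a list of strings, not a single string
    df_tmp['date'] = date  # raises: Length of values (1) does not match length of index (50)
    df = pd.concat([df, df_tmp])  # DataFrame.append was removed in pandas 2.0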
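
The mismatch can be reproduced without the HTML files. The following is only a minimal sketch, assuming a small made-up DataFrame with three rows and a made-up date string: a scalar is broadcast to every row, while a one-element list is treated as the column's values and therefore has to match the index length.

```python
import pandas as pd

# Made-up data for illustration; the real tables come from pd.read_html.
df = pd.DataFrame({"title": ["a", "b", "c"]})

# A scalar is broadcast to all three rows.
df["scalar"] = 1

# A one-element list is NOT broadcast: pandas expects one value per row.
try:
    df["from_list"] = ["20200101"]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Taking the string out of the list gives a scalar again, so it is broadcast.
df["from_str"] = ["20200101"][0]

print(error_message)
print(df)
```

This is the same situation as in the loop above: `re.findall` always returns a list, even when there is only one match.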

Solution

  • OK, I found the problem. The variable date holds a list with one value, because re.findall returns a list. pandas treats a list as the column's values, so it must have the same length as the DataFrame. As this is not the case, you get the error. I now take the string out of the list, which works fine.

    for f in filelist:
        df_tmp = pd.read_html(f, decimal='.', thousands='.')[1]
        datetime = re.findall(r'\d+', f)  # list of all digit runs in the file name
        print('datetime is type = ', type(datetime))
        datetime = datetime[0]  # <-- take the needed string out of the list
        df_tmp.insert(11, "date", datetime)  # a scalar string is broadcast to every row
        display(df_tmp)  # display() is available in Jupyter/IPython
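
To also combine the per-file tables into one DataFrame, one possible approach is to collect them in a list and concatenate once at the end. This is a sketch under assumptions, not the exact code from the question: the helper `add_date_column` and the file names are hypothetical, and small in-memory DataFrames stand in for the tables returned by `pd.read_html`. It uses `re.search` instead of `re.findall`, which yields the first match directly rather than a list.

```python
import re

import pandas as pd


def add_date_column(frame, filename):
    """Return a copy of frame with a 'date' column holding the first digit run in filename."""
    match = re.search(r"\d+", filename)
    date = match.group(0) if match else None  # a single string (or None), i.e. a scalar
    return frame.assign(date=date)            # the scalar is broadcast to every row


# Hypothetical file names and stand-in tables; the real ones come from glob and pd.read_html.
tables = {
    "./wso-meistdiskutiert/20200101-meistdiskutiert": pd.DataFrame({"rank": [1, 2]}),
    "./wso-meistdiskutiert/20200102-meistdiskutiert": pd.DataFrame({"rank": [1, 2, 3]}),
}

frames = [add_date_column(table, name) for name, table in tables.items()]
df = pd.concat(frames, ignore_index=True)  # one concat instead of appending in the loop
print(df)
```

Concatenating once avoids growing the DataFrame inside the loop, and it does not rely on `DataFrame.append`, which was removed in pandas 2.0.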