
Row values get replaced after a new iteration


I have a function that looks like this and I am running it in a for loop:

def findInfo(url, df):
    allLinks = getAllLinks(url)
    katalogLinks = getKatalogLinks(allLinks)
    if len(katalogLinks) == 0:
        df = df.append({'Company URL' : url,
                    'Potential Client' : 0} , 
                    ignore_index=True)
        return df

    else:
        print("catalog links found", url)
        df["Company URL"] = url
        df["Potential Client"] = 1
        pdfLinks = getPDFLinks(katalogLinks)
        print(pdfLinks)
        
        pdfDetails = checkPDFs(url, pdfLinks)
        df = df.append({'Company URL' : url,
                    'Potential Client' : 1, "Number of PDFs found":len(pdfLinks),"Info":pdfDetails} , 
                    ignore_index=True) 
        return df
df = pd.DataFrame()
df["Company URL"] = ""
df["Potential Client"] = ""
lst = ["http://www.aurednik.de/", "https://www.eltako.de/"]
for i in lst:
    df = findInfo(i, df)
    print("DF", df)

df.head()

On the first iteration, when I print the df inside the loop, I get the correct result:

DF                Company URL  Potential Client Info  Number of PDFs found
0  http://www.aurednik.de/                 1   {}                   0.0

On the second iteration, I expected the first row to stay as it is and a second row to be added for the new URL. Instead, the URL in the first row gets replaced, and my final df looks like this:

Company URL Potential Client    Info    Number of PDFs found
0   https://www.eltako.de/  1   {}  0.0
1   https://www.eltako.de/  1   {'https://www.eltako.de/wp-content/uploads/2020/11/Eltako_Gesamtkatalog_LowRes.pdf': {'numberOfPages': 440, 'creationDate': '2017-09-20'}}  1.0

Why is the first row being replaced, and how can I fix it? This probably has something to do with how I save or return df, but I cannot figure out the issue.


Solution

  • In lines 12-13 of findInfo (the first two assignments in the else branch):

            df["Company URL"] = url
            df["Potential Client"] = 1
    

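    you assign a single value to the entire "Company URL" and "Potential Client" columns, which overwrites every existing row with the values of the current iteration. On the first iteration df is still empty, so nothing is visibly overwritten; on the second, the first row's URL is replaced before the new row is appended. Removing those two lines should do the trick, since the df.append call that follows already adds the new row with the correct values.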
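
    To make the overwrite visible, and to show one way of avoiding it, here is a minimal sketch rather than the asker's actual code. It first reproduces the column-wide assignment on a one-row frame, then collects one dict per URL and builds the DataFrame once at the end. That also sidesteps df.append, which was deprecated in pandas 1.4 and removed in 2.0. The helper fake_pdf_lookup, the find_info signature, and the example.com URLs are assumptions made up for illustration; they stand in for getAllLinks, getKatalogLinks, getPDFLinks and checkPDFs, which are not shown in the question.

```python
import pandas as pd

# Part 1: assigning a scalar to a column overwrites that column in every row.
demo = pd.DataFrame({"Company URL": ["http://a.example/"],
                     "Potential Client": [1]})
demo["Company URL"] = "http://b.example/"  # applies to all existing rows
print(demo["Company URL"].tolist())        # the original URL is gone


# Part 2: build rows as plain dicts and create the DataFrame once.
def fake_pdf_lookup(url):
    """Hypothetical stand-in for the question's scraping helpers.

    Returns canned PDF details so the sketch runs offline.
    """
    canned = {
        "https://example.com/with-catalog": {
            "https://example.com/catalog.pdf": {"numberOfPages": 10},
        },
    }
    return canned.get(url, {})


def find_info(url):
    """Return a single row for this URL without touching any DataFrame."""
    pdf_details = fake_pdf_lookup(url)
    if not pdf_details:
        return {"Company URL": url, "Potential Client": 0}
    return {
        "Company URL": url,
        "Potential Client": 1,
        "Number of PDFs found": len(pdf_details),
        "Info": pdf_details,
    }


urls = ["https://example.com/no-catalog", "https://example.com/with-catalog"]
rows = [find_info(u) for u in urls]  # one dict per URL
df = pd.DataFrame(rows)              # missing keys become NaN
print(df[["Company URL", "Potential Client"]])
```

    Because each call returns only its own row, earlier rows cannot be modified by later iterations, and building the frame once avoids copying it on every loop pass.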