Search code examples
pythonpython-3.xpandasdataframepycharm

Pandas python merge columns


i`m trying to merge a column from another .xlsx file to another file.

This is the column(repoPath) that i want to add from another file: repoPath from another file

And this is the file where i want to add it:

Where to add "column C"

And this is my script:

from svnscripts.timestampdirectory import  createdir,path_dir
import os
import time

def gitrepolastchangedate():
    pathdest=path_dir()
    dest=createdir()
    timestr = time.strftime("%Y-%m-%d")
    #below you read the file from the declared path

    ###processing FOR "TCDD"
    df = pd.read_csv(rf"{pathdest}\{timestr}-rawGitData-reposLastChangeDate-tcdd.csv", sep='|', header=None)
    dfGitReposSize=pd.read_excel(os.path.join(dest,"GitReposSize.xlsx"))
    df.columns = ['repoPath']
    df[['repoPath', 'repoLastChangeDate']] = df["repoPath"].str.split(";", 1, expand=True)
    df["repoLastChangeDate"] = df["repoLastChangeDate"].str.strip("-")
    df.to_excel(os.path.join(dest, "GitRepoLastChangeDate.xlsx"), index=False)
    Result = pd.merge(
        df,
        dfGitReposSize[
            [
                'repoPath',


            ]
        ],
        left_on='repoPath',
        right_on='repoPath'
    )
    Result.fillna("N/A", inplace=True)
    print(Result)
    #Result.to_excel(os.path.join(dest, "GitRepoLastChangeDate.xlsx"), index=False)
gitrepolastchangedate()

The idea is that im trying to add the 'repoPath' from the other file comparing the keys, but idea is that the keys isn`t exactly the same. Because on file that i want to add the keys contains in the begining that : /srv/git +key

So what im thinking its the right key should contains a str{/srv/git} +key or idk..

The output that i receive is: Its not even creating the 3rd column...

output

Example for both file.xlsx:

file1:

repoName             repoSize        repoPath
./m4r_media-sync.git    812K    tcdd/m4r_media-sync.git
./m4r_isd-software.git  538M    tcdd/m4r_isd-software.git
./ext_oam-updater.git   2.5M    tcdd/ext_oam-updater.git
./build.git 9.8M    tcdd/build.git
./m4r_isd-adminapp.git  3.5M    tcdd/m4r_isd-adminapp.git
./m4r_moving-map.git    4.5M    tcdd/m4r_moving-map.git
./m4r_inet-portal.git   332K    tcdd/m4r_inet-portal.git

FILE WHERE TO ADD:

repoPath    repoLastChangeDate
/srv/git/tcdd/build.git 05/23/2022
/srv/git/tcdd/ext_apache2.2-fix.git 10/18/2016
/srv/git/tcdd/ext_oam-updater.git   05/19/2022
/srv/git/tcdd/m4r_cctv-trx.git  07/28/2017
/srv/git/tcdd/m4r_ffmpeg-tcdd.git   08/04/2016
/srv/git/tcdd/m4r_fleetstat.git 09/11/2020

so the column "repoPath" from file 1 needs to be add on file 2 on column C having the name (repoName) for the coresponding row of this file


Solution

  • As I see all your rows for repoPath in the first dataframe contains prefix /srv/git. We can merge both the dataframe on repoPath excluding the prefix /srv/git/

    df1 = pd.read_excel(r"GitReposSize.xlsx")
    df2 = pd.read_excel(r"GitRepoLastChangeDate.xlsx")
    df1['newrepoPath'] = df1['repoPath'].map('/srv/git/{}'.format)
    df1 = pd.merge(df1,df2, left_on='newrepoPath',right_on='repoPath', how='left').drop(['newrepoPath'], axis=1)
    

    Saving as excel

    df1.to_excel(r'FinalResult.xlsx', index = False)