I am attempting to merge two cells together. The reason for this is due to the fact that every unit under 'Chassis' should be an alphanumeric (ABCD123456) however the PO provided occasionally shifts the last number to the next row (no other data on said row) making the data look like this Example I initially tried to create a statement that looked at the cell, confirmed it was less one number, then would look at the next cell, and merge the two. Never got that to even come close to manifesting any results. I then decided to replicate the data frame, shift the second data frame(so the missing number is on the same row), and merge them together. This is where I am now. Error Msg This is my first real bit of code in Python so I am fairly certain I am doing inefficient things so by all means let me know where I can improve.
Currently I have this...
Col1 | Chassis | Other Columns... | Other Columns 2... |
---|---|---|---|
Nan | ABCD12345 | ABC | 123 |
Nan | 6 | Nan | Nan |
Nan | WXYZ987654 | GHI | 456 |
Nan | QRSTU654987 | Nan | 789 |
Nan | MNOP999999 | XYZ | Nan |
End Goal is this...
Col1 | Chassis | Other Columns... | Other Columns 2... |
---|---|---|---|
Nan | ABCD123456 | ABC | 123 |
Nan | WXYZ987654 | GHI | 456 |
Nan | QRSTU654987 | Nan | 789 |
Nan | MNOP999999 | XYZ | Nan |
import PyPDF2 as pdf2
import tabula as tb
import pandas as pd
import re
import csv
import os
os.listdir()
pd.set_option('display.max_columns', None)
#bring in pdf, remove first page, convert to csv
PO = 'PO.pdf'
pages = open(PO, 'rb')
readPDF = pdf2.PdfFileReader(pages)
totalpages = readPDF.numPages
x = '2-' + str(totalpages)
POCSV = tb.convert_into(PO, 'POCSV.csv', output_format = 'csv', pages = x)
#Convert column to string, create second data frame, shift said data frame up 1
df = pd.read_csv('POCSV.csv')
df['Chassis'] = df['Chassis'].astype(str)
dfshift = df.shift(-1)
dfshift.rename(columns=({'Chassis': 'Chassis Shifted'}), inplace = True,)
dfMerged = pd.concat([df, dfshift], axis=1)
#For each row combine rows, create new column
for ind, row in df.iterrows():
dfMerged.loc[ind, 'Complete Chassis'] = row['Chassis'] + row["Chassis Shifted"]
print(dfMerged['Complete Chassis'])
Create a virtual group and merge rows of this group for Chassis
column:
# Convert 'NaN' string to pd.NA
df = df.replace('Nan', pd.NA)
cols = df.columns.difference(['Chassis'])
m = df[cols].any(1)
df = df.assign(Chassis=df.groupby(m.cumsum())['Chassis'] \
.transform('sum')).loc[m].reset_index(drop=True)
print(df)
# Output
Col1 Chassis Other Columns Other Columns 2
0 <NA> ABCD123456 ABC 123
1 <NA> WXYZ987654 GHI 456
2 <NA> QRSTU654987 <NA> 789
3 <NA> MNOP999999 XYZ <NA>
Setup:
import pandas as pd
import numpy as np
data = {'Col1': ['Nan', 'Nan', 'Nan', 'Nan', 'Nan'],
'Chassis': ['ABCD12345', '6', 'WXYZ987654', 'QRSTU654987', 'MNOP999999'],
'Other Columns': ['ABC', 'Nan', 'GHI', 'Nan', 'XYZ'],
'Other Columns 2': ['123', 'Nan', '456', '789', 'Nan']}
df = pd.DataFrame(data)
print(df)
# Output
Col1 Chassis Other Columns Other Columns 2
0 Nan ABCD12345 ABC 123
1 Nan 6 Nan Nan
2 Nan WXYZ987654 GHI 456
3 Nan QRSTU654987 Nan 789
4 Nan MNOP999999 XYZ Nan