Parse the data and add a header row to each group (Python 3, pandas)

I am parsing a .csv file (you can see the example file here). I extract the data from the 2nd and 7th rows, and that part works without problems. This is how I am doing it:

import pandas as pd
import numpy as np

df = pd.read_csv("datas.csv", index_col=0, header=None)
d = {'YSS':'Yahoo!リスティング 12月分 12/1〜12/31',
     'YDNRT':'Yahoo!リマーケティング 12月分 12/1〜12/31',
     'YDN':' Yahoo!ディスプレイネットワーク 12月分 12/1〜12/31',
     'GSN':'Googleリスティング 12月分 12/1〜12/31',
     'GDNRM':'Googleリマーケティング 12月分 12/1〜12/31',
     'GDN':'Googleディスプレイネットワーク 12月分 12/1〜12/31'}

pat = r'({})'.format('|'.join(d.keys()))
df.loc['アカウント名'] = df.loc['アカウント名'].str.extract(pat, expand=False).dropna().map(d)
df.loc['利用額(Fee抜き)'] = df.loc['利用額(Fee抜き)'].astype(str).apply(lambda x: x.split(".")[0])

df1 = df.loc[['アカウント名', '利用額(Fee抜き)']]
df1 = df1.T

df1.columns = ['項目','金額']

df1['数量'] = 1

df1['単位'] = "式"

df1['単価'] = np.nan

wow = df1[['項目','数量','単位','単価', '金額']]

newFile = wow.shift(1)

newFile['項目'] = newFile['項目'].fillna(df.loc['クライアント名'])

newFile.loc[newFile['項目'].str.contains('プレサンス'),['数量','単位','単価', '金額']] = ['','','','']

pos = newFile.index[newFile['項目'].str.contains('プレサンス')]

d = {}
i = 0
for p in pos:
    if p == pos[0]:
        d[p] = newFile.loc[:pos[i+1]-1].append(pd.Series('',newFile.columns), ignore_index=True)
    elif (i + 1) > len(pos) - 1:
        d[p] = newFile.loc[pos[i-1]+1:]
    else:
        d[p] = newFile.loc[p:pos[i+1]-1].append(pd.Series('',newFile.columns), ignore_index=True)
    i = i + 1
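# Keep the stacked result; `p` is only the loop variable (an index label), not a DataFrame
result = pd.concat(d, ignore_index=True)
result.to_csv('newfile.csv', index=False)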

This creates a new .csv file with the new columns; you can see it here: https://i.sstatic.net/PVUjM.jpg But I need to do one more thing.

Row 1 of the original file contains company names. I want to parse those company names and put each one at the head of its group, as in this image: https://i.sstatic.net/kSIeS.jpg I also need to delete the total sums.

I am not sure whether this is possible, though.

Solution

First shift the rows down by 1 to make room for the header row. Then replace the NaN values in the '項目' column by calling fillna with the 'クライアント名' row of the original df. Finally, select the rows containing the string 'プレサンス' and overwrite their remaining column values with a list of empty strings:

In[111]:

newFile = df1.shift(1)
newFile['項目'] = newFile['項目'].fillna(df.loc['クライアント名'])
newFile.loc[newFile['項目'].str.contains('プレサンス'),['数量','単位','単価', '金額']] = ['','','','']
newFile
Out[111]: 
                                     項目      金額 数量 単位   単価
1                        プレサンス ロジェ 和泉中央                   
2      Yahoo!リスティング 12月分 12/1〜12/31 YSS   91188  1  式  NaN
3        Yahoo!リマーケティング 12月分 12/1〜12/31   25649  1  式  NaN
4    Yahoo!ディスプレイネットワーク 12月分 12/1〜12/31   13211  1  式  NaN
5          Googleリスティング 12月分 12/1〜12/31  131742  1  式  NaN
6        Googleリマーケティング 12月分 12/1〜12/31   35479  1  式  NaN
7    Googleディスプレイネットワーク 12月分 12/1〜12/31   18999  1  式  NaN
8                          プレサンス グラン 茨木                   
9      Yahoo!リスティング 12月分 12/1〜12/31 YSS  113373  1  式  NaN
10       Yahoo!リマーケティング 12月分 12/1〜12/31   28775  1  式  NaN
11   Yahoo!ディスプレイネットワーク 12月分 12/1〜12/31   19010  1  式  NaN
12         Googleリスティング 12月分 12/1〜12/31  158389  1  式  NaN
13       Googleリマーケティング 12月分 12/1〜12/31   45530  1  式  NaN
14   Googleディスプレイネットワーク 12月分 12/1〜12/31   23224  1  式  NaN
15                         プレサンス ロジェ 江坂
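
To see why the shift followed by fillna produces the header rows, here is a minimal sketch on made-up data. The names `items` and `clients` and all values are hypothetical and only stand in for df1 and the 'クライアント名' row. It assumes, as in the question, that the total columns end up with NaN in the item column. fillna with a Series aligns on the index, so only the NaN slots receive a client name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN in "item" marks a total column that matched no account code.
items = pd.DataFrame(
    {"item": ["a1", "a2", np.nan, "b1", "b2", np.nan],
     "amount": [10, 20, 30, 5, 6, 11]},
    index=range(1, 7),
)
# Hypothetical client name for each original column, on the same index as `items`.
clients = pd.Series(["A", "A", "A", "B", "B", "B"], index=range(1, 7))

# Shift everything down one row: row 1 becomes NaN and the last row falls off the end.
shifted = items.shift(1)
# fillna aligns on the index, so each NaN slot takes the client name at that position.
shifted["item"] = shifted["item"].fillna(clients)
# Blank out the amounts on the header rows, which also removes the totals.
# This assumes no item name is identical to a client name.
is_header = shifted["item"].isin(clients.unique())
shifted.loc[is_header, "amount"] = np.nan
print(shifted)
```

The sketch uses NaN rather than empty strings for the blanked cells so the amount column stays numeric; writing '' as in the answer works equally well when the result only goes to a CSV file.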

Now, since you want to add padding to make the output more readable, we can store the index locations of the header rows (the rows containing 'プレサンス', where the totals used to be), iterate over them to slice the df, add each slice to a dict, and then call concat to vertically stack the padded slices:

In[112]:

pos = newFile.index[newFile['項目'].str.contains('プレサンス')]
pos
Out[112]: Int64Index([1, 8, 15], dtype='int64')

Now create a dict of the slices, appending an empty row to every slice except the last:

In[115]:

d = {}
i = 0
for p in pos:
    if p == pos[0]:
        d[p] = newFile.loc[:pos[i+1]-1].append(pd.Series('',newFile.columns), ignore_index=True)
    elif (i + 1) > len(pos) - 1:
        d[p] = newFile.loc[pos[i-1]+1:]
    else:
        d[p] = newFile.loc[p:pos[i+1]-1].append(pd.Series('',newFile.columns), ignore_index=True)
    i = i + 1
pd.concat(d, ignore_index=True)
Out[115]: 
                                     項目      金額 数量 単位   単価
0                        プレサンス ロジェ 和泉中央                   
1      Yahoo!リスティング 12月分 12/1〜12/31 YSS   91188  1  式  NaN
2        Yahoo!リマーケティング 12月分 12/1〜12/31   25649  1  式  NaN
3    Yahoo!ディスプレイネットワーク 12月分 12/1〜12/31   13211  1  式  NaN
4          Googleリスティング 12月分 12/1〜12/31  131742  1  式  NaN
5        Googleリマーケティング 12月分 12/1〜12/31   35479  1  式  NaN
6    Googleディスプレイネットワーク 12月分 12/1〜12/31   18999  1  式  NaN
7                                                         
8                          プレサンス グラン 茨木                   
9      Yahoo!リスティング 12月分 12/1〜12/31 YSS  113373  1  式  NaN
10       Yahoo!リマーケティング 12月分 12/1〜12/31   28775  1  式  NaN
11   Yahoo!ディスプレイネットワーク 12月分 12/1〜12/31   19010  1  式  NaN
12         Googleリスティング 12月分 12/1〜12/31  158389  1  式  NaN
13       Googleリマーケティング 12月分 12/1〜12/31   45530  1  式  NaN
14   Googleディスプレイネットワーク 12月分 12/1〜12/31   23224  1  式  NaN
15                                                        
16     Yahoo!リスティング 12月分 12/1〜12/31 YSS  113373  1  式  NaN
17       Yahoo!リマーケティング 12月分 12/1〜12/31   28775  1  式  NaN
18   Yahoo!ディスプレイネットワーク 12月分 12/1〜12/31   19010  1  式  NaN
19         Googleリスティング 12月分 12/1〜12/31  158389  1  式  NaN
20       Googleリマーケティング 12月分 12/1〜12/31   45530  1  式  NaN
21   Googleディスプレイネットワーク 12月分 12/1〜12/31   23224  1  式  NaN
22                         プレサンス ロジェ 江坂
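
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older versions. As an alternative, here is a minimal sketch of the same padding idea built on groupby and pd.concat. The helper name `pad_groups` and the toy data are hypothetical, and the sketch assumes every group starts with a header row that a boolean mask can identify.

```python
import pandas as pd

def pad_groups(frame, is_header):
    """Insert one blank row before every header row except the first."""
    blank = pd.DataFrame([[""] * frame.shape[1]], columns=frame.columns)
    # The running count of header rows gives every row the number of its group.
    group_id = is_header.cumsum()
    pieces = []
    for _, group in frame.groupby(group_id):
        if pieces:                # separate this group from the previous one
            pieces.append(blank)
        pieces.append(group)
    return pd.concat(pieces, ignore_index=True)

# Hypothetical toy data standing in for newFile.
toy = pd.DataFrame({
    "item": ["Client A", "x", "y", "Client B", "z"],
    "amount": ["", "10", "20", "", "30"],
})
padded = pad_groups(toy, toy["item"].str.contains("Client"))
print(padded)
```

With the frame from this answer the call would look like pad_groups(newFile, newFile['項目'].str.contains('プレサンス')). Because the groups come from a cumulative sum rather than from hand-computed slice bounds, each group is emitted exactly once and there is no special case for the first or last header.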