
Python - getting src from a table cell


I have the table below, and I want to export each cell's text or its image src to a *.csv file.

<table class="GridView plm-table" id="pageLayout_projectTeamMembersGridView_gridView">

<tbody>

<tr id="pageLayout_projectTeamMembersGridView_gridView_headerRow" class="GridViewHeaderRow">
<th class="GridViewHeader" scope="col">A</th>
<th class="GridViewHeader" scope="col">B</th>
<th class="GridViewHeader" scope="col">C</th>
<th class="GridViewHeader" scope="col">D</th>
<th class="GridViewHeader" scope="col">E</th>
<th class="GridViewHeader" scope="col">F</th>
<th class="GridViewHeader" scope="col">G</th>
</tr>

<tr id="pageLayout_projectTeamMembersGridView_DataRow0" class="GridViewRow">
  <td class="GridViewCell" align="right"><input type="checkbox" name="ss" value="zz"></td>
  <td class="GridViewCell"><img class="Icon" src="../../Images/1.png" style="border-width:0px;"></td>
  <td class="GridViewCell">John</td>
  <td class="GridViewCell"><img id="Image0_IDcon" src="../../Images/0.png"></td>
  <td class="GridViewCell"><img id="Image1_IDcon" src="../../Images/1.png"></td>
  <td class="GridViewCell"><img id="Image1_IDcon" src="../../Images/1.png"></td>
  <td class="GridViewCell"><img id="Image0_IDcon" src="../../Images/0.png"></td>
</tr>
<tr id="pageLayout_projectTeamMembersGridView_DataRow1" class="GridViewRow">
  <td class="GridViewCell" align="right"><input type="checkbox" name="ss" value="zz"></td>
  <td class="GridViewCell"><img class="Icon" src="../../Images/1.png" style="border-width:0px;"></td>
  <td class="GridViewCell">Steve</td>
  <td class="GridViewCell"><img id="Image1_IDcon" src="../../Images/1.png"></td>
  <td class="GridViewCell"><img id="Image1_IDcon" src="../../Images/1.png"></td>
  <td class="GridViewCell"><img id="Image0_IDcon" src="../../Images/0.png"></td>
  <td class="GridViewCell"><img id="Image0_IDcon" src="../../Images/0.png"></td>
</tr>
<tr id="pageLayout_projectTeamMembersGridView_DataRow2" class="GridViewRow">
  <td class="GridViewCell" align="right"><input type="checkbox" name="ss" value="zz"></td>
  <td class="GridViewCell"><img class="Icon" src="../../Images/1.png" style="border-width:0px;"></td>
  <td class="GridViewCell">Mary</td>
  <td class="GridViewCell"><img id="Image0_IDcon" src="../../Images/0.png"></td>
  <td class="GridViewCell"><img id="Image1_IDcon" src="../../Images/1.png"></td>
  <td class="GridViewCell"><img id="Image1_IDcon" src="../../Images/1.png"></td>
  <td class="GridViewCell"><img id="Image0_IDcon" src="../../Images/0.png"></td>
</tr>
</tbody>
</table>

What I have done so far is:

table1 = soup.find('table', id='pageLayout_projectTeamMembersGridView_gridView')

headers = []
for i in table1.find_all('th'):
    title = i.text.strip()
    headers.append(title)

df = pd.DataFrame(columns = headers)

for row in table1.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data

df.to_csv('Export.csv', index=False)
print("CSV created!")

I'm getting the text value in the third column (C), but how can I get the src file name, such as "0.png" or "1.png", in the image columns (B, D, E, F and G)?


Solution

  • The problem in the following code

    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data
    

    is that a td can contain text, an img, or some other element, and the code never checks which. For a cell that holds only an img, td.text is an empty string.

    You can check for an img first, for example:

    for row in table1.find_all('tr')[1:]:
        data = row.find_all('td')
        row_data = []
        for td in data:
            img = td.find('img')
            if img is not None:
                # Keep only the file name, e.g. "../../Images/1.png" -> "1.png"
                row_data.append(img.get('src', '').split('/')[-1])
            else:
                row_data.append(td.text.strip())
        length = len(df)
        df.loc[length] = row_data
    

    This will output

    A,B,C,D,E,F,G
    ,1.png,John,0.png,1.png,1.png,0.png
    ,1.png,Steve,1.png,1.png,0.png,0.png
    ,1.png,Mary,0.png,1.png,1.png,0.png
    
    

    Column A is empty, as expected, because its cells contain only an input element (the checkbox). You can handle that case in the same way if you need it.
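
For the input case, here is a minimal sketch, assuming the checkbox's value attribute is what you want in column A (the question does not say). The helper name cell_value is made up for illustration, and the HTML is a shortened copy of the question's table. It also collects the rows in a plain list and builds the DataFrame once, rather than growing it with df.loc on every row.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the question's table (two data rows, four columns).
HTML = """
<table id="pageLayout_projectTeamMembersGridView_gridView">
<tbody>
<tr><th>A</th><th>B</th><th>C</th><th>D</th></tr>
<tr>
  <td><input type="checkbox" name="ss" value="zz"></td>
  <td><img class="Icon" src="../../Images/1.png"></td>
  <td>John</td>
  <td><img src="../../Images/0.png"></td>
</tr>
<tr>
  <td><input type="checkbox" name="ss" value="zz"></td>
  <td><img class="Icon" src="../../Images/1.png"></td>
  <td>Steve</td>
  <td><img src="../../Images/1.png"></td>
</tr>
</tbody>
</table>
"""


def cell_value(td):
    """Return the image file name, the input's value, or the cell text.

    Hypothetical helper; using the input's value is an assumption.
    """
    img = td.find("img")
    if img is not None and img.get("src"):
        # "../../Images/1.png" -> "1.png"
        return img["src"].split("/")[-1]
    field = td.find("input")
    if field is not None:
        return field.get("value", "")
    return td.get_text(strip=True)


soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", id="pageLayout_projectTeamMembersGridView_gridView")

headers = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [cell_value(td) for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]  # skip the header row
]

# Build the DataFrame in one step from the collected rows.
df = pd.DataFrame(rows, columns=headers)

# With no path, to_csv returns the CSV as a string;
# pass a file name such as 'Export.csv' to write the file instead.
csv_text = df.to_csv(index=False)
print(csv_text)
```

If you would rather record whether the box is ticked, you could test for the checked attribute in the input branch instead of reading value.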