I have an extremely long HTML file that I cannot modify but would like to parse into CSV output. Imagine the following markup repeated hundreds of times, all on one line: the file is fully minified, with no line breaks or extra spaces, and I have no control over how it is created. I have only added breaks below to make it easier to visualize, so any actual solution cannot rely on line breaks or whitespace, since they will not exist in reality.
<tr id="link">
<td><a href="https://www.somewebsite.com" target="_target">Title</a></td>
<td>Value 1</td><td style="width:20ch">Value 2</td>
<td></td><td></td><td>Value 3</td>
<td>Value 4</td><td>Value 5</td><td>Value 6</td>
<td>Value 7</td><td>Value 8</td><td>Value 9</td></tr>
My desired output from this is https://www.somewebsite.com, Title, Value 1, Value 2, , , Value 3, ...
(etc.)
Basically, I want to turn the tag contents into comma-separated values while retaining the URL. I cannot find a way to parse something like this in Python: functions like str.find() and re.search() do not seem to track a position in the file globally the way a file pointer does in languages like C, so no matter what I try I end up matching from the beginning of the line again.
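Note that an HTML parser consumes its input as a stream of tag events, so line breaks are irrelevant to it and you never manage a file position yourself. As a minimal stdlib-only sketch (no third-party packages; the class name and the shortened sample string are mine for illustration), html.parser can extract the URL and cell values from a fully minified single-line row:

```python
from html.parser import HTMLParser

class RowExtractor(HTMLParser):
    """Collect one list of column values per <tr> in a minified HTML stream."""
    def __init__(self):
        super().__init__()
        self.rows = []   # finished rows
        self.row = None  # columns of the <tr> currently open, if any
        self.cell = None # text of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row = []
        elif tag == 'td' and self.row is not None:
            self.cell = ''
        elif tag == 'a' and self.cell is not None:
            # The URL becomes its own column, ahead of the link text.
            self.row.append(dict(attrs).get('href', ''))

    def handle_data(self, data):
        if self.cell is not None:
            self.cell += data

    def handle_endtag(self, tag):
        if tag == 'td' and self.row is not None:
            self.row.append(self.cell)  # empty <td></td> yields ''
            self.cell = None
        elif tag == 'tr' and self.row is not None:
            self.rows.append(self.row)
            self.row = None

# Shortened single-line sample in the same shape as the real data.
one_line = ('<tr id="link"><td><a href="https://www.somewebsite.com" '
            'target="_target">Title</a></td><td>Value 1</td><td></td>'
            '<td>Value 3</td></tr>')
p = RowExtractor()
p.feed(one_line)
print(', '.join(p.rows[0]))
# https://www.somewebsite.com, Title, Value 1, , Value 3
```

Because feed() can be called repeatedly, the same class also works if you read a huge file in chunks instead of loading it all at once.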
from bs4 import BeautifulSoup

html_doc = """
<tr id="link">
<td><a href="https://www.somewebsite.com" target="_target">Title</a></td>
<td>Value 1</td><td style="width:20ch">Value 2</td>
<td></td><td></td><td>Value 3</td>
<td>Value 4</td><td>Value 5</td><td>Value 6</td>
<td>Value 7</td><td>Value 8</td><td>Value 9</td></tr>"""

for tr in BeautifulSoup(html_doc, 'html.parser').find_all('tr'):
    row = []
    for td in tr.find_all('td'):
        anchor = td.find('a')
        # A cell containing a link contributes two columns (URL, then link
        # text); an empty <td> contributes one empty string.
        row.extend([anchor['href'], anchor.text] if anchor else [td.text])
    print(', '.join(row))
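One caveat: ', '.join(row) only produces valid CSV while no cell value itself contains a comma. If that can happen, the csv module quotes such fields automatically. A minimal sketch, assuming a row list shaped like the one the loop above builds (the 'Value 1, extra' cell is a made-up example of a value containing a comma):

```python
import csv
import io

# Example row as produced by the BeautifulSoup loop above; empty <td>
# cells are empty strings, which csv.writer emits as empty fields.
row = ['https://www.somewebsite.com', 'Title', 'Value 1, extra',
       '', '', 'Value 3']

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(row)  # in the loop: replace print(', '.join(row)) with this
print(buf.getvalue().rstrip('\r\n'))
# https://www.somewebsite.com,Title,"Value 1, extra",,,Value 3
```

To write a real file instead of a string buffer, pass open('out.csv', 'w', newline='') to csv.writer; newline='' is what the csv docs recommend so the writer controls line endings itself.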