The problem is that some of the XML files I scraped from the SEC contain newline characters inside a tag. Because of this, those XML files are not well-formed.
<footnote id="F4">Shares sold on the open market are reported as an average sell price per share of $56.87; breakdown of shares sold and per share sale prices are as follows; 100 at $56.31; 200 at $56.32; 100 at $56.33; 198 at $56.39; 600 at $56.40; 100 at $56.41; 102 at $56.42; 600 at $56.44; 320 at $56.45; 100 at $56.46; 900 at $56.47; 480 at $56.48; 300 at $56.49; 1,200 at $56.50; 400 at $56.51; 1,130 at $56.52; 600 at $56.53; 100 at $56.54; 1,500 at $56.55; 600 at $56.56; 644 at $56.57; 1,656 at $56.58; 1,070 at $56.59; 2069 at $56.60; 1,831 at $56.61; 1,000 at $56.62; 1,000 at $56.63; 492 at $56.64; 1,400 at $56.65; 920 at $56.66; 1,000 at $56.67; 600 at $56.68; 500 at $56.69; 1,200 at $56.70; 500 at $56.71; 582 at $56.72; 400 at $56.73; 1,108 at $56.74; 37 at $56.75; 710 at $56.76; 630 at $56.77; 1,600 at $56.78; 400 at $56.79; 400 at $56.80; 1,500 at $56.81; 1,100 at $56.82; 100 at $56.83; 800 at $56.84; 200 at $56.85; 1,300 at $56.87; additional shares sold continued on Footnote (5).</footnot
e>
My first thought was that this was caused by an encoding mismatch between UTF-8 and ISO-8859-1, but the problem remained after changing the encoding. My next attempt was a regex that detects line breaks inside a tag, but since they can occur anywhere, this solution isn't very reliable.
Do you guys have any ideas on how to solve this problem?
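For this txt file, which has the XML part embedded in it, it can be done the following way. The regex assumes the stray line breaks come from hard-wrapping at 1023 characters, so it removes only a newline that directly follows 1023 non-newline characters: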
import re
# open the txt file
with open("0001112679-10-000086.txt", "r", encoding="utf8") as f:
    txt = f.read()
# cut out the xml part from the txt file
start = txt.find("<XML>")
end = txt.find("</XML>") + 6
xml = txt[start:end]
# process the xml part
xml = re.sub(r"([^\n]{1023})\n", r"\1", xml)
# combine a new txt back from the parts
new_txt = txt[:start] + xml + txt[end:]
# save the new txt in file
with open("0001112679-10-000086_output.txt", "w", encoding="utf8") as f:
    f.write(new_txt)
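To check whether the unwrapping actually helped, the same regex can be put in a small function and the result run through a parser. The following is only a sketch under a few assumptions: the helper names `unwrap_hard_breaks` and `is_well_formed` are made up for illustration, the wrap width of 1023 is taken from the regex above, and the input is a synthetic string rather than the real filing.

```python
import re
import xml.etree.ElementTree as ET

WRAP_WIDTH = 1023  # assumption: wrap width taken from the regex above


def unwrap_hard_breaks(text, width=WRAP_WIDTH):
    """Remove each newline that directly follows `width` non-newline characters."""
    return re.sub(r"([^\n]{%d})\n" % width, r"\1", text)


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Synthetic data: size the footnote so the wrap falls inside the closing tag,
# as in the example from the question ("</footnot" + newline + "e>").
open_tag = '<footnote id="F4">'
body = "x" * (WRAP_WIDTH - len(open_tag) - len("</footnot"))
line = open_tag + body + "</footnote>"
wrapped = "<doc>\n" + line[:WRAP_WIDTH] + "\n" + line[WRAP_WIDTH:] + "\n</doc>"

repaired = unwrap_hard_breaks(wrapped)
print(is_well_formed(wrapped), is_well_formed(repaired))
```

Two caveats: a genuine line that happens to be exactly 1023 characters long would be joined with the next one as well, and with the real file the surrounding <XML> and </XML> wrapper lines would probably have to be cut off before handing the text to the parser.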