python html xml beautifulsoup cdata

How to extract data within a CDATA tag using Python?


I used Beautiful Soup to get the CDATA from an HTML page, but I still have to extract its contents and put them in a CSV file.

This is my code:

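from bs4 import BeautifulSoup
import re
import csv

# Read the page once; calling f.readlines() a second time on the same
# handle returns nothing, so the soup is built once and reused below.
with open('try.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# First text node that contains "CDATA"
cdata = soup.find(string=re.compile("CDATA"))
print(cdata)

# newline="" stops the csv module from writing blank rows on Windows
with open("profiletry.csv", "w", newline="") as out:
    ff = csv.writer(out)
    ff.writerow(["cdata"])
    ff.writerow([cdata])

with open('cdatatxt.txt', 'w') as newfile:
    newfile.write(cdata)

# First text node that contains "string"
c_data = soup.find(string=re.compile("string"))
print(c_data)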

When I run this, the CDATA is printed, but I want to get the data inside it as key-value pairs so that I can store it in a CSV file.


Solution

  • This may help you: find each tag that holds CDATA and strip the CDATA wrapper from its text.

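     import re
     from bs4 import BeautifulSoup

     # `content` is the HTML/XML text to parse, e.g. the contents of try.html
     soup = BeautifulSoup(content, 'html.parser')
     for x in soup.find_all('item'):
         # Remove only the CDATA wrapper; a character class such as
         # '[\[CDATA\]]' would also delete every C, D, A and T in the data.
         # x.string is None when the tag has several children, hence `or ''`.
         print(re.sub(r'<!\[CDATA\[|\]\]>', '', x.string or ''))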
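
The snippet above only removes the CDATA wrapper; the question also asks for key-value pairs in a CSV file. The layout of the CDATA payload is not shown in the question, so the following is only a minimal sketch that assumes one `key=value` pair per line. The helper names `cdata_to_pairs` and `write_pairs` and the sample text are hypothetical, and only the standard library is used so the parsing step can be tried on its own:

```python
import csv
import io
import re

# Matches a CDATA wrapper and captures what is inside it.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_to_pairs(text):
    """Return (key, value) tuples from 'key=value' lines inside CDATA sections.

    The 'key=value' layout is an assumption; adapt the split to the real data.
    """
    # Fall back to the whole text if the parser already removed the wrapper.
    bodies = CDATA_RE.findall(text) or [text]
    pairs = []
    for body in bodies:
        for line in body.splitlines():
            key, sep, value = line.partition("=")
            if sep:  # skip lines that contain no '='
                pairs.append((key.strip(), value.strip()))
    return pairs


def write_pairs(pairs, fileobj):
    """Write a header row followed by one CSV row per (key, value) pair."""
    writer = csv.writer(fileobj)
    writer.writerow(["key", "value"])
    writer.writerows(pairs)


# Made-up sample input, standing in for text taken from the page.
sample = "<item><![CDATA[\nname = Alice\ncity = Paris\n]]></item>"
pairs = cdata_to_pairs(sample)

# StringIO stands in for open("profiletry.csv", "w", newline="").
buffer = io.StringIO()
write_pairs(pairs, buffer)
print(buffer.getvalue())
```

In the real script, the string passed to `cdata_to_pairs` would be the text Beautiful Soup returns for the tag, and `buffer` would be replaced by the opened CSV file. If the payload turns out to be JSON or another format, only the body of `cdata_to_pairs` needs to change.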