I am working on a data-cleaning problem where I need to remove HTML tags from strings while keeping the text content.
An example string is given below. I tried removing the "pre" tags, but the result is an empty string.
x = '<pre>i am </pre><p> siddharth </p><pre> sid </pre>'
re.sub(r'<pre>.*<\/pre>', '', x)
If I put back the "\n" that I had deleted earlier, I do get output, as shown below:
x = '<pre>i am </pre>\n<p> siddharth </p><pre> sid </pre>'
re.sub(r'<pre>.*<\/pre>', '', x)
output - '\n<p> siddharth </p>'
A string from the dataset is given below for reference:
'<p>I\'ve written a database generation script in <a href="http://en.wikipedia.org/wiki/SQL">SQL</a> and want to execute it in my <a href="http://en.wikipedia.org/wiki/Adobe_Integrated_Runtime">Adobe AIR</a> application: </p> <pre> <code> Create Table tRole ( roleID integer Primary Key ,roleName varchar(40));Create Table tFile ( fileID integer Primary Key ,fileName varchar(50) ,fileDescription varchar(500) ,thumbnailID integer ,fileFormatID integer ,categoryID integer ,isFavorite boolean ,dateAdded date ,globalAccessCount integer ,lastAccessTime date ,downloadComplete boolean ,isNew boolean ,isSpotlight boolean ,duration varchar(30));Create Table tCategory ( categoryID integer Primary Key ,categoryName varchar(50) ,parent_categoryID integer);... </code> </pre> <p> I execute this in Adobe AIR using the following methods: </p> <pre> <code> public static function RunSqlFromFile(fileName:String):void { var file:File = File.applicationDirectory.resolvePath(fileName); var stream:FileStream = new FileStream(); stream.open(file, FileMode.READ) var strSql:String = stream.readUTFBytes(stream.bytesAvailable); NonQuery(strSql);}public static function NonQuery(strSQL:String):void{ var sqlConnection:SQLConnection = new SQLConnection(); sqlConnection.open(File.applicationStorageDirectory.resolvePath(DBPATH); var sqlStatement:SQLStatement = new SQLStatement(); sqlStatement.text = strSQL; sqlStatement.sqlConnection = sqlConnection; try { sqlStatement.execute(); } catch (error:SQLError) { Alert.show(error.toString()); }} </code> </pre> <p> No errors are generated, however only <code>tRole</code> exists. It seems that it only looks at the first query (up to the semicolon- if I remove it, the query fails). Is there a way to call multiple queries in one statement?</p>'
The full cleanup code is given below. The array "arr" contains all the strings that need to be cleaned.
arr = [i.replace('\n','') for i in arr]
arr = [re.sub(r'<pre>.*<\/pre>', '', i) for i in arr]
arr = [re.sub(r'<code>.*<\/code>', '', i) for i in arr]
arr = [re.sub('<[^<]+?>', '', i) for i in arr]
Please let me know if anyone has run into the same issue and found a way around it.
Since the question is tagged BeautifulSoup: to remove a specific tag while keeping its content, you can use .unwrap():
from bs4 import BeautifulSoup
html = '''<pre>i am </pre><p> siddharth </p><pre> sid </pre>'''
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('pre'):
    # Drop the <pre> tag itself but leave its children in place
    tag.unwrap()
print(soup)
i am <p> siddharth </p> sid
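The same idea extends to the strings in the question, where `pre` and `code` are nested. Below is a minimal sketch, assuming a helper named `unwrap_tags` (a hypothetical name, not part of BeautifulSoup) and that `html.parser` copes with the data:

```python
from bs4 import BeautifulSoup

def unwrap_tags(html, names=('pre', 'code')):
    """Remove the named tags but keep whatever they contain."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all accepts a list of tag names and returns every match
    for tag in soup.find_all(list(names)):
        tag.unwrap()
    return str(soup)

print(unwrap_tags('<pre><code>x = 1</code></pre><p>done</p>'))
# x = 1<p>done</p>
```

For a list of strings such as `arr`, this could be applied as `[unwrap_tags(s) for s in arr]`.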
Or, to extract only the text, use .get_text():
from bs4 import BeautifulSoup
html = '''<pre>i am </pre>\n<p> siddharth </p><pre> sid </pre>'''
soup = BeautifulSoup(html, 'html.parser')
soup.get_text(' ', strip=True)
i am siddharth sid
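As for why the regex in the question returned an empty string: `.*` is greedy, so on a single-line string it runs from the first `<pre>` to the last `</pre>` and removes everything in between, including the `<p>` block. With the `\n` left in, `.` stops at the newline, so each `<pre>` block is matched separately. If a regex is still preferred, a non-greedy `.*?` with `re.DOTALL` is one possible fix. This is only a sketch: it deletes the content of `pre` as well, and it is not robust against nested or malformed HTML.

```python
import re

x = '<pre>i am </pre><p> siddharth </p><pre> sid </pre>'

# Greedy: matches from the first <pre> to the last </pre>
greedy = re.sub(r'<pre>.*</pre>', '', x)

# Non-greedy: each <pre>...</pre> block is matched on its own.
# re.DOTALL lets '.' match newlines, so multi-line blocks are covered too.
lazy = re.sub(r'<pre>.*?</pre>', '', x, flags=re.DOTALL)

print(repr(greedy))  # ''
print(repr(lazy))    # '<p> siddharth </p>'
```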