python performance unicode processing-efficiency

Replace several words in a text with Python

I use the code below to remove all HTML tags from a file and convert it to plain text. I also have to convert XML/HTML character entities to plain characters. The script has 21 lines that each scan the whole text, so converting a huge file costs a lot of resources.

Do you have any ideas for making the code faster while reducing its resource usage?

# -*- coding: utf-8 -*-
import re

# This file contains HTML.
with open('input-file.html', 'r') as infile:
    temp = infile.read()

# Replace some XML/HTML character entities with plain characters.
temp = temp.replace('&lsquo;', "'")
temp = temp.replace('&rsquo;', "'")
temp = temp.replace('&ldquo;', '"')
temp = temp.replace('&rdquo;', '"')
temp = temp.replace('&sbquo;', ',')
temp = temp.replace('&prime;', "'")
temp = temp.replace('&Prime;', '"')
temp = temp.replace('&laquo;', '«')
temp = temp.replace('&raquo;', '»')
temp = temp.replace('&lsaquo;', '‹')
temp = temp.replace('&rsaquo;', '›')
temp = temp.replace('&amp;', '&')
temp = temp.replace('&ndash;', ' – ')
temp = temp.replace('&mdash;', ' — ')
temp = temp.replace('&reg;', '®')
temp = temp.replace('&copy;', '©')
temp = temp.replace('&trade;', '™')
temp = temp.replace('&para;', '¶')
temp = temp.replace('&bull;', '•')
temp = temp.replace('&middot;', '·')

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)
print(result)

# Write the result to a new file.
with open("output-file.txt", "w") as outfile:
    outfile.write(result)

Solution

The problem with string.translate() and string.maketrans() is that they can only map one character to another single character, e.g.

print string.maketrans("abc", "123")

But I need to map an HTML/XML entity such as &lsquo; to the ASCII single quote ('), which means writing something like this:

print string.maketrans("&lsquo;", "'")

This raises the following error:

ValueError: maketrans arguments must have same length
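
The length restriction exists because maketrans() builds a character-to-character table, so a multi-character entity cannot be a key. One way around it that still scans the text only once is a single compiled regular expression with a dictionary lookup. The following is a minimal sketch in Python 3, not the approach adopted below; ENTITIES is a shortened, hypothetical subset of the table from the question, and replace_entities is an illustrative name.

```python
import re

# Hypothetical subset of the entity table from the question.
ENTITIES = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&amp;": "&",
    "&copy;": "©",
}

# One alternation pattern, so the text is scanned once
# instead of once per entity.
ENTITY_RE = re.compile("|".join(re.escape(name) for name in ENTITIES))


def replace_entities(text):
    """Replace every known entity in a single pass over the text."""
    return ENTITY_RE.sub(lambda match: ENTITIES[match.group(0)], text)
```

Because each match is replaced exactly once, input such as &amp;copy; becomes &copy; and is not unescaped a second time, which can happen with a chain of replace() calls.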

HTMLParser, by contrast, converts all HTML/XML entities to their Unicode characters without the above problem. I also added encode('utf-8') to avoid the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 246: ordinal not in range(128)
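
The error appears because the unescaped text contains characters outside the ASCII range, and something downstream tries to encode it with the ascii codec. A small Python 3 sketch of the difference between the two codecs, using the character from the message (the variable names are illustrative):

```python
# U+201C LEFT DOUBLE QUOTATION MARK, the character named in the error.
curly_quote = "\u201c"

try:
    curly_quote.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # The ascii codec only covers code points 0 to 127.
    ascii_ok = False

# UTF-8 can represent every Unicode code point, here as three bytes.
utf8_bytes = curly_quote.encode("utf-8")
```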

# -*- coding: utf-8 -*-
import re
from HTMLParser import HTMLParser

# This file contains HTML.
with open('input-file.txt', 'r') as infile:
    temp = infile.read()

# Replace all XML/HTML character entities with their Unicode characters.
temp = HTMLParser().unescape(temp)

# Replace HTML tags with an empty string.
result = re.sub("<.*?>", "", temp)

# Encode the text as UTF-8 to prevent the UnicodeEncodeError when printing or writing.
result = result.encode('utf-8')
print(result)

# Write the result to a new file.
with open("output-file.txt", "w") as outfile:
    outfile.write(result)
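
In Python 3 the HTMLParser module became html.parser, and the standard library provides html.unescape() for entity conversion. A rough equivalent of the script above, assuming Python 3.4 or later and UTF-8 input, might look like this; html_to_text, convert_file and TAG_RE are hypothetical names. Unlike the script above, this sketch strips tags before unescaping, so that escaped text such as &lt;b&gt; survives as a literal <b> instead of being removed as a tag.

```python
import html
import re

# Compile once; DOTALL lets a tag that spans several lines match as well.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def html_to_text(markup):
    """Strip tags, then convert entities to Unicode characters."""
    return html.unescape(TAG_RE.sub("", markup))


def convert_file(src_path, dst_path):
    """Read HTML from src_path and write plain text to dst_path."""
    # Assumption: both files are UTF-8 encoded.
    with open(src_path, "r", encoding="utf-8") as src:
        text = html_to_text(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

Passing encoding="utf-8" to open() makes the manual encode() step unnecessary, because the file object encodes the text as it is written.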