Tags: python, python-2.7, wikipedia, wikipedia-api, non-english

Removing Non-English Subheadings and Paragraphs


Hi, I have a script that removes subheadings and the paragraphs under known headings, but I am not able to remove paragraphs that have non-English subheadings and words.

For example, this is the original text:

=== Personal finance ===
Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

=== Corporate finance ===
Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

== External links ==
Business acronyms and abbreviations
Business acronyms

== Kūrybinės Industrijos ==
Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu. 

The result I get from my code is:

Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

Kūrybinės industrijos apima sritį ekonominių veiksnių, susitelkusių ties žinių ir informacijos generavimu arba tyrimu.

This is what I hope to achieve (desired result):

Protection against unforeseen personal events, as well as events in the wider economies
Transference of family wealth across generations (bequests and inheritance)

Corporate finance deals with the sources of funding and the capital structure of corporations and the actions that managers take to increase the value of the firm to the shareholders.

The script is as follows:

import re

f1 = open('asd.text', 'r') # read the file that contains the original text
f2 = open('NoRef.text', 'w') # write to a new file

# matches wiki-style section titles such as "== Title ==" or "=== Title ==="
section_title_re = re.compile(r"^=+\s+.*\s+=+$")

content = []
skip = False
for l in f1.read().splitlines():
    line = l.strip()

    # start skipping when the known unwanted section begins
    if "== external links ==" in line.lower():
        skip = True
        continue

    # any other section title ends the skipping and is itself dropped
    if section_title_re.match(line):
        skip = False
        continue
    if skip:
        continue
    content.append(line)

content = '\n'.join(content) + '\n'
f2.write(content+"\n")
f1.close()
f2.close()

Problem: So far my code is able to remove paragraphs under subheadings with known names, like "External links".

But how do I remove the subheadings and paragraphs that are non-English?

Thank you.


Solution

  • If you only want to detect whether a string contains non-English characters, that's easy: just try to decode it as ASCII. If that fails, the string contains at least one character with a code above 127:

    try:
        utxt = txt.decode('ascii')
    except UnicodeError:
        # txt contains non-"English" (non-ASCII) characters
        ...
    

    If you want to detect whether it contains non-English words, that is a much more complex question, and you should decide whether you want to accept badly written English words, such as "englich woerds badli writed". Good luck if you want to go that way...
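
    As a rough illustration of how the ASCII check could be combined with the loop from the question, here is a minimal sketch. The names is_ascii and strip_sections and the unwanted_titles parameter are made up for this example and are not part of the original script. It assumes that a non-English section can be recognised by a non-ASCII character in its heading. A foreign heading written only with ASCII letters would slip through, and non-ASCII text under an English heading would be kept.

```python
import re

# Captures the title text between the "=" markers of a wiki-style heading.
SECTION_TITLE_RE = re.compile(r"^=+\s+(.*?)\s+=+$")


def is_ascii(text):
    """Return True if text contains only ASCII characters."""
    try:
        # Raises UnicodeEncodeError for text strings and, on Python 2
        # byte strings, UnicodeDecodeError; both subclass UnicodeError.
        text.encode("ascii")
    except UnicodeError:
        return False
    return True


def strip_sections(text, unwanted_titles=("external links",)):
    """Drop all headings, plus the body of unwanted or non-ASCII sections."""
    kept = []
    skip = False
    for raw in text.splitlines():
        line = raw.strip()
        match = SECTION_TITLE_RE.match(line)
        if match:
            title = match.group(1)
            # Skip the section body if the title is a known unwanted one
            # or contains any non-ASCII character.
            skip = title.lower() in unwanted_titles or not is_ascii(title)
            continue  # headings themselves are never written out
        if skip:
            continue
        kept.append(line)
    return "\n".join(kept)
```

    With the sample text from the question, the "External links" section and the Lithuanian section would both be dropped, because the first title is in unwanted_titles and the second contains characters outside ASCII. The file handling from the original script can stay as it is: read the file, pass its contents to strip_sections, and write the returned string.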