
Check if string only contains characters from a certain ISO specification


Short question: What is the most efficient way to check whether a .TXT file contains only characters defined in a selected ISO specification?

Question with full context: In the German energy market, EDIFACT is used to exchange information automatically. Each exchanged file has a header segment which describes the contents of the file.

Please find an example of this segment below.

UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++

As you can see, after UNB+ we find the value UNOC. This tells us which character set is used in the file. In this case it is ISO/IEC 8859-1.
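For reference, extracting that syntax identifier can be sketched like this, assuming the default UN/EDIFACT separators ('+' between data elements, ':' between components); the function name is mine, not part of any library:

```python
def syntax_identifier(unb_segment):
    """Return the syntax identifier (e.g. 'UNOC') from a UNB segment."""
    # Data elements are separated by '+', components within them by ':'
    elements = unb_segment.split("+")
    return elements[1].split(":")[0]

seg = "UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++"
print(syntax_identifier(seg))  # → UNOC
```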

I would like a python method which checks whether the EDIFACT file contains only characters specified in ISO/IEC 8859-1.

The simplest solution I can think of is something like this (pseudocode):

ISO_string = "All characters contained in ISO/IEC 8859-1"
EDIFACT_string = "Contents of EDIFACT file"

for edifact_char in EDIFACT_string:
    is_iso_char = False
    for iso_char in ISO_string:
        if edifact_char == iso_char:
            is_iso_char = True
            break
    if not is_iso_char:
        raise_error("File contains char not contained in ISO/IEC 8859-1 and needs to be rejected")
        do_error_handling()
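For scale: the nested loop above scans all of ISO_string for every input character, i.e. O(len(EDIFACT_string) × len(ISO_string)). A set brings each membership test down to O(1) on average. A minimal sketch, relying on the fact that Python's latin_1 codec covers exactly the code points 0–255:

```python
# Build the set of all code points the latin_1 codec can represent (0-255)
ISO_CHARS = {chr(i) for i in range(256)}

def all_latin1(text):
    """Return True if every character of text is in ISO/IEC 8859-1."""
    return all(ch in ISO_CHARS for ch in text)

print(all_latin1("UNB+UNOC:3"))  # → True
print(all_latin1("世界"))         # → False
```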

I studied business informatics and lack the theoretical background in algorithm theory. This feels like a very inefficient method, and since EDIFACT files need to be processed quickly I don't want this functionality to become a bottleneck.

Is there a built-in Python way to achieve this more efficiently?

Update #1:

I wrote this code as suggested by Barmar. To check it, I added the Chinese characters for "world" (世界) to the file. I expected .decode to throw an error. However, it just decodes the byte string and adds some strange characters at the beginning.

File Contents: 世界UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++

with open(Filename, "rb") as edifact_file:
    edifact_bytes = edifact_file.read()

try:
    verified_edifact_string = edifact_bytes.decode(encoding='latin_1', errors='strict')
except UnicodeDecodeError:
    print("String does not conform to ISO specification")
else:
    print(verified_edifact_string)

Prints: [screenshot of the decoded output] If I just copy the text instead, Stack Overflow cuts away some of the characters.
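The reason .decode never raises here is that latin_1 assigns a character to every single byte value 0–255, so errors='strict' has nothing to reject: the UTF-8 bytes of 世界 are simply mapped byte for byte, which is what produces the strange characters at the start. A small demonstration:

```python
# "世界" encodes to six UTF-8 bytes; latin_1 decodes all of them without
# complaint, because every byte value 0-255 is a valid latin_1 character
utf8_bytes = "世界".encode("utf-8")
mojibake = utf8_bytes.decode("latin_1", errors="strict")  # no exception raised
print(len(utf8_bytes), len(mojibake))  # → 6 6
```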

Update #2: According to the Python documentation, the ISO/IEC 8859-1 character set is called latin_1 when using Python's .decode() and .encode() methods.
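This can be checked with the standard codecs module, which resolves both spellings to the same codec:

```python
import codecs

# 'latin_1' and 'ISO-8859-1' are aliases for the same underlying codec
print(codecs.lookup("latin_1").name == codecs.lookup("ISO-8859-1").name)  # → True
```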


Solution

  • Credits to Barmar for suggesting the use of .decode()

    I found a solution which looks smooth to me.

    If I encode the string using the latin_1 codec with the 'ignore' error handler, characters that don't belong to latin_1 (such as the Chinese characters) are dropped from the byte string. If I then convert the bytes back using .decode(), I get a string without the Chinese characters. Comparing the original string with this round-tripped string therefore answers whether the file contained any characters outside latin_1.

    with open(Filename, "r", encoding="utf-8") as edifact_file:
        edifact_string = edifact_file.read()

    # 'ignore' drops every character that has no latin_1 encoding
    round_tripped = edifact_string.encode('latin_1', 'ignore').decode('latin_1')
    if round_tripped == edifact_string:
        print('Is latin_1')
    else:
        print('Is no latin_1')
    print(edifact_string)
    print(round_tripped)
    

    The next question is whether looping over the strings and comparing each character is faster or slower than encoding, decoding and comparing afterwards. But I can check that myself.
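For anyone curious, a rough timeit sketch for that comparison (sample data and iteration counts are arbitrary); it also includes a third variant that encodes with errors='strict' and catches UnicodeEncodeError, which avoids the round-trip entirely:

```python
import timeit

sample = "UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++" * 1000
ISO_CHARS = {chr(i) for i in range(256)}

def loop_check(text):
    # character-by-character membership test against a set
    return all(ch in ISO_CHARS for ch in text)

def roundtrip_check(text):
    # encode with 'ignore', decode back, compare with the original
    return text.encode("latin_1", "ignore").decode("latin_1") == text

def strict_check(text):
    # let the codec do the work and catch the failure directly
    try:
        text.encode("latin_1")
        return True
    except UnicodeEncodeError:
        return False

for fn in (loop_check, roundtrip_check, strict_check):
    print(fn.__name__, timeit.timeit(lambda: fn(sample), number=100))
```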