Short question:
What is the most efficient way to check whether a .TXT file contains only characters defined in a selected ISO specification?
Question with full context: In the German energy market, EDIFACT is used to exchange information automatically. Each file exchanged has a header segment which contains information about the contents of the file.
Please find an example of this segment below.
UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++
As you can see, directly after the UNB+ we find the value UNOC. This tells us which character set is used in the file. In this case it is ISO/IEC 8859-1.
I would like a python method which checks whether the EDIFACT file contains only characters specified in ISO/IEC 8859-1.
The simplest solution I can think of is something like this (pseudocode):
ISO_string = "All characters contained in ISO/IEC 8859-1"
EDIFACT_string = "Contents of EDIFACT file"
is_iso_char = FALSE
For EDIFACT_char in EDIFACT_string:
    For ISO_char in ISO_string:
        if EDIFACT_char == ISO_char:
            is_iso_char = TRUE
            break
    if is_iso_char == FALSE:
        raise_error("File contains char not contained in ISO/IEC 8859-1 and needs to be rejected")
        do_error_handling()
    is_iso_char = FALSE
I studied business informatics and lack the theoretical background in algorithm theory. This nested loop feels very inefficient, since it compares every character in the file against every allowed character, and because EDIFACT needs to be processed quickly I don't want this functionality to become a bottleneck.
Is there a built-in Python way to do this more efficiently?
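For reference, here is a sketch of the loop above with the allowed alphabet precomputed as a set (the names are my own). This keeps the same structure as the pseudocode but replaces the inner loop with an O(1) average-time membership test, so it is a single pass over the file contents:

```python
# latin_1 maps the byte values 0-255 one-to-one to U+0000..U+00FF,
# so these 256 characters are exactly the ISO/IEC 8859-1 repertoire.
LATIN_1_CHARS = frozenset(chr(i) for i in range(256))

def check_latin_1(edifact_string):
    """Raise if edifact_string contains a char outside ISO/IEC 8859-1."""
    for ch in edifact_string:
        if ch not in LATIN_1_CHARS:  # set lookup, O(1) on average
            raise ValueError(
                "File contains char not contained in ISO/IEC 8859-1 "
                "and needs to be rejected"
            )

check_latin_1("UNB+UNOC:3+9903323000007:500")  # passes silently
```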
Update #1:
I wrote this code as suggested by Barmar. To check it, I added the Chinese characters for "World" (世界) to the file. I expected .decode to throw an error. However, it just decodes the byte string and produces some strange characters at the beginning.
File Contents: 世界UNB+UNOC:3+9903323000007:500+9900080000007:500+250102:0900+Y48A42R58CRR43++++++
with open(Filename, "rb") as edifact_file:
    edifact_bytes = edifact_file.read()

try:
    verified_edifact_string = edifact_bytes.decode(encoding='latin_1', errors='strict')
    print(verified_edifact_string)
except UnicodeDecodeError:
    print("String does not conform to ISO specification")
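The reason errors='strict' never fires here: latin_1 assigns a character to every one of the 256 possible byte values, so any byte string whatsoever decodes successfully. The UTF-8 bytes of the Chinese characters simply come out as mojibake instead of raising UnicodeDecodeError. A small demonstration:

```python
data = "世界".encode("utf-8")      # six bytes: e4 b8 96 e7 95 8c
decoded = data.decode("latin_1")   # never raises: every byte value is valid latin_1
print(len(data), len(decoded))     # 6 6 -- one character per byte, as mojibake
```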
Prints: (I can't paste the exact output here; if I just copy it, Stack Overflow cuts away some of the characters.)
Edit #2:
According to the Python documentation, the ISO/IEC 8859-1 specification is called latin_1 when using Python's .decode() and .encode() methods.
Credits to Barmar for suggesting the use of .decode().
I found a solution which looks smooth to me.
If I encode the string using the latin_1 encoding with the 'ignore' error handler, characters that don't belong to latin_1 (like the Chinese ones) are simply dropped from the resulting bytes. If I then convert the encoded bytes back using .decode(), I get the original string minus those characters. Comparing the original string with the round-tripped one therefore answers my question of whether the file contained any characters that don't belong to latin_1.
with open(Filename, "r", encoding="utf-8") as edifact_file:
    edifact_string = edifact_file.read()

encoded_edifact_string = edifact_string.encode('latin_1', 'ignore')
round_tripped_string = encoded_edifact_string.decode('latin_1', 'ignore')

if round_tripped_string == edifact_string:
    print('Is latin_1')
else:
    print('Is no latin_1')
print(edifact_string)
print(round_tripped_string)
The next question is whether looping over the string and comparing each character is faster or slower than encoding, decoding, and comparing afterwards. But I can check that myself.
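A sketch of a third candidate for that benchmark (the function name is my own): let .encode() with its default strict error handler do the check and catch the exception. This skips building the round-tripped string entirely and stops at the first offending character:

```python
def is_latin_1(text):
    """True if every character of text exists in ISO/IEC 8859-1."""
    try:
        text.encode("latin_1")  # 'strict' is the default error handler
        return True
    except UnicodeEncodeError:
        return False

print(is_latin_1("UNB+UNOC:3+9903323000007:500"))  # True
print(is_latin_1("世界UNB+UNOC"))                   # False
```

Like the round-trip version, the encoding loop runs in C rather than Python bytecode, so both should beat the character-by-character Python loop.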