I am trying to save a large number of JSON records by appending each JSON string as a new line to one large text file. I have limited storage, so I don't want to store the JSON strings as-is for so many records. Instead, I compress each JSON string with the zlib library and append the compressed string as a new line to the big file.
The compression is pretty good; the problem is that the compressed string often contains a line break character "\n", which causes errors during decompression when the file is read line by line. I tried to overcome this by base64-encoding the zlib-compressed string, since base64 output contains no line breaks, but that makes the final string much longer and hence the compression less effective (in fact, for shorter strings the final string after zlib/base64 is longer than the original).
import zlib, base64
item_dict={}
item_dict["a"]="ما هذا الذي قاله اليوم بشأن الأخبارية التي فلتها متعمدا؟"
item_dict["b"]="She’s allowed to not want someone else’s kids in her picture. Y’all are weird for the way youre acting over this. I don’t want any pics of myself with my ex’s children, because they aren’t my children and I’m not in their lives anymore. It’s weird to post pics of someone else’s kids… so asking for them to be removed so I can still enjoy my picture from my holiday isn’t as bad as y’all are making it seem."
item_dict["c"]='''
{"symbol": "A/RES/74/1", "resolution_number": "74/1.", "title": "Scale of assessments for the apportionment of the expenses of the United Nations: requests under Article 19 of the Charter", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/483", "report_paragraph": "6", "committee": "Fifth Committee", "agenda_item": "Agenda item 139", "agenda_item_name": "Scale of assessments for the apportionment of the expenses of the United Nations", "voting_type": "Without a vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE CHAIR OF THE COMMITTEE"], "additional_sponsors": [], "SDGs": [], "subjects": [["Comoros", "UNBIS Thesaurus"], ["Sao Tome And Principe", "UNBIS Thesaurus"], ["Somalia", "UNBIS Thesaurus"]]}
{"symbol": "A/RES/74/2", "resolution_number": "74/2.", "title": "Political declaration of the high-level meeting on universal health coverage", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/L.4", "report_paragraph": "N.A.", "committee": "Without reference to a Main Committee", "agenda_item": "Agenda item 126", "agenda_item_name": "Global health and foreign policy", "voting_type": "Without a vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE PRESIDENT OF THE GENERAL ASSEMBLY"], "additional_sponsors": [], "SDGs": ["3"], "subjects": [["Health Policy", "UNBIS Thesaurus"], ["Public Health", "UNBIS Thesaurus"], ["Health Services", "UNBIS Thesaurus"], ["Health Insurance", "UNBIS Thesaurus"], ["Declarations (Text)", "UNBIS Thesaurus"]]}
{"symbol": "A/RES/74/3", "resolution_number": "74/3.", "title": "Political declaration of the high-level meeting to review progress made in addressing the priorities of small island developing States through the implementation of the SIDS Accelerated Modalities of Action (SAMOA) Pathway", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/L.3", "report_paragraph": "N.A.", "committee": "Without reference to a Main Committee", "agenda_item": "Agenda item 19 (b)", "agenda_item_name": "Sustainable development: follow-up to and implementation of the SIDS Accelerated Modalities of Action (SAMOA) Pathway and the Mauritius Strategy for the Further Implementation of the Programme of Action for the Sustainable Development of Small Island Developing States of the SIDS Accelerated Modalities of Action (SAMOA) Pathway and the Mauritius Strategy for the Further Implementation of the Programme of Action for the Sustainable Development of Small Island Developing States", "voting_type": "Without a vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE PRESIDENT OF THE GENERAL ASSEMBLY"], "additional_sponsors": [], "SDGs": ["16", "17", "3"], "subjects": [["Sustainable Development", "UNBIS Thesaurus"], ["Developing Island Countries", "UNBIS Thesaurus"], ["Development Assistance", "UNBIS Thesaurus"], ["Programme Implementation", "UNBIS Thesaurus"], ["Programme Evaluation", "UNBIS Thesaurus"], ["Declarations (Text)", "UNBIS Thesaurus"]]}
'''
item_dict["d"]='{"url": "http://agribank.ngan-hang.net", "final_url": "http://ww7.ngan-hang.net/?usid=18&utid=23776691570", "lang": "", "title": "", "description": "", "keywords": "", "phone_numbers": [], "links": [], "social_links": [], "emails": [], "addresses": [], "logos": [], "text": "", "last": 41, "n_items": 1}'
for key, val in item_dict.items():
    zlib_compressed = zlib.compress(val.encode())
    base64_compressed = base64.b64encode(zlib_compressed)
    zlib_n_line_breaks = zlib_compressed.count(b'\n')
    base64_line_breaks = base64_compressed.count(b'\n')
    print("original size:", len(val), " | zlib:", len(zlib_compressed), "base64", len(base64_compressed), "| zlib_n_line_breaks", zlib_n_line_breaks, base64_line_breaks)
Result:
original size: 56 | zlib: 84 base64 112 | zlib_n_line_breaks 0 0
original size: 407 | zlib: 254 base64 340 | zlib_n_line_breaks 0 0
original size: 3655 | zlib: 941 base64 1256 | zlib_n_line_breaks 1 0
original size: 303 | zlib: 184 base64 248 | zlib_n_line_breaks 1 0
As a workaround, I created custom compression/decompression functions that replace the line break during compression with an arbitrary marker string (e.g. 00000) and do the opposite during decompression. This reduces the likelihood of decompression errors but does not eliminate it, because the compressed string may happen to contain that arbitrary marker already.
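For illustration, that workaround looks roughly like this sketch (the marker value and function names are my own choices); the failure mode is noted in the comments:

```python
import zlib

MARKER = b"00000"  # arbitrary marker, hypothetical choice for illustration

def compress_line(text: str) -> bytes:
    # Swap raw newlines in the compressed bytes for the marker,
    # so the result can be written as a single line.
    return zlib.compress(text.encode()).replace(b"\n", MARKER)

def decompress_line(line: bytes) -> str:
    # Swap the marker back before decompressing.
    return zlib.decompress(line.replace(MARKER, b"\n")).decode()

s = '{"key": "some example JSON value"}'
assert decompress_line(compress_line(s)) == s
# Unreliable: if the compressed bytes already happen to contain b"00000",
# the reverse substitution corrupts the stream and zlib raises an error.
```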
I'm aware of this question, but its answer isn't satisfactory:
So, the question is this: is there a compression algorithm that can compress a string without producing a line break? Or is there a way to reliably post-process the output of zlib (or any compression algorithm) so that it contains no line breaks and decompression still works?
Thanks to the answer by Booboo, I realized the difference between a line break character and a backslash followed by "n". I tested it, and it now makes sense for the encoding part:
import zlib
line0='{"symbol": "A/RES/74/1", "resolution_number": "74/1.", "title": "Scale of assessments for the apportionment of the expenses of the United Nations: requests under Article 19 of the Charter", "session": "Seventy-fourth session", "adoption_meeting": "14th plenary meeting", "adoption_date": "2019-10-10 00:00:00", "originating_document": "A/74/483", "report_paragraph": "6", "committee": "Fifth Committee", "agenda_item": "Agenda item 139", "agenda_item_name": "Scale of assessments for the apportionment of the expenses of the United Nations", "voting_type": "Without a vote", "MS_in_favour_count": "N.A.", "MS_against_count": "N.A.", "MS_abstaining_count": "N.A.", "pv": "A/74/PV.14", "MS_in_favour": [], "MS_against": [], "MS_abstaining": [], "sponsors": ["SUBMITTED BY THE CHAIR OF THE COMMITTEE"], "additional_sponsors": [], "SDGs": [], "subjects": [["Comoros", "UNBIS Thesaurus"], ["Sao Tome And Principe", "UNBIS Thesaurus"], ["Somalia", "UNBIS Thesaurus"]]} {"url": "http://agroreal911.sk", "final_url": "http://agroreal911.sk/", "lang": "sk-SK", "title": "Agroreal 911 s.r.o.", "description": "", "keywords": "", "phone_numbers": [], "links": [["http://agroreal911.sk/pozemky", "K\u00fapa p\u00f4dy"], ["http://agroreal911.sk/kontakty", "Kontakty"], ["http://www.advertplus.sk", "Advertplus.sk"], ["http://agroreal911.sk/predaj-pody", "Predaj p\u00f4dy"], ["http://agroreal911.sk/o-nas", "O n\u00e1s"], ["http://agroreal911.sk/?lang=en", ""], ["http://transposh.org/sk", ""]], "social_links": [], "emails": ["mgr.michal.hrabovsky@gmail.com"], "addresses": [], "logos": ["http://agroreal911.sk/wp-content/plugins/transposh-translation-filter-for-wordpress/img/tplogo.png"], "text": "Agroreal 911 s.r.o. \nAGRO REAL 911, S.R.O. \nMenu \nO n\u00e1s \nPOZEMKY \nK\u00fapa p\u00f4dy \nPredaj p\u00f4dy \nKontakty \nby \nWebstr\u00e1nku vytvoril Advertplus.sk Kontakt: 0908 692 782 \u00a0\u00a0\u00a0\u00a0\n\n\n \n \n\n\n\n\n\n\n\n mgr.michal.hrabovsky@gmail.com\n\n\n ", "last": 74, "n_items": 2}'
compressed = zlib.compress(line0.encode())
compressed0 = compressed.replace(b"\n", b"\\n")  # naive: does not escape existing backslashes
print("number of line breaks in zlib output:", compressed.count(b"\n"))
test_out_fpath = "test_compress.txt"
with open(test_out_fpath, "wb") as fopen0:
    fopen0.write(compressed0)
with open(test_out_fpath, "rb") as fopen0:
    lines = fopen0.readlines()
print("number of lines after replacing line breaks", len(lines))
Output
number of line breaks in zlib output: 7
number of lines after replacing line breaks 1
I'd still need help with the decompression part, though, if possible.
In the compressed JSON, you should first replace every backslash (b'\\') with two backslashes (b'\\\\') and then replace every newline (b'\n') with a backslash followed by 'n' (b'\\n'). After writing out the compressed data, you write out a newline (b'\n'). You can use the bytes.replace method for these replacements.
You reverse the operation by first doing a readline and then stripping off the trailing b'\n'. You then look at each character one by one: if you see a backslash and the next character is also a backslash, you replace the two backslashes with a single one; otherwise the next character must be 'n' and you replace the two characters with a newline. Note that readline on a binary file returns a bytes instance, and when you iterate through a bytes instance each element is an int in the range 0 through 255. So when you want to look for a backslash, for example, you should compare the element with ord(b'\\'), which converts the backslash character to its corresponding integer value.
Update
Read the docstrings in the following two functions:
def replace_newlines(s: bytes) -> bytes:
    """Given a bytes instance consisting of a JSON string encoded
    to bytes and then compressed, replace backslash characters
    with two backslashes and newline characters with the
    two-character sequence of a backslash followed by 'n'."""
    return s.replace(b'\\', b'\\\\').replace(b'\n', b'\\n')
def restore_newlines(s: bytes) -> bytes:
    """Given a bytes instance that has had its newline
    characters escaped by function replace_newlines, this function
    will restore the newline characters."""
    BACKSLASH = ord(b'\\')
    NEWLINE = ord(b'\n')
    output = []
    idx = 0
    lnth = len(s)
    while idx < lnth:
        if s[idx] != BACKSLASH:
            output.append(s[idx])
            idx += 1
        elif s[idx + 1] == BACKSLASH:
            output.append(BACKSLASH)
            idx += 2
        else:
            # The next character must be b'n':
            output.append(NEWLINE)
            idx += 2
    return bytes(output)
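As a quick sanity check (my own test, with compact re-definitions of both helpers so the snippet runs standalone): escaping arbitrary binary data never leaves a raw newline in the output, and restoring is an exact inverse.

```python
import os

def replace_newlines(s: bytes) -> bytes:
    # Same substitutions as above, in the same order:
    # backslashes first, then newlines.
    return s.replace(b'\\', b'\\\\').replace(b'\n', b'\\n')

def restore_newlines(s: bytes) -> bytes:
    # Compact loop equivalent of the function above.
    BACKSLASH, NEWLINE = ord(b'\\'), ord(b'\n')
    output, idx = [], 0
    while idx < len(s):
        if s[idx] != BACKSLASH:
            output.append(s[idx])
            idx += 1
        else:
            # An escaped backslash is followed by '\\' or 'n'.
            output.append(BACKSLASH if s[idx + 1] == BACKSLASH else NEWLINE)
            idx += 2
    return bytes(output)

for _ in range(100):
    data = os.urandom(64)              # stands in for zlib output
    escaped = replace_newlines(data)
    assert b'\n' not in escaped        # safe to write as a single line
    assert restore_newlines(escaped) == data
```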
An Alternate restore_newlines Implementation
import re
...
def restore_newlines(s: bytes) -> bytes:
    """Given a bytes instance that has had its newline
    characters escaped by function replace_newlines, this function
    will restore the newline characters."""
    # A backslash can only be followed by another backslash or 'n':
    return re.sub(br'\\(.)', lambda m: b'\\' if m[1] == b'\\' else b'\n', s)
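Putting it all together, a full write/read round trip through a line-oriented file might look like this sketch (the file name and sample records are my own; the helpers are repeated so the snippet runs standalone):

```python
import re
import zlib

def replace_newlines(s: bytes) -> bytes:
    # Escape backslashes first, then newlines (as defined above).
    return s.replace(b'\\', b'\\\\').replace(b'\n', b'\\n')

def restore_newlines(s: bytes) -> bytes:
    # Inverse of replace_newlines: after escaping, a backslash can
    # only be followed by another backslash or by 'n'.
    return re.sub(br'\\(.)', lambda m: b'\\' if m[1] == b'\\' else b'\n', s)

records = ['{"a": 1}', '{"b": "text with a \\n escape"}']

# Write: one escaped, compressed record per line.
with open("test_roundtrip.txt", "wb") as f:
    for rec in records:
        f.write(replace_newlines(zlib.compress(rec.encode())) + b"\n")

# Read: strip the line terminator, unescape, decompress.
with open("test_roundtrip.txt", "rb") as f:
    restored = [zlib.decompress(restore_newlines(line[:-1])).decode()
                for line in f]

assert restored == records
```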