Escaping XML Characters using Python Polars


I'm using Polars to build XML from a table, and I want to escape the XML special characters in the values. However, I'm running into problems when I try to do this. My first attempt was the following:

import polars as pl
from xml.sax.saxutils import escape

table_raw = pl.read_sql("""SELECT * FROM mytable""", engine).lazy()

table = table_raw.select([
    pl.concat_str([
    pl.lit('''<wd:Overall_XML_Tag>''').alias('Overall_XML_header'),

    pl
    .when(pl.col('value') != None).then(pl.format('''<wd:Value_XML_Tag>{}</wd:Value_XML_Tag>''', escape(pl.col('value'))))
    .otherwise(pl.lit(''))
    .alias('value'),

    pl.lit('''</wd:Overall_XML_Tag>''') 
])
])

However, this fails at the escape call with the error "'Expr' object has no attribute 'replace'".
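The error can be reproduced without Polars, which may help show where it comes from. `escape` is ordinary Python string code that calls `.replace` on whatever it is given, so it needs an actual `str`, not an expression that describes a column. The sketch below uses a hypothetical stand-in class, `NotAString`, in place of `pl.col('value')`; it is an illustration of the failure mode, not Polars code.

```python
from xml.sax.saxutils import escape


class NotAString:
    """Hypothetical stand-in for an expression object such as pl.col('value')."""


# escape() works on a real string value...
escaped = escape("a < b & c")

# ...but fails on any object that has no str.replace method, which is the
# situation in the question: the argument is an expression, not a value.
try:
    escape(NotAString())
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(escaped)
print(error_message)
```

The second call fails with the same kind of AttributeError as in the question, only with a different class name.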

I was able to get the following working by chaining string replacements for each reserved character, but it is messy and cumbersome, so I'm hoping there is a better way to handle this.

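import polars as pl

table_raw = pl.read_sql("""SELECT * FROM mytable""", engine).lazy()

table = table_raw.select([
    pl.concat_str([
        pl.lit('<wd:Overall_XML_Tag>').alias('Overall_XML_header'),

        # is_not_null() is the explicit null check; `!= None` yields null, not True/False
        pl.when(pl.col('value').is_not_null())
        .then(pl.format(
            '<wd:Value_XML_Tag>{}</wd:Value_XML_Tag>',
            # replace_all replaces every occurrence (str.replace only replaces the first).
            # '&' must come first so the '&' in the other entities is not escaped again.
            pl.col('value')
            .str.replace_all('&', '&amp;', literal=True)
            .str.replace_all('<', '&lt;', literal=True)
            .str.replace_all('>', '&gt;', literal=True)
            .str.replace_all('"', '&quot;', literal=True)
            .str.replace_all("'", '&apos;', literal=True),
        ))
        .otherwise(pl.lit(''))
        .alias('value'),

        pl.lit('</wd:Overall_XML_Tag>'),
    ])
])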
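For comparison, here is a minimal plain-Python sketch of what the replacement chain is meant to compute for a single value. It relies on the optional `entities` argument of `escape`, since by default `escape` only handles `&`, `<` and `>` and leaves quotes alone. The names `QUOTE_ENTITIES`, `escape_with_quotes` and `escape_by_hand` are illustrative and not part of any library.

```python
from xml.sax.saxutils import escape

# Extra entities on top of the three that escape() handles by default.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_with_quotes(text: str) -> str:
    """Escape &, <, > and both quote characters using the standard library."""
    return escape(text, QUOTE_ENTITIES)


def escape_by_hand(text: str) -> str:
    """Same result via explicit replacements; '&' has to be handled first."""
    replacements = [
        ("&", "&amp;"),
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("'", "&apos;"),
    ]
    for char, entity in replacements:
        text = text.replace(char, entity)
    return text


sample = """Tom & "Jerry's" <show>"""
print(escape_with_quotes(sample))
print(escape_by_hand(sample))
```

Both functions print the same escaped string. If `&` were replaced last instead of first, the `&` inside entities such as `&lt;` would be escaped a second time.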

Does anyone have a better way to handle this?


Solution

  • I figured out a way to handle this. You can apply escape to each value with a custom function, like the following:

    import polars as pl
    from xml.sax.saxutils import escape

    table_raw = pl.read_sql("""SELECT * FROM mytable""", engine).lazy()

    table = table_raw.select([
        pl.concat_str([
            pl.lit('<wd:Overall_XML_Tag>').alias('Overall_XML_header'),

            # is_not_null() is the explicit null check; `!= None` yields null, not True/False
            pl.when(pl.col('value').is_not_null())
            .then(pl.format(
                '<wd:Value_XML_Tag>{}</wd:Value_XML_Tag>',
                # map_elements calls escape() in Python on each non-null value
                pl.col('value').map_elements(escape, return_dtype=pl.Utf8),
            ))
            .otherwise(pl.lit(''))
            .alias('value'),

            pl.lit('</wd:Overall_XML_Tag>'),
        ])
    ])