python  python-3.x  escaping  pyyaml

"Unescaping" backslashes in a string


TL;DR

I want to transform a string (representing a regex) like "\\." into "\." in a clean and resilient way. Something akin to sed 's/\\\\/\\/g' would do, though I don't know whether that could break on edge cases.
val.decode('string-escape') is not an option since I'm using Python 3.

What I tried so far:

  • variations of val.replace('\\\\', '\\')
  • looked at the answers to these two questions but couldn't get them to work in my case
    • variations of val.encode().decode('unicode-escape')
  • had a look at the docs for strings but couldn't find a solution

I am sure that I missed a relevant part, because string escaping (and unescaping) seems like a fairly common and basic problem, but I haven't found a solution yet =/

Full Story:

I have a YAML file like this:

- !Scheme
      barcode: _([ACGTacgt]+)[_.]
      lane: _L(\d\d\d)[_.]
      name: RKI
      read: _R(\d)+[_.]
      sample_name: ^(.+)(?:_.+){5}
      set: _S(\d+)[_.]
      user: _U([a-zA-Z0-9\-]+)[_.]
      validation: .*/(?:[a-zA-Z0-9\-]+_)+(?:[a-zA-Z0-9])+\.fastq.*
...

It describes a "Scheme" object. The 'name' key is an identifier and the remaining keys hold regexes.

I want to be able to parse an object from that YAML so I wrote a from_yaml class method:

scheme = Scheme()
loaded_mapping = loader.construct_mapping(node)  # load yaml-node as dictionary WARNING! loads str escaped

# re.compile all keys except name, adding name as regular string and
# unescaping escaped sequences (like '\') in the process
for key, val in loaded_mapping.items():
    if key == 'name':
        processed_val = val
    else:
        processed_val = re.compile(val)  # backslashes in val are escaped
    setattr(scheme, key, processed_val)

The problem is that loader.construct_mapping(node) loads the strings with backslashes escaped, so the regex is not correct anymore.

I tried several variations of val.encode().decode('unicode-escape') and val.replace('\\\\', '\\'), but had no luck.

If anyone has an idea how to handle this I'd appreciate it very much! I am not married to this specific way of doing things and open to alternative approaches.

Kind Regards!


Solution

  • Assume this minimal YAML file, test.yaml:

    lane: _L(\d\d\d)[_.]
    

    and load it with PyYAML like this:

    import yaml
    import re
    
    with open('test.yaml', 'rb') as stream:
        data = yaml.safe_load(stream)
    
    lane_pattern = data['lane']
    print(lane_pattern)
    
    lane_expr = re.compile(data['lane'])
    print(lane_expr)
    

    Then the result is exactly as one would expect:

    _L(\d\d\d)[_.]
    re.compile('_L(\\d\\d\\d)[_.]')
    

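    There is no double escaping of strings going on when YAML is parsed, so there is nothing for you to unescape. The doubled backslashes in the second output line are only how Python's repr() displays the pattern; the string itself contains single backslashes.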
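
To check this without a YAML file, here is a minimal sketch that uses only the standard library. The raw-string literal is an assumption standing in for the value the loader returns for the plain scalar, and the file name sample_L001_R1.fastq is made up for illustration.

```python
import re

# Stand-in for the value a YAML loader returns for the plain scalar
# `_L(\d\d\d)[_.]`; the raw-string literal stores single backslashes.
lane_pattern = r"_L(\d\d\d)[_.]"

shown_by_print = str(lane_pattern)   # what print() displays
shown_by_repr = repr(lane_pattern)   # the form embedded in re.compile(...) output

print(shown_by_print)
print(shown_by_repr)

# Count the backslash characters actually stored in the string.
backslash_count = lane_pattern.count("\\")
print(backslash_count)

# The pattern compiles and matches as written, no unescaping required.
lane_expr = re.compile(lane_pattern)
match = lane_expr.search("sample_L001_R1.fastq")  # hypothetical file name
lane = match.group(1) if match else None
print(lane)

# Only a string that really stores two backslashes in a row needs collapsing.
doubled = "_L(\\\\d\\\\d\\\\d)[_.]"  # six backslash characters in memory
collapsed = doubled.replace("\\\\", "\\")
print(collapsed == lane_pattern)
```

If the count of stored backslashes matches what is written in the YAML file, the value is already correct and the doubling is purely a display effect. The replace call at the end is only relevant when the data itself contains doubled backslashes, for example because they were typed that way in the file.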