
In PyParsing, how to ignore lines which may start with whitespace?


I'm trying to parse data from files similar to the following (which I've named foo_badging.txt):

package: name='com.sec.android.app.camera.shootingmode.dual' versionCode='6' versionName='1.003' platformBuildVersionName='5.0.1-1624448'
sdkVersion:'17'
uses-permission: name='android.permission.CAMERA'
application-icon-640:'res/mipmap-xxhdpi-v4/application_manager_camera_mode_ic_dual_camera.png'
application: label='Dual camera' icon='res/mipmap-hdpi-v4/application_manager_camera_mode_ic_dual_camera.png'
feature-group: label=''
  uses-feature: name='android.hardware.camera'
  uses-implied-feature: name='android.hardware.camera' reason='requested android.permission.CAMERA permission'
  uses-feature: name='android.hardware.touchscreen'
  uses-implied-feature: name='android.hardware.touchscreen' reason='default feature for all apps'
other-activities
supports-screens: 'small' 'normal' 'large' 'xlarge'
supports-any-density: 'true'
locales: '--_--' 'ca' 'da' 'fa' 'ga' 'ja' 'pa' 'nb' 'be' 'de' 'ne' 'bg' 'mg' 'tg' 'th' 'xh' 'fi' 'hi' 'si' 'vi' 'sk' 'tk' 'uk' 'el' 'nl' 'pl' 'sl' 'tl' 'bn' 'in' 'ko' 'ro' 'sq' 'ar' 'fr' 'hr' 'or' 'sr' 'tr' 'as' 'cs' 'it' 'lt' 'gu' 'hu' 'ru' 'zu' 'lv' 'sv' 'iw' 'fr-CA' 'lo-LA' 'bn-BD' 'et-EE' 'ka-GE' 'ky-KG' 'my-ZG' 'km-KH' 'en-PH' 'zh-HK' 'mk-MK' 'ur-PK' 'hy-AM' 'my-MM' 'zh-CN' 'ta-IN' 'te-IN' 'ml-IN' 'bn-IN' 'kn-IN' 'mr-IN' 'mn-MN' 'pl-SP' 'pt-BR' 'gl-ES' 'es-ES' 'eu-ES' 'is-IS' 'en-US' 'es-US' 'pt-PT' 'zh-TW' 'ms-MY' 'az-AZ' 'kk-KZ' 'uz-UZ'
densities: '160' '240' '320' '480' '640'

I'd like to start by parsing the first few lines (package and sdkVersion) and then 'skip' several lines till I get to the supports-screens line. Here is what I have so far:

from pyparsing import Literal, QuotedString, LineEnd, Optional, OneOrMore, LineStart, Regex, White

with open('foo_badging.txt') as fp:
    badging = fp.read()

package_name = "name=" + QuotedString(quoteChar="'")("name")
versionCode = "versionCode=" + QuotedString(quoteChar="'")("versionCode")
versionName = "versionName=" + QuotedString(quoteChar="'")("versionName")
platformBuildVersionName = "platformBuildVersionName=" + QuotedString(quoteChar="'")("platformBuildVersionName")
sdkVersion = "sdkVersion:" + QuotedString(quoteChar="'")("sdkVersion")
targetSdkVersion = "targetSdkVersion:" + QuotedString(quoteChar="'")("targetSdkVersion")

not_supports_screens_line = LineStart() + Regex(r"(?!supports-screens:).*")     # Negative lookahead assertion for a line starting with "supports-screens:"

supports_screens = "supports-screens:" + QuotedString(quoteChar="'")("supports_screens")

expression = Literal("package:") + package_name + versionCode + versionName + platformBuildVersionName + LineEnd() \
                + Optional(sdkVersion + LineEnd()) \
                + Optional(targetSdkVersion + LineEnd()) \
                + OneOrMore(not_supports_screens_line) \
                + supports_screens + LineEnd()

tokens = expression.parseString(badging)

The problem is that I get a ParseException at the first indented uses-feature line:

Traceback (most recent call last):
  File "/home/kurt/Documents/Scratch/apk_checker/apk_check.py", line 82, in <module>
    tokens = expression.parseString(badging)
  File "/usr/local/lib/python2.7/dist-packages/pyparsing.py", line 1632, in parseString
    raise exc
pyparsing.ParseException: Expected "supports-screens:" (at char 435), (line:7, col:3)

Apparently this indented line is not counted as a not_supports_screens_line, presumably because, unlike the others, it starts with two spaces. I've tried modifying the Regex to

not_supports_screens_line = LineStart() + Regex(r"\s*(?!supports-screens:).*")

with a \s*, as well as

not_supports_screens_line = LineStart() + Optional(White()) + Regex(r"(?!supports-screens:).*")

but in both cases I still get the same error message. How can I make not_supports_screens_line also match these indented lines?


Solution

  • Following Paul McGuire's comment, I used SkipTo to avoid having to formulate a complex negative lookahead expression for lines I am not interested in. Here is the resulting code:

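    from pyparsing import Literal, QuotedString, LineEnd, Optional, LineStart, SkipTo
    
    # Parse action: convert the single matched string token to an int
    def convert_to_int(tokens):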
        return int(tokens[0])
    
    with open('foo_badging.txt') as fp:
        badging = fp.read()
    
    package_name = "name=" + QuotedString(quoteChar="'")("name")
    versionCode = "versionCode=" + QuotedString(quoteChar="'")("versionCode").setParseAction(convert_to_int)
    versionName = "versionName=" + QuotedString(quoteChar="'")("versionName")
    platformBuildVersionName = "platformBuildVersionName=" + QuotedString(quoteChar="'")("platformBuildVersionName")
    sdkVersion = "sdkVersion:" + QuotedString(quoteChar="'")("sdkVersion").setParseAction(convert_to_int)
    targetSdkVersion = "targetSdkVersion:" + QuotedString(quoteChar="'")("targetSdkVersion").setParseAction(convert_to_int)
    
    supports_screens = LineStart() + "supports-screens:" + QuotedString(quoteChar="'")("supports_screens")
    
    expression = Literal("package:") + package_name + versionCode + versionName + platformBuildVersionName + LineEnd() \
                    + Optional(sdkVersion + LineEnd()) \
                    + Optional(targetSdkVersion + LineEnd()) \
                    + SkipTo("supports-screens:") + supports_screens
    
    tokens = expression.parseString(badging)
    
    print(tokens.asDict())
    

    which prints

    {'sdkVersion': 17, 'name': 'com.sec.android.app.camera.shootingmode.dual', 'platformBuildVersionName': '5.0.1-1624448', 'supports_screens': 'small', 'versionName': '1.003', 'versionCode': 6}
    

    including the supports_screens field as desired.
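
As the printed result shows, the expression above captures only the first quoted value on the supports-screens line ('small'). If all of the values are wanted, one possible variation is to wrap a repeated QuotedString in a Group. The following is a minimal sketch under some assumptions, not the original author's code: SAMPLE is an abbreviated copy of the badging output from the question, the names SAMPLE and quoted are introduced here purely for illustration, and only the name field of the package line is extracted to keep the example short.

```python
from pyparsing import Group, Literal, OneOrMore, QuotedString, SkipTo

# Abbreviated copy of the badging output from the question (illustrative only).
SAMPLE = """\
package: name='com.sec.android.app.camera.shootingmode.dual' versionCode='6' versionName='1.003'
sdkVersion:'17'
feature-group: label=''
  uses-feature: name='android.hardware.camera'
  uses-implied-feature: name='android.hardware.camera' reason='requested android.permission.CAMERA permission'
other-activities
supports-screens: 'small' 'normal' 'large' 'xlarge'
supports-any-density: 'true'
"""

# A single-quoted value; the surrounding quotes are stripped from the token.
quoted = QuotedString("'")

# Group collects every quoted value that follows into one nested result,
# so the results name refers to the whole list rather than the first item.
supports_screens = "supports-screens:" + Group(OneOrMore(quoted))("supports_screens")

# SkipTo scans forward character by character, so the indented lines in
# between need no special handling.
expression = (
    Literal("package:") + "name=" + quoted("name")
    + SkipTo("supports-screens:")
    + supports_screens
)

tokens = expression.parseString(SAMPLE)

print(tokens["name"])                       # com.sec.android.app.camera.shootingmode.dual
print(tokens["supports_screens"].asList())  # ['small', 'normal', 'large', 'xlarge']
```

Note that OneOrMore keeps matching across line breaks, because pyparsing skips newlines as whitespace by default. It stops here only because the next line does not begin with a quote character.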