Search code examples
pythondsltabular

A DSL in Python for tabular analysis and processing?


I am carrying around the idea to create a restricted form of a language for tabular analysis. Thequestion is if there is something like the following already around or if there are better ideas todo it. The key requirement is that the user must be able to employ naked script files without thenecessity to produce a full Python program. OTOH elements of the script language are compiled at runtime andshould offer a possibility to formulate complex conditions and calculations (mostly arithmetic and string operations). The declarative way to formulate programs in the language (see below) prohibits the direct use of Python syntax as the vehicle for the language (see the implications of the @PART function) - at least I think so. Is there a better/ more clever/ more elegant way to achieve my goals than to program around in a semi-parsed mixture of Python and clumsy self-defined syntax like I did below?

Below I try to make my ideas clear by an example. The input table is produced by another program part and looks like this on start of the script interpretation:

# First,Last,Department,Hourly Wage
[ ('Greg','Foo','HR',100),
  ('Judy','Bar','EE',51),
  ('Jake','Baz','HR',75),
  ('Lila','Bax','HR',49),
  ('Norm','Fob','EE',49) ]

Following is the 'script' file itself. This will be a file for itself in a production system. The program code is currently laid down as an array of Python strings - maybe not even that in a final version.

# A program to produce per department the average hourly rate, separated for the higher and lower 50% of earners: 
[ "@SORT(2,-3)", 
     "@SET({max},@MAX({3}))",
     "@PART({2}!={^2} or {3}<{max}/2)",
        "@SET({dep},@FIRST({2}))",
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"
]

I will try to explain step by step what the script shall do:

"@SORT(2,-3)" 

sorts the table after column 2 (ascending), then column 3 (descending). We get

[ ('Judy','Bar','EE',51),
  ('Norm','Fob','EE',49),
  ('Greg','Foo','HR',100),
  ('Jake','Baz','HR',75),
  ('Lila','Bax','HR',49),
]

"@SET({max},@MAX({3}))" 

takes the maximum of column 3 and puts it into a dynamic, local variable max

"@PART({2}!={^2} or {3}<{max}/2)" 

is a bit more complex. @PART partitions the current table into several sub-tables by evaluating the given expression for each row and cutting in front of the row if true. Here we want to cut at department-borders (column 2). {^2} is an upward reference, meaning the element in column 2 from the previous row. This syntax is necessary as I deem the ability to partition tables on conditions more complex than "row differs from previous row in X" very important (imagine you want to partition the table in 10k-income classes) so I need the expressive power of a (restricted) Python expression in the argument to PART. Also this has the implication that the expression can not be evaluated for the first row as there is no predecessor, so the PART-function will simply go over it. After this function we have the following tables:

[ ('Judy','Bar','EE',51) ] # Department EE

[ ('Norm','Fob','EE',49) ] # Norm Fob is in the same department but earns less than half of the maximum

[ ('Greg','Foo','HR',100), # New department HR
  ('Jake','Baz','HR',75) ]

[ ('Lila','Bax','HR',49) ] # HR dept. but less than half of the best earner

From here on, the functions in the script will work on each sub-table separately. A PART function more or less starts a loop over all resulting sub-tables and every following function (including more PARTs) is executed on a subtable in isolation.

"@SET({dep},@FIRST({2}))",
"@PRINT({dep},float(@SUM({3}))/@CNT({3}))"

@FIRST({2}) takes the value of column 2 of the first row simply. @SUM({3}) takes the sum of the whole column 3 and @CNT({3}) is counting the number of rows which have column 3 unequal to None. I picture the functions outcome roughly here:

[ ('Judy','Bar','EE',51) ]
        "@SET({dep},@FIRST({2}))"  --> {dep} = "EE"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "EE 51"

[ ('Norm','Fob','EE',49) ]
        "@SET({dep},@FIRST({2}))", --> {dep} = "EE"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "EE 49"

[ ('Greg','Foo','HR',100), 
  ('Jake','Baz','HR',75) ]
        "@SET({dep},@FIRST({2}))", --> {dep} = "HR"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "HR 87.5"

[ ('Lila','Bax','HR',49) ] 
        "@SET({dep},@FIRST({2}))", --> {dep} = "HR"
        "@PRINT({dep},float(@SUM({3}))/@CNT({3}))"  --> output  "HR 49"

I should add that I'd prefer to keep the solution small, i.e. no use of non-standard Python packages like pyparsing etc.


Solution

  • littletable is a module that I wrote to do some of this tabular analysis of a list of like items. littletable does not use SQL for selects and queries, but fields can be indexed, and tables can do joins, pivots, and queries. Tables can be treated just like Python lists. Probably the biggest philosophical point about littletable is that every join, query, etc. returns a new Table, so that a complex expression can be built up from intermediate joins and queries. Here are some samples of manipulating your data with littletable:

    attrs = "Id,First,Last,Department,Hourly_Wage".split(',') 
    data = [ (1, 'Greg','Foo','HR',100), 
              (2, 'Judy','Bar','EE',51), 
              (3, 'Jake','Bar','HR',75), 
              (4, 'Lila','Bax','HR',49), 
              (5, 'Norm','Fob','EE',49) ] 
    
    from littletable import Table, DataObject
    
    instructors = Table()
    instructors.insert_many(
        DataObject(**dict(zip(attrs,d))) 
            for d in data)
    
    # can add index before or after items are added to table
    instructors.create_index("Id", unique=True)
    instructors.create_index("Department")
    
    # unique keys are enforced
    try:
        instructors.insert(DataObject(Id=4, First="Bob", Last="Fob"))
    except KeyError as e:
        print e
    
    # keys checked for uniqueness when creating unique index
    try:
        instructors.create_index("Last", unique=True)
    except KeyError as e:
        print e
    
    # Uniquely indexed access gives a single record
    print "%(First)s %(Last)s" % instructors.by.Id[3]
    
    # Non-uniquely indexed access gives a new Table
    print '\n'.join("%(Department)s %(First)s %(Last)s" % inst 
                        for inst in instructors.by.Department["HR"])
    
    # Table can still be accessed like a Python list
    print "%(First)s %(Last)s" % instructors[-1]
    print '\n'.join("%(Department)s %(First)s %(Last)s" % inst 
                        for inst in instructors)
    
    # use pivot for multi-dimensional grouping
    instructors.addfield("wage_bracket", lambda d:d.Hourly_Wage/10*10)
    instructors.create_index("wage_bracket")
    instructors.pivot("Department wage_bracket").dump()
    instructors.pivot("Department wage_bracket").dump_counts()
    
    import sys
    instructors.csv_export(sys.stdout)
    

    prints:

    ("duplicate unique key value '4' for index Id", {'Last': 'Fob', 'Id': 4, 'First': 'Bob'})
    'duplicate key value Bar'
    Jake Bar
    HR Greg Foo
    HR Jake Bar
    HR Lila Bax
    Norm Fob
    HR Greg Foo
    EE Judy Bar
    HR Jake Bar
    HR Lila Bax
    EE Norm Fob
    Pivot: Department,wage_bracket
      Department:EE
        Department:EE/wage_bracket:40
          {'Last': 'Fob', 'Hourly_Wage': 49, 'Department': 'EE', 'wage_bracket': 40, 'Id': 5, 'First': 'Norm'}
        Department:EE/wage_bracket:50
          {'Last': 'Bar', 'Hourly_Wage': 51, 'Department': 'EE', 'wage_bracket': 50, 'Id': 2, 'First': 'Judy'}
      Department:HR
        Department:HR/wage_bracket:40
          {'Last': 'Bax', 'Hourly_Wage': 49, 'Department': 'HR', 'wage_bracket': 40, 'Id': 4, 'First': 'Lila'}
        Department:HR/wage_bracket:70
          {'Last': 'Bar', 'Hourly_Wage': 75, 'Department': 'HR', 'wage_bracket': 70, 'Id': 3, 'First': 'Jake'}
        Department:HR/wage_bracket:100
          {'Last': 'Foo', 'Hourly_Wage': 100, 'Department': 'HR', 'wage_bracket': 100, 'Id': 1, 'First': 'Greg'}
    Pivot: Department,wage_bracket
                  40         50         70        100      Total
    EE             1          1          0          0          2
    HR             1          0          1          1          3
    Total          2          1          1          1          5
    Last,Hourly_Wage,Department,wage_bracket,Id,First
    Foo,100,HR,100,1,Greg
    Bar,51,EE,50,2,Judy
    Bar,75,HR,70,3,Jake
    Bax,49,HR,40,4,Lila
    Fob,49,EE,40,5,Norm