data wrangling with Flask: how to do this using SQL language? Does it make sense to use pandas?

Quite new to SQL, and working with flask and sqlalchemy here is my issue (I hope it's not too long)

Overview:

I have a SQL table structured like this:

name    vector       axis       value       unit    ref 
----------------------------------------------------------------    
name1       v1          W       46504       psi     ref1
name1       v1          L       51757       psi     ref1
name1       v2          W         127       psi     another ref
name1       v2          L         403       psi     ref1
name2 ...

My aim is to "unstack" the results such as I could have one row per name as long as unit and ref are the same.

e.g. I wish to get something like:

name      v1-L      v2-W    v1-L    v2-W    unit    ref
--------------------------------------------------------------
name1    46504               127    403     psi    ref1
name1                127                    psi    another ref
name2...

Trying with sqlalchemy:

so far, I tried to join the same table based on 'name' --for now, no check on unit neither ref --:

from sqlalchemy import create_engine
from sqlalchemy import Column, Integer, String, Table, Text, Date, Float
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import  sessionmaker, aliased
engine = create_engine('sqlite://') #, echo=True)
Session = sessionmaker(bind=engine)
session = Session()
Base = declarative_base()

class Test(Base):
    __tablename__ = 'test'
    id = Column(Integer, primary_key=True)
    name = Column(String(32))
    vector = Column(String(32))
    axis = Column(String(1))
    value = Column(Float)
    unit = Column(String(16), default='psi')
    ref = Column(String(32))

Base.metadata.create_all(engine)

# some data to play with
data = [{'name':'name1', 'vector':'v1', 'axis':'W', 'value':'46504', 'unit':'psi', 'ref':'ref1'},
        {'name':'name1', 'vector':'v1', 'axis':'L', 'value':'51757', 'unit':'psi', 'ref':'ref1'},
        {'name':'name1', 'vector':'v2', 'axis':'W', 'value':'127', 'unit':'psi', 'ref':'another ref'},
        {'name':'name1', 'vector':'v2', 'axis':'L', 'value':'403', 'unit':'psi', 'ref':'ref1'},
        {'name':'name2', 'vector':'v1', 'axis':'L', 'value':'23000', 'unit':'psi', 'ref':'ref1'},
        {'name':'name2', 'vector':'v1', 'axis':'W', 'value':'27000', 'unit':'psi', 'ref':'ref1'},
        {'name':'name2', 'vector':'v2', 'axis':'L', 'value':'523', 'unit':'psi', 'ref':'ref1'},
        {'name':'name2', 'vector':'v2', 'axis':'W', 'value':'217', 'unit':'psi', 'ref':'ref1'},]
for dic in data:
    t = Test(**dic)
    session.add(t)
session.commit()

test_alias = aliased(Test)
q = session.query(Test.id, Test.name, Test.value).filter(Test.vector == 'v1').\
    join(test_alias, Test.name).filter(test_alias.vector == 'v2')
print q

Example using pandas:

Here is what I would get using pandas library.

import pandas as pd
q = session.query(Test).order_by(Test.id) # that is the default table
row2dict = lambda r: {c.name: getattr(r, c.name) for c in r.__table__.columns}
df = pd.DataFrame([row2dict(i) for i in q])
df = df.drop(['id'], axis=1)

df = df.set_index(['ref', 'unit', 'name', 'vector', 'axis']).sort()
df = df.unstack(level=-2).unstack(level=-1)['value'].reset_index()
print(df)

vector          ref unit   name     v1          v2     
axis                                 L      W    L    W
0       another ref  psi  name1    NaN    NaN  NaN  127
1              ref1  psi  name1  51757  46504  403  NaN
2              ref1  psi  name2  23000  27000  523  217

...which is not so far from my expect.

So does it make sense to do this in SQL language? Thus my following question is: using Flask framework, does it make sense to do this data munging using pandas? or should I stick to SQL-language?

Solution

Pandas is probably better for this kind of thing. There might be some more exotic SQL functions that do a transformation like this, but I'm not sure. Below is your example doing the most simplistic approach, which is just to JOIN a view of each one together. The way this comes to mind is that since you are producing rows that have "virtual" columns derived from data, that suggests a composite row is being created from more fundamental rows. So the approach below breaks out a set of rows corresponding to each of the four conditions, v1/W, v1/L, v2/W, v2/L. Now if in practice, there's any number of "vN", that's where the transformational ability of Pandas is probably more appropriate.

from sqlalchemy import *
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session, aliased

Base = declarative_base()

class Test(Base):
    __tablename__ = 'test'
    id = Column(Integer, primary_key=True)
    name = Column(String(32))
    vector = Column(String(32))
    axis = Column(String(1))
    value = Column(Float)
    unit = Column(String(16), default='psi')
    ref = Column(String(32))

engine = create_engine('sqlite://', echo=True)
session = Session(engine)

Base.metadata.create_all(engine)

# some data to play with
data = [{'name':'name1', 'vector':'v1', 'axis':'W', 'value':'46504', 'unit':'psi', 'ref':'ref1'},
        {'name':'name1', 'vector':'v1', 'axis':'L', 'value':'51757', 'unit':'psi', 'ref':'ref1'},
        {'name':'name1', 'vector':'v2', 'axis':'W', 'value':'127', 'unit':'psi', 'ref':'another ref'},
        {'name':'name1', 'vector':'v2', 'axis':'L', 'value':'403', 'unit':'psi', 'ref':'ref1'},
        {'name':'name2', 'vector':'v1', 'axis':'L', 'value':'23000', 'unit':'psi', 'ref':'ref1'},
        {'name':'name2', 'vector':'v1', 'axis':'W', 'value':'27000', 'unit':'psi', 'ref':'ref1'},
        {'name':'name2', 'vector':'v2', 'axis':'L', 'value':'523', 'unit':'psi', 'ref':'ref1'},
        {'name':'name2', 'vector':'v2', 'axis':'W', 'value':'217', 'unit':'psi', 'ref':'ref1'},]
for dic in data:
    t = Test(**dic)
    session.add(t)
session.commit()

axis_w = session.query(Test).filter(Test.axis == "W")
axis_l = session.query(Test).filter(Test.axis == "L")
axis_v1_w = axis_w.filter(Test.vector == "v1").subquery()
axis_v1_l = axis_l.filter(Test.vector == "v1").subquery()
axis_v2_w = axis_w.filter(Test.vector == "v2").subquery()
axis_v2_l = axis_l.filter(Test.vector == "v2").subquery()

def join_axes(left, right):
    return and_(
                left.c.unit == right.c.unit,
                left.c.ref == right.c.ref
            )

name_unit_ref = session.query(Test.name, Test.unit, Test.ref).distinct().subquery()

q = session.query(name_unit_ref.c.name,
                axis_v1_w.c.value.label('v1_w'),
                axis_v1_l.c.value.label('v1_l'),
                axis_v2_w.c.value.label('v2_w'),
                axis_v2_l.c.value.label('v2_l'),
                name_unit_ref.c.unit,
                name_unit_ref.c.ref
            ).\
            outerjoin(axis_v1_w, join_axes(name_unit_ref, axis_v1_w)).\
            outerjoin(axis_v1_l, join_axes(name_unit_ref, axis_v1_l)).\
            outerjoin(axis_v2_w, join_axes(name_unit_ref, axis_v2_w)).\
            outerjoin(axis_v2_l, join_axes(name_unit_ref, axis_v2_l))

for row in q:
    print row