I have lots of data in PostgreSQL. But I need to do some pivot tables like it does SPSS. For example i have table with cities and states.
create table cities
(
city integer,
state integer
);
insert into cities(city,state) values (1,1);
insert into cities(city,state) values (2,2);
insert into cities(city,state) values (3,1);
insert into cities(city,state) values (4,1);
Actually in this table i have 4 cities and 2 states. I want to do pivot table with percentage like
city\state |state-1| state-2|
city1 |33% |0% |
city2 |0% |100% |
city3 |33% |0% |
city4 |33% |0% |
totalCount |3 |1 |
I understant how do to this in this particulary case with sql. But all i want is to cross one variable by another (just count distinct values and devide it by "count(*) where variable_in_column_names=1 and so on) using some stored function. Im looking now at plpython.Some my questions are:
As i can see, input will be table name, column name of first variable, column name of second variable. Doing lots of queries in function's body (count(*),loop thru every distinct value in variables and count it and so on) and then return a table with percentage.
You might want to give pandas a try, which is an excellent python data analysis library.
To query a PostgreSQL database:
import psycopg2
import pandas as pd
from pandas.io.sql import frame_query
conn_string = "host='localhost' dbname='mydb' user='postgres' password='password'"
conn = psycopg2.connect(conn_string)
df = frame_query('select * from cities', con=conn)
Where df
is a DataFrame like:
city state
0 1 1
1 2 2
2 3 1
3 4 1
You can then create a pivot table using pivot_table
and dividing by the total to get percentages:
totals = df.groupby('state').size()
pivot = pd.pivot_table(df, rows='city', cols='state', aggfunc=len, fill_value=0) / totals
Giving you the result:
state 1 2
city
1 0.333333 0
2 0 1
3 0.333333 0
4 0.333333 0
Finally to get the layout you want, you just need to rename the index and columns, and append the totals:
totals_frame = pd.DataFrame(totals).T
totals_frame.index = ['totalCount']
pivot.index = ['city%i' % item for item in pivot.index]
final_result = pivot.append(totals_frame)
final_result.columns = ['state-%i' % item for item in final_result.columns]
Giving you:
state-1 state-2
city1 0.333333 0
city2 0.000000 1
city3 0.333333 0
city4 0.333333 0
totalCount 3.000000 1