I am trying to write a query in Python using pandasql
. My code is as below,
import pandas as pd
from pandasql import *
data = pd.read_csv('registerlog.csv')
q = """
SELECT
a.RegistrationMonth, COUNT(DISTINCT a.UserID) AS UserSize,
COUNT(
CASE a.MonthDifference
WHEN 0.0 THEN DISTINCT a.UserID ELSE NULL
END
) AS MonthZero
FROM
data) AS a
GROUP BY
a.RegistrationMonth
"""
print sqldf(q, locals())
But this gives the following error,
print sqldf(q, locals()) File "C:\Python27\lib\site-packages\pandasql\sqldf.py", line 156, in sqldf return PandaSQL(db_uri)(query, env) File "C:\Python27\lib\site-packages\pandasql\sqldf.py", line 63, in call raise PandaSQLException(ex) PandaSQLException: (sqlite3.OperationalError) near "DISTINCT": syntax error
But if I use WHEN 0.0 THEN a.user_id ELSE NULL
then it works. Also the normal way of COUNT(DISTINCT a.user_id)
also works fine.
But I want to get only the DISTINCT
values inside the CASE
. Is there a way to achieve this to get the DISTINCT
count value inside the CASE
?
In the SQL grammer, DISTINCT does not belong to any values (expressions), but to the SELECT or the aggregate function (here: COUNT). So you have to write it directly after the SELECT or the opening parenthesis:
SELECT ..., COUNT(DISTINCT CASE ... END) ...