Search code examples
pythonpandassqlitepandasql

Python: Get DISTINCT column values from inside CASE function in pandasql


I am trying to write a query in Python using pandasql. My code is as below,

import pandas as pd
from pandasql import *

data = pd.read_csv('registerlog.csv')

q = """
SELECT
    a.RegistrationMonth, COUNT(DISTINCT a.UserID) AS UserSize,
    COUNT(
        CASE a.MonthDifference
            WHEN 0.0 THEN DISTINCT a.UserID ELSE NULL
        END
    ) AS MonthZero
FROM
    data) AS a
GROUP BY
    a.RegistrationMonth
"""

print sqldf(q, locals())

But this gives the following error,

print sqldf(q, locals()) File "C:\Python27\lib\site-packages\pandasql\sqldf.py", line 156, in sqldf return PandaSQL(db_uri)(query, env) File "C:\Python27\lib\site-packages\pandasql\sqldf.py", line 63, in call raise PandaSQLException(ex) PandaSQLException: (sqlite3.OperationalError) near "DISTINCT": syntax error

But if I use WHEN 0.0 THEN a.user_id ELSE NULL then it works. Also the normal way of COUNT(DISTINCT a.user_id) also works fine.

But I want to get only the DISTINCT values inside the CASE. Is there a way to achieve this to get the DISTINCT count value inside the CASE?


Solution

  • In the SQL grammer, DISTINCT does not belong to any values (expressions), but to the SELECT or the aggregate function (here: COUNT). So you have to write it directly after the SELECT or the opening parenthesis:

    SELECT ..., COUNT(DISTINCT CASE ... END) ...