Search code examples
pythonpandasdataframeuniquedistinct

Extracting unique price values from dataframe depending on real estate id


I've got a dataframe with data taken from a database like this:

conn = sqlite3.connect('REDB.db')
dataAvg1 = pd.read_sql_query(
     "SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, HOUSEINFO.RE_POLOHA, HOUSEINFO.RE_DRUH, HOUSEINFO.RE_TYP, HOUSEINFO.RE_UPLOCHA FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, HOUSEINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=HOUSEINFO.INF_ID",conn
)
dataAvg2 = pd.read_sql_query(
     "SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, FLATINFO.RE_DISPOZICE, FLATINFO.RE_DRUH, FLATINFO.RE_PPLOCHA FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, FLATINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=FLATINFO.INF_ID",conn
)
dataAvg3 = pd.read_sql_query(
     "SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, LANDINFO.RE_PLOCHA, LANDINFO.RE_DRUH, LANDINFO.RE_SITE, LANDINFO.RE_KOMUNIKACE FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, LANDINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=LANDINFO.INF_ID",conn
)
conn.close()

df2 = [dataAvg1, dataAvg2, dataAvg3]
dfAvg = pd.concat(df2)
dfAvg = dfAvg.reset_index(drop=True)

The main columns are UNIQUE_RE_NUMBER, RE_PRICE and UPDATE_DATE. I would like to count frequency of change in prices each day. Ideally create a new column called 'Frequency' and for each day add a number. For example:

UPDATE_DAY       UNIQUE_RE_NUMBER      RE_PRICE    FREQUENCY
1.1.2021                 1                500          2
1.1.2021                 2                400          2
2.1.2021                 1                500          1
2.1.2021                 2                450          1

I hope this example is understandable.

Right now I have something like this:

dfAvg['FREQUENCY'] = dfAvg.groupby('UPDATE_DATE')['UPDATE_DATE'].transform('count')
dfAvg.drop_duplicates(subset=['UPDATE_DATE'], inplace=True)

This code counts every price added that day, so when the price of real estate on 1.1.2021 was 500 and the next day, its also 500, it counts as "change" in price, but in fact the price stayed the same and I dont want to count that. I would like to select only distinct values in prices for each real estate. Is it possible?


Solution

  • Not sure if this is the most efficient way, but maybe it helps:

    def ident_deltas(sdf):
        return sdf.assign(
            DELTA=(sdf.RE_PRICE.shift(1) != sdf.RE_PRICE).astype(int)
        )
    
    def sum_deltas(sdf):
        return sdf.assign(FREQUENCY=sdf.DELTA.sum())
    
    df = (
        df.groupby("UNIQUE_RE_NUMBER").apply(ident_deltas)
          .groupby("UPDATE_DAY").apply(sum_deltas)
          .drop(columns="DELTA")
    )
    

    Result for

    df = 
      UPDATE_DAY  UNIQUE_RE_NUMBER  RE_PRICE
    0 2021-01-01                 1       500
    1 2021-01-01                 2       400
    2 2021-02-01                 1       500
    3 2021-02-01                 2       450
    

    is

      UPDATE_DAY  UNIQUE_RE_NUMBER  RE_PRICE  FREQUENCY
    0 2021-01-01                 1       500          2
    1 2021-01-01                 2       400          2
    2 2021-02-01                 1       500          1
    3 2021-02-01                 2       450          1