Search code examples
pythondataframejoinpython-polars

Polars - How can I make multiple joins cross multiple Dataframes, examples included


import polars as pl

#Auctiondata which is used to create the AuctionDF
auctiondata = {"AuctionId": [2095293259, 2096131235, 2094319272, 2094265820, 2094902378, 2096005275],
               "Bid": [9499998, 8499998, 8500000 , 1400832, 1400000, 872],
               "Buyout": [9499998, 9499998, 8500000, 1450832, 1500000, 900],
               "Quantity": [1, 1, 1, 1, 1, 1],
               "Time_Left": ['Short', 'Very long', 'Long', 'Short', 'Long', 'Long'],
               "ItemId": [24655, 24648, 3184, 14187,6580,1482],
               "ItemRand": [-39, -19, 24, 2032, 1020,None],
               "ItemSeed": [886505522, 483524644, 384031104, 1995900544,2119510144,None],
               "SuffixFactor": [50, 36, 55680, 1664, 10368,None],
               "Faction": ['Horde', 'Alliance', 'Alliance', 'Alliance', 'Horde','Horde'],
               "RealmName": ['Mograine', 'Bloodfang', 'Firemaw', 'Firemaw', 'Mograine','Mograine'],
               "BuyoutGold": ['45', '77', '24', '39', '120','42'],
               "BuyoutSilver": ['40', '44', '12', '33', '12','51'],
               "BuyoutCopper": ['12', '11', '21', '52', '32','42'],
               "BidGold": ['39', '12', '11', '27', '99','23'],
               "BidSilver": ['32', '14', '44', '12', '4','42'],
               "BidCopper": ['21', '12', '32', '12', '45','33']
               }

#itemsData which is used to create the itemsDF
itemsData = {"ID": [24655,24648,3184,14187,6580],
             "Display_lang" : ['Consortium Robe','Astralaan Gloves','Hook Dagger','Raincaller Cuffs','Defender Tunic']}

#suffixData which is used to create the suffixDF
suffixData = {"ID": [19, 39],
               "Name_lang": ['of Intellect', 'of the Invoker'],
               "Enchantment[0]": [2804, 2804],
               "Enchantment[1]": [0, 2824],
               "Enchantment[2]": [0, 2822],
               "Enchantment[3]": [0, 0],
               "Enchantment[4]": [0, 0],
               "AllocationPct[0]": [10000, 5259],
               "AllocationPct[1]": [None, 6153],
               "AllocationPct[2]": [None, 5259],
               "AllocationPct[3]": [None, None],
               "AllocationPct[4]": [None, None]
               }

#propertiesData which is used to create the propertiesDF
propertiesData = {"ID": [24, 1020, 2032],
               "Name_lang": ['of Strength', 'of the Whale', 'of Healing'],
               "Enchantment[0]": [70, 98, 2312],
               "Enchantment[1]": [0, 103, 0],
               "Enchantment[2]": [0, 0, 0],
               "Enchantment[3]": [0, 0, 0],
               "Enchantment[4]": [0, 0, 0]
               }

#enchantmentsData which is used to create the enchantmentDF
enchantmentsData = {"ID": [70, 98, 103, 2312, 2804, 2822, 2824],
                    "Name_lang" : ['+3 Strength','+4 Spirit','+5 Stamina','+7 Spell Power','+$i Intellect', '+$i Critical Strike Rating', '+$i Spell Power']}

#resultData which is used in order to create the resultDF
resultData = {"AuctionId" : [2095293259, 2096131235, 2094319272, 2094265820, 2094902378, 2096005275],
              "ItemId" : [24655, 24648, 3184, 14187, 6580, 1482],
              "ItemName" : ['Consortium Robe','Astralaan Gloves','Hook Dagger','Raincaller Cuffs','Defender Tunic',''],
              "RealmName" : ['Mograine', 'Bloodfang', 'Firemaw', 'Firemaw', 'Mograine','Mograine'],
              "Faction" : ['Horde', 'Alliance', 'Alliance', 'Alliance', 'Horde','Horde'],
              "EnchantmentName" : ['of the Invoker','of Intellect','of Strength','of Healing','of the Whale',''],
              "Stat0" : ['+26 Intellect','+36 Intellect','+3 Strength','+13 Healing Spells and +5 Damage Spells','+4 Spirit',''],
              "Stat1" : ['+30 Spell Damage and Healing','','','','+5 Stamina',''],
              "Stat2" : ['+26 Spell Critical Strike Rating','','','','',''],
              "Stat3" : ['','','','','',''],
              "Stat4" : ['','','','','',''],
              "BuyoutGold" : ['45', '77', '24', '39', '120','42'],
              "BuyoutSilver" : ['40', '44', '12', '33', '12','51'],
              "BuyoutCopper" : ['12', '11', '21', '52', '32','42'],
              "BidGold" : ['39', '12', '11', '27', '99','23'],
              "BidSilver" : ['32', '14', '44', '12', '4','42'],
              "BidCopper" : ['21', '12', '32', '12', '45','33']}

#"Main" DF
auctionDF = pl.DataFrame(auctiondata)

#The ID column of the below Dataframe refrence to "ItemId" in auctionDF, it's not ALWAYS the ItemId from AuctionDF is within the itemsDF tho, but 99% of the time it is.
itemsDF = pl.DataFrame(itemsData)

#All negative ItemRands (ItemRands lower than 0) from AuctionDF refrences to the ID column of suffixDF, so I would imagine one of the first things to do is to make the ID column in suffixDF negative for a later join?
suffixDF = pl.DataFrame(suffixData)

#All positive ItemRands (ItemRands larger than 0) from AuctionDF refrences to the ID column of propertiesDF
propertiesDF = pl.DataFrame(propertiesData)

#All the various "Enchantment[X] columns from suffixDF and propertiesDF references to the ID column of the enchantmentDF"
enchantmentsDF = pl.DataFrame(enchantmentsData)

#The reason ItemName is blank for the ItemId 1482 is because it does not exist in the itemsDF
resultDF = pl.DataFrame(resultData)
print(resultDF)

The resultDF is the result I want to obtain.

So basically we have the mainDF "AuctionDF": ItemId references to the ID column in itemsDF where we will need the Display_lang in order to make the "ItemName" column in the resultDF ItemRand references to either the propertiesDF (if the ItemRand is positive) or the suffixDF (if the ItemRand is negative) In the propertiesDF we need to get the column "Name_lang" which gives us the column "EnchantmentName" in the resultDF Furthermore we also need to use the Enchantment[0-4] columns from propertiesDF, these columns contains an ID which references over to enchantmentsDF where we need the column Name_lang. In the resultDF you will notice there are 5x StatX columns (stat0-4), those contains the value of Name_lang from the EnchantmnetsDF Lastly if the ItemRand from AuctionsDF had been a negative value, we would have gone into the "suffixDF" as the negative ItemRand is a reference to the ID column in SuffixDF In SuffixDF you also find Enchantment[0-4] as we did in the "propertiesDF", and those also references to ID in enchantmentsDF. However you will fast notice the Name_lang for those values looks a bit differently, such as "+$i Intellect", this is because we have to calculate the $i value our selfs. In the SuffixDF there also are an additional 5 columns AllocationPct[0-4] all these values has to be divided by 10000 firstly, as an example then for each AllocationPct[0-4] which has a value, we multiply it with the SuffixFactor So for negative rands (SuffixDF) Enchantment0 and AllocationPct0 gives Stat0, Enchantment1 and AllocationPct1 gives Stat1 and so on

Example of item calucation of a negative rand: ItemId 24655 from AuctionDF has ItemRand -39 and SuffixFactor 50 We take the -39, lookup in SuffixDF (ID column), we see Enchantment0 = 2804 We lookup 2804 in the ID column in EnchantmentsDF which has the Name_lang = +$i Intellect Now we calculate $i by look at AllocationPct0 in the SuffixDF which is 5259 Then we divide 5259 by 10000 (5259/10000 = 0,5259) Now we multiply 0,5259 with the SuffixFactor (50) and use floor on it (0,5259*50 = 26,295 = floor(26,295) = 26) Now we have Stat0 = +26 Intellect

I might have made slight mistakes in the data as I typed all this out manually. If there are any questions or you think I might have made a mistake, feel free to ask.

Best regards


Solution

  • We'll take this in three steps.

    Step 1: Normalize and Stack propertiesDF and suffixDF

    When you see columns like Enchantment[0], Enchantment[1], etc. and lots of "0" or "null" values, you're essentially working with a database in non-normal form. This can make calculations awkward.

    So, as our first step, we'll use unpivot to put each Enchantment and Allocation in its own row for each ItemRand. In general, unvpiot is used to convert a "wide format" DataFrame (few rows, many columns) into a "long format" DataFrame (many rows, few columns).

    We'll use concat to horizontally stack the Enchantment and Allocation columns next to each other, and then use concat to vertically stack the propertiesDF and suffixDF DataFrames after they are unpivoted. We'll also change the sign of the ItemRand values in suffixDF.

    As a final step, we'll join with the enchantmentsDF - to prepare for calculating the Stat values for items that came from suffixDF in the next step.

    This step may seem intimidating at first glance, but you'll find that the code is rather repetitive: unpivotinging and stacking. I've laid out the code so that you can comment out sections and lines to following the development of the algorithm.

    You'll also note that I rename variables along the way, just to keep the code clean and understandable.

    In the end, you'll see a clean, tidy DataFrame derived from suffixDF, propertiesDF, and enchantmentsDF. This is the goal.

    enchantment_ids = (
        pl.concat(
            [
                (
                    pl.concat(
                        [
                            suffixDF.rename(
                                {f"Enchantment[{nbr}]": f"Stat{nbr}" for nbr in range(
                                    0, 5)}
                            )
                            .unpivot(
                                on=[f"Stat{nbr}" for nbr in range(0, 5)],
                                index=["ID", "Name_lang"],
                                value_name="EnchantmentID",
                                variable_name="Stat_Nbr",
                            ),
                            suffixDF.unpivot(
                                on=[
                                    f"AllocationPct[{nbr}]" for nbr in range(0, 5)],
                                value_name="Allocation",
                            ).drop("variable"),
                        ],
                        how="horizontal"
                    )
                    .filter(pl.col('EnchantmentID') != 0)
                    .select(
                        -pl.col("ID").alias("ItemRand"),
                        pl.col("Name_lang").alias("EnchantmentName"),
                        "EnchantmentID",
                        "Allocation",
                        "Stat_Nbr",
                    )
                ),
                (
                    propertiesDF.rename(
                        {f"Enchantment[{nbr}]": f"Stat{nbr}" for nbr in range(
                            0, 5)}
                    ).unpivot(
                        on=[f"Stat{nbr}" for nbr in range(0, 5)],
                        index=["ID", "Name_lang"],
                        value_name="EnchantmentID",
                        variable_name="Stat_Nbr",
                    )
                    .filter(pl.col('EnchantmentID') != 0)
                    .select(
                        pl.col("ID").alias("ItemRand"),
                        pl.col("Name_lang").alias("EnchantmentName"),
                        "EnchantmentID",
                        "Stat_Nbr",
                    )
                ),
            ],
            how="diagonal",
        )
        .sort("ItemRand", "Stat_Nbr")
        .join(
            enchantmentsDF.select(
                pl.col("ID").alias("EnchantmentID"),
                pl.col("Name_lang").alias("Enchantment_type"),
            ),
            how="left",
            on="EnchantmentID",
        )
    )
    enchantment_ids
    
    shape: (8, 6)
    ┌──────────┬─────────────────┬───────────────┬────────────┬──────────┬────────────────────────────┐
    │ ItemRand ┆ EnchantmentName ┆ EnchantmentID ┆ Allocation ┆ Stat_Nbr ┆ Enchantment_type           │
    │ ---      ┆ ---             ┆ ---           ┆ ---        ┆ ---      ┆ ---                        │
    │ i64      ┆ str             ┆ i64           ┆ f64        ┆ str      ┆ str                        │
    ╞══════════╪═════════════════╪═══════════════╪════════════╪══════════╪════════════════════════════╡
    │ -39      ┆ of the Invoker  ┆ 2804          ┆ 5259.0     ┆ Stat0    ┆ +$i Intellect              │
    │ -39      ┆ of the Invoker  ┆ 2824          ┆ 6153.0     ┆ Stat1    ┆ +$i Spell Power            │
    │ -39      ┆ of the Invoker  ┆ 2822          ┆ 5259.0     ┆ Stat2    ┆ +$i Critical Strike Rating │
    │ -19      ┆ of Intellect    ┆ 2804          ┆ 10000.0    ┆ Stat0    ┆ +$i Intellect              │
    │ 24       ┆ of Strength     ┆ 70            ┆ null       ┆ Stat0    ┆ +3 Strength                │
    │ 1020     ┆ of the Whale    ┆ 98            ┆ null       ┆ Stat0    ┆ +4 Spirit                  │
    │ 1020     ┆ of the Whale    ┆ 103           ┆ null       ┆ Stat1    ┆ +5 Stamina                 │
    │ 2032     ┆ of Healing      ┆ 2312          ┆ null       ┆ Stat0    ┆ +7 Spell Power             │
    └──────────┴─────────────────┴───────────────┴────────────┴──────────┴────────────────────────────┘
    

    Step 2: Calculate Stats and then pivot

    In the next step, we will calculate all Stats for the Items that had negative ItemRand. We'll need to join with the auctionDF to get the SuffixFactor for the calculations. You'll notice the familiar replace function.

    Once everything is calculated, we'll de-normalize the DataFrame using pivot. One way to think of pivot is that pivot is the opposite of unpivot. pivot converts a "long format" (many rows, few columns) to a "wide format" DataFrame (few rows, many columns.) pivot will create the Stat0, Stat1, and Stat2 columns.

    Note: Stat3 and Stat4 are not created because they are not needed. This reduces the width of your final DataFrame. (At the end, I'll show you how to keep Stat3 and Stat4 if you need them.)

    The result again is a neat, tidy DataFrame of Stats for each AuctionID.

    auction_enchantments = (
        auctionDF.select(
            "AuctionId",
            "ItemRand",
            "SuffixFactor",
        )
        .filter(pl.col("ItemRand").is_not_null())
        .join(enchantment_ids, on="ItemRand", how="left")
        .with_columns(
            pl.when(pl.col("Allocation").is_null())
            .then(pl.col("Enchantment_type"))
            .otherwise(
                pl.col("Enchantment_type").str.replace(
                    r"\$i",
                    (pl.col("Allocation") * pl.col("SuffixFactor") / 10_000)
                       .floor()
                       .cast(pl.Int64)
                       .cast(pl.Utf8),
                )
            )
        )
        .pivot(
            on="Stat_Nbr",
            index=["AuctionId", "EnchantmentName"],
            values="Enchantment_type",
        )
    )
    auction_enchantments
    
    shape: (5, 5)
    ┌────────────┬─────────────────┬────────────────┬─────────────────┬────────────────────────────┐
    │ AuctionId  ┆ EnchantmentName ┆ Stat0          ┆ Stat1           ┆ Stat2                      │
    │ ---        ┆ ---             ┆ ---            ┆ ---             ┆ ---                        │
    │ i64        ┆ str             ┆ str            ┆ str             ┆ str                        │
    ╞════════════╪═════════════════╪════════════════╪═════════════════╪════════════════════════════╡
    │ 2095293259 ┆ of the Invoker  ┆ +26 Intellect  ┆ +30 Spell Power ┆ +26 Critical Strike Rating │
    │ 2096131235 ┆ of Intellect    ┆ +36 Intellect  ┆ null            ┆ null                       │
    │ 2094319272 ┆ of Strength     ┆ +3 Strength    ┆ null            ┆ null                       │
    │ 2094265820 ┆ of Healing      ┆ +7 Spell Power ┆ null            ┆ null                       │
    │ 2094902378 ┆ of the Whale    ┆ +4 Spirit      ┆ +5 Stamina      ┆ null                       │
    └────────────┴─────────────────┴────────────────┴─────────────────┴────────────────────────────┘
    

    Step 3: Putting it all together

    Our goal all along was to create neat, tidy DataFrames so that this last step is simple. We simply join our auctionDF with our calculated enchantments from Step 2.

    Much of this step is just clean up: ordering the columns the way we want. (Note the slick use of regex expressions in col.) Also, we fill null values with empty strings "", for a tidier look.

    (
        auctionDF.join(
            itemsDF.rename({"ID": "ItemId", "Display_lang": "ItemName"}),
            on="ItemId",
            how="left",
        )
        .join(auction_enchantments, on="AuctionId", how="left")
        .with_columns(pl.col(pl.String).fill_null(""))
        .select(
            "AuctionId",
            "ItemId",
            "ItemName",
            "RealmName",
            "Faction",
            "EnchantmentName",
            pl.col("^Stat.*$"),
            pl.col("^Buyout.+$"),
            pl.col("^Bid.+$"),
        )
    )
    
    shape: (6, 15)
    ┌────────────┬────────┬──────────────────┬───────────┬──────────┬─────────────────┬────────────────┬─────────────────┬────────────────────────────┬────────────┬──────────────┬──────────────┬─────────┬───────────┬───────────┐
    │ AuctionId  ┆ ItemId ┆ ItemName         ┆ RealmName ┆ Faction  ┆ EnchantmentName ┆ Stat0          ┆ Stat1           ┆ Stat2                      ┆ BuyoutGold ┆ BuyoutSilver ┆ BuyoutCopper ┆ BidGold ┆ BidSilver ┆ BidCopper │
    │ ---        ┆ ---    ┆ ---              ┆ ---       ┆ ---      ┆ ---             ┆ ---            ┆ ---             ┆ ---                        ┆ ---        ┆ ---          ┆ ---          ┆ ---     ┆ ---       ┆ ---       │
    │ i64        ┆ i64    ┆ str              ┆ str       ┆ str      ┆ str             ┆ str            ┆ str             ┆ str                        ┆ str        ┆ str          ┆ str          ┆ str     ┆ str       ┆ str       │
    ╞════════════╪════════╪══════════════════╪═══════════╪══════════╪═════════════════╪════════════════╪═════════════════╪════════════════════════════╪════════════╪══════════════╪══════════════╪═════════╪═══════════╪═══════════╡
    │ 2095293259 ┆ 24655  ┆ Consortium Robe  ┆ Mograine  ┆ Horde    ┆ of the Invoker  ┆ +26 Intellect  ┆ +30 Spell Power ┆ +26 Critical Strike Rating ┆ 45         ┆ 40           ┆ 12           ┆ 39      ┆ 32        ┆ 21        │
    │ 2096131235 ┆ 24648  ┆ Astralaan Gloves ┆ Bloodfang ┆ Alliance ┆ of Intellect    ┆ +36 Intellect  ┆                 ┆                            ┆ 77         ┆ 44           ┆ 11           ┆ 12      ┆ 14        ┆ 12        │
    │ 2094319272 ┆ 3184   ┆ Hook Dagger      ┆ Firemaw   ┆ Alliance ┆ of Strength     ┆ +3 Strength    ┆                 ┆                            ┆ 24         ┆ 12           ┆ 21           ┆ 11      ┆ 44        ┆ 32        │
    │ 2094265820 ┆ 14187  ┆ Raincaller Cuffs ┆ Firemaw   ┆ Alliance ┆ of Healing      ┆ +7 Spell Power ┆                 ┆                            ┆ 39         ┆ 33           ┆ 52           ┆ 27      ┆ 12        ┆ 12        │
    │ 2094902378 ┆ 6580   ┆ Defender Tunic   ┆ Mograine  ┆ Horde    ┆ of the Whale    ┆ +4 Spirit      ┆ +5 Stamina      ┆                            ┆ 120        ┆ 12           ┆ 32           ┆ 99      ┆ 4         ┆ 45        │
    │ 2096005275 ┆ 1482   ┆                  ┆ Mograine  ┆ Horde    ┆                 ┆                ┆                 ┆                            ┆ 42         ┆ 51           ┆ 42           ┆ 23      ┆ 42        ┆ 33        │
    └────────────┴────────┴──────────────────┴───────────┴──────────┴─────────────────┴────────────────┴─────────────────┴────────────────────────────┴────────────┴──────────────┴──────────────┴─────────┴───────────┴───────────┘
    

    Other Notes

    The results are not exactly as your resultsDF. (But I think the calculated results above are correct.)

    There are quite a few advanced steps used in the algorithm above. But if you take your time and work through the algorithm, you'll understand what is happening.

    I've also provided links to Polars documentation for crucial methods (like pivot and unpivot).

    The replace step depends on Polars version 0.14.4 or above, so upgrade your version of Polars.

    If you need the Stat3 and Stat4 columns, just comment out the two lines that contain .filter(pl.col('EnchantmentID') != 0). This will cause the Stat3 and Stat4 columns to appear in the final step.