I haven't been able to find a solution to the loop part of the following. I have a data frame with over 500K rows, and I want to write a random combination of letters and numbers to each row of a column we'll call "ProductID". I found solutions here that let me write simple numbers, which work, even if they're painfully slow. For example:
for index, row in df3.iterrows():
    df3['ProductID'] = np.arange(1,551586)
I also found code on this site that produces a random sequence, and each time I run it, it dutifully produces a new string:
import string
import random
def id_generator(size=12, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

# df3['ProductID'] = id_generator()
i = 0
while i < 6:
    print(id_generator())
    i = i + 1
Output:
7JKD7LWUZPHC
1ETULSX4WRJI
B42TSN4SFC20
RYIDD7N2RPI2
8GEMULEC7TX1
0FGZZQLBF0XE
What I can't seem to do is write that string to each cell in a new column as described above.
My apologies, I can no longer find exactly where I found the generator code. However, when I try to enclose it in a loop, like so, it appears to take one generated string and simply duplicate it in every row:
for index, row in df3.iterrows():
    df3['ProductID'] = id_generator()
The same thing happens if I use a simple while loop.
Current output:
+---------------------------------------------------+---------------+------------------+---------+---------------+--------------------+------------------+--------------+
| name | main_category | sub_category | ratings | no_of_ratings | discount_price_USD | actual_price_USD | ProductID |
+---------------------------------------------------+---------------+------------------+---------+---------------+--------------------+------------------+--------------+
| Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1... | appliances | Air Conditioners | 4.2 | 2255 | 402.5878 | 719.678 | HP2ISWKAI7CA |
| LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (C... | appliances | Air Conditioners | 4.2 | 2948 | 567.178 | 927.078 | HP2ISWKAI7CA |
| LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Cop... | appliances | Air Conditioners | 4.2 | 1206 | 420.778 | 756.278 | HP2ISWKAI7CA |
| LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (C... | appliances | Air Conditioners | 4 | 69 | 463.478 | 841.678 | HP2ISWKAI7CA |
| Carrier 1.5 Ton 3 Star Inverter Split AC (Copp... | appliances | Air Conditioners | 4.1 | 630 | 420.778 | 827.038 | HP2ISWKAI7CA |
+---------------------------------------------------+---------------+------------------+---------+---------------+--------------------+------------------+--------------+
Expected output:
+---------------------------------------------------+---------------+------------------+---------+---------------+--------------------+------------------+--------------+
| name | main_category | sub_category | ratings | no_of_ratings | discount_price_USD | actual_price_USD | ProductID |
+---------------------------------------------------+---------------+------------------+---------+---------------+--------------------+------------------+--------------+
| Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1... | appliances | Air Conditioners | 4.2 | 2255 | 402.5878 | 719.678 | HP2ISWKAI7CA |
| LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (C... | appliances | Air Conditioners | 4.2 | 2948 | 567.178 | 927.078 | 7JKD7LWUZPHC |
| LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Cop... | appliances | Air Conditioners | 4.2 | 1206 | 420.778 | 756.278 | 1ETULSX4WRJI |
| LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (C... | appliances | Air Conditioners | 4 | 69 | 463.478 | 841.678 | B42TSN4SFC20 |
| Carrier 1.5 Ton 3 Star Inverter Split AC (Copp... | appliances | Air Conditioners | 4.1 | 630 | 420.778 | 827.038 | RYIDD7N2RPI2 |
+---------------------------------------------------+---------------+------------------+---------+---------------+--------------------+------------------+--------------+
I'm clearly doing something wrong, but I can't figure out what.
The reason you are getting the same value in the ProductID column in this code:
for index, row in df3.iterrows():
    df3['ProductID'] = id_generator()
is that each pass through the loop assigns the value of id_generator() to the entire column, not to a single cell. So what you are left with is whatever value id_generator() returned on the last iteration of the for loop.
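To see this behaviour in isolation, here is a minimal sketch. It assumes a hypothetical three-row frame called demo standing in for df3, and it records every string the generator returns so the result can be compared with what ends up in the column:

```python
import random
import string

import pandas as pd


def id_generator(size=12, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))


# Hypothetical three-row stand-in for df3 (not the real data)
demo = pd.DataFrame({'name': ['a', 'b', 'c']})

generated = []
for index, row in demo.iterrows():
    value = id_generator()
    generated.append(value)
    # Assigning a single string to a column overwrites every row with it
    demo['ProductID'] = value

# One string was generated per iteration, but the column only holds
# the one from the final iteration
print(generated)
print(demo['ProductID'].tolist())
```

The loop runs once per row, yet each iteration replaces the whole column, so the work done by the earlier iterations is thrown away.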
One possible solution to this problem is to first create the ProductID column filled with NaN values, and then modify your id_generator() function so that apply() can be used on it. Here is what that would look like:
# Just added cell_val as the first argument: apply() passes each cell's
# current value (NaN here) into it, and the function overwrites it
def id_generator(cell_val, size=12, chars=string.ascii_uppercase + string.digits):
    cell_val = ''.join(random.choice(chars) for _ in range(size))
    return cell_val
# instantiate product id col with nan
df3['ProductID'] = np.nan
# apply your function to product id col
df3['ProductID'] = df3['ProductID'].apply(id_generator)
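As an alternative that leaves the original id_generator() signature unchanged, the IDs can be built with a list comprehension and assigned in a single step. This is only a sketch: the five-row df3 below is a hypothetical stand-in for the real data, and the uniqueness check is an addition not mentioned in the question.

```python
import random
import string

import pandas as pd


def id_generator(size=12, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))


# Hypothetical five-row stand-in for the real df3
df3 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e']})

# Call the generator once per row, then assign the whole list at once
df3['ProductID'] = [id_generator() for _ in range(len(df3))]

# Random strings are not guaranteed to be unique, so it is worth checking
print(df3['ProductID'].is_unique)
print(df3)
```

With 36 possible characters in each of 12 positions, a collision among roughly 550K rows is very unlikely, but if the IDs must be unique it is safer to check is_unique and regenerate any duplicates.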