I'm attempting to generate a code, some of which has to follow certain predefined rules(see commentary). I only need as many as there are df rows passed in - the code is assigned back to that same df on a simple per-row basis. Returning a list 'seems' less ideal than assigning directly to the df within the function, but i've not been able to achieve this. Unfortunately i need to pass in 3 df's separately due to other constraints in processing elsewhere, but each time they will have a different single character suffix(e.g. X|Y|Z). The codes do not 'need' to be sequential between the different df's, although having some sequencing in for each could be useful...and is the way i've attempted thus far.
However, my current 'working' attempt here, though functional....takes far too long. I am hopeful that someone can point out some possible wins for optimising any part of this. Typically each df is <500k, more usually 100-200k.
Sequence that takes the format: YrCodeMthCode+AAAA+99+[P|H|D] Where:
*The uniqueness of YrCodeMthCode+AAA+99 only needs to cover 500k records each month(as MthCode will change/refresh x12)
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(200, 3), columns=list('ABC'))
offerCodeLength = 6
allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
campaignMonth = 'January'
campaignYear = 2021
yearCodesDict = {2021:'G',2022:'H',2023:'I', 2024:'J', 2025:'K', 2026:'L', 2027:'M'}
monthCodesDict = {'January':'A','Febuary':'B','March':'C',
'April':'D','May':'E','June':'F',
'July':'G', 'August':'H','September':'I',
'October':'J','November':'K','December':'L'}
OfferCodeDateStr = str(yearCodesDict[campaignYear])+str(monthCodesDict[campaignMonth])
iterator = 0
breakPoint = df.shape[0]
def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, OfferCodeSuffix):
allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
iterator = 0 # limit amount generated
offerCodesList = []
for item in itertools.product(allowedOfferCodeChars, repeat=offerCodeLength):
# generate a 2 digit number, with NO zeros (to avoid 0 vs o call centre issues)
psuedoRandNumc = str(int(''.join(random.choices('123456789',k=randint(10,99))))%10**2)
if iterator < breakPoint: # breakpoint as length of associated dataframe/number of codes required
OfferCodeString = "".join(item)
OfferCodeString = OfferCodeDateStr+OfferCodeString+psuedoRandNum+OfferCodeSuffix # join Yr,Mth chars to generated rest
offerCodesList.append(OfferCodeString)
iterator +=1
return offerCodesList
generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, 'P')
Example timings on mine: (OffercodeLength set to just 4) x100 : 5.99 s ± 227 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Wall time: 47.5 s
x1000 : 5.87 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Wall time: 46.4 s
IIUC, you could try:
def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, offerCodeSuffix):
seen = set()
offerCodesList = list()
for i in range(breakPoint):
psuedoRandNum = ''.join(random.choices('123456789', k=2))
OfferCodeString = "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=6))
while OfferCodeString in seen:
OfferCodeString = "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=6))
seen.add(OfferCodeString)
offerCodesList.append(f"{OfferCodeDateStr}{OfferCodeString}{psuedoRandNum}{offerCodeSuffix}")
return offerCodesList
df["offerCode"] = generateOfferCode(YrCodeMthCode, 6, df.shape[0], 'P')
>>> df
A B C offerCode
0 1.764052 0.400157 0.978738 GAZGCPGE28P
1 2.240893 1.867558 -0.977278 GADYNNWU69P
2 0.950088 -0.151357 -0.103219 GAEQUFPI48P
3 0.410599 0.144044 1.454274 GAUCSCHW76P
4 0.761038 0.121675 0.443863 GAFMVTBP28P
.. ... ... ... ...
195 -0.470638 -0.216950 0.445393 GAOXGTOU88P
196 -0.392389 -3.046143 0.543312 GAXPQOFI25P
197 0.439043 -0.219541 -1.084037 GACBKIJV93P
198 0.351780 0.379236 -0.470033 GAVYQEQL46P
199 -0.216731 -0.930157 -0.178589 GALNKYVE23P
>>> %timeit generateOfferCode(YrCodeMthCode, 6, 500000, 'P')
%timeit generateOfferCode(YrCodeMthCode, 6, df.shape[0]*1000, 'P')
829 ms ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)