Search code examples
pythonpandassequencegenerate

IS it possible to optimise this sequence/code generation within the given format/syntax constraints?


I'm attempting to generate a code, some of which has to follow certain predefined rules(see commentary). I only need as many as there are df rows passed in - the code is assigned back to that same df on a simple per-row basis. Returning a list 'seems' less ideal than assigning directly to the df within the function, but i've not been able to achieve this. Unfortunately i need to pass in 3 df's separately due to other constraints in processing elsewhere, but each time they will have a different single character suffix(e.g. X|Y|Z). The codes do not 'need' to be sequential between the different df's, although having some sequencing in for each could be useful...and is the way i've attempted thus far.

However, my current 'working' attempt here, though functional....takes far too long. I am hopeful that someone can point out some possible wins for optimising any part of this. Typically each df is <500k, more usually 100-200k.

Generate an offer code

Desired outcome:

Sequence that takes the format: YrCodeMthCode+AAAA+99+[P|H|D] Where:

  • YrCode and Mth code are supplied*
  • AAAA a generated psuedo unique char sequence*
  • 99 should not contain zeros, and is always 2 digits* (Any, Incl non-sequential)
  • P|H|D is a defined identifier argument, must be passed in
  • Typically the df.shape[0] dimensions are never more than 65. But happy to create blank/new and merge with existing if faster.

*The uniqueness of YrCodeMthCode+AAA+99 only needs to cover 500k records each month(as MthCode will change/refresh x12)

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(200, 3), columns=list('ABC'))



offerCodeLength = 6
allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

campaignMonth = 'January'
campaignYear = 2021

yearCodesDict = {2021:'G',2022:'H',2023:'I', 2024:'J', 2025:'K', 2026:'L', 2027:'M'}
monthCodesDict = {'January':'A','Febuary':'B','March':'C',
                  'April':'D','May':'E','June':'F',
                  'July':'G', 'August':'H','September':'I',
                  'October':'J','November':'K','December':'L'}


OfferCodeDateStr = str(yearCodesDict[campaignYear])+str(monthCodesDict[campaignMonth])

iterator = 0
breakPoint = df.shape[0]



def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, OfferCodeSuffix):
    
    allowedOfferCodeChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    
    iterator = 0 # limit amount generated
    offerCodesList = []
    for item in itertools.product(allowedOfferCodeChars, repeat=offerCodeLength): 
        
        # generate a 2 digit number, with NO zeros (to avoid 0 vs o call centre issues)
        psuedoRandNumc = str(int(''.join(random.choices('123456789',k=randint(10,99))))%10**2)
        
        if iterator < breakPoint: # breakpoint as length of associated dataframe/number of codes required
            OfferCodeString = "".join(item)
            OfferCodeString = OfferCodeDateStr+OfferCodeString+psuedoRandNum+OfferCodeSuffix # join Yr,Mth chars to generated rest

            offerCodesList.append(OfferCodeString) 

            iterator +=1 

    return offerCodesList

generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, 'P')
  • Pretty sure this is less than ideal as'k=randint(10,99))))%10**2' but unsure as to how to better optimise....sliced string?
  • I'm only defining the breakpoint outside as when i used .shape[0] directly it appeared even slower.
  • I'm aware that my loop use is probably poor, and there has to be a more vectorised solution in only creating what i need and applying it directly back to the passed df.

Example timings on mine: (OffercodeLength set to just 4) x100 : 5.99 s ± 227 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Wall time: 47.5 s

x1000 : 5.87 s ± 243 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Wall time: 46.4 s


Solution

  • IIUC, you could try:

    def generateOfferCode(OfferCodeDateStr, offerCodeLength, breakPoint, offerCodeSuffix):
        seen = set()
        offerCodesList = list()
        
        for i in range(breakPoint):
            psuedoRandNum = ''.join(random.choices('123456789', k=2))
            OfferCodeString = "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=6))
            while OfferCodeString in seen:
                OfferCodeString = "".join(random.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=6))
            seen.add(OfferCodeString)
            offerCodesList.append(f"{OfferCodeDateStr}{OfferCodeString}{psuedoRandNum}{offerCodeSuffix}")
        return offerCodesList
    
    df["offerCode"] = generateOfferCode(YrCodeMthCode, 6, df.shape[0], 'P')
    
    >>> df
                A         B         C    offerCode
    0    1.764052  0.400157  0.978738  GAZGCPGE28P
    1    2.240893  1.867558 -0.977278  GADYNNWU69P
    2    0.950088 -0.151357 -0.103219  GAEQUFPI48P
    3    0.410599  0.144044  1.454274  GAUCSCHW76P
    4    0.761038  0.121675  0.443863  GAFMVTBP28P
    ..        ...       ...       ...          ...
    195 -0.470638 -0.216950  0.445393  GAOXGTOU88P
    196 -0.392389 -3.046143  0.543312  GAXPQOFI25P
    197  0.439043 -0.219541 -1.084037  GACBKIJV93P
    198  0.351780  0.379236 -0.470033  GAVYQEQL46P
    199 -0.216731 -0.930157 -0.178589  GALNKYVE23P
    
    Performance
    >>> %timeit generateOfferCode(YrCodeMthCode, 6, 500000, 'P')
    %timeit generateOfferCode(YrCodeMthCode, 6, df.shape[0]*1000, 'P')
    829 ms ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)