python, pandas, python-re

Change the last word after a space in a DataFrame column


I am working on a DataFrame that contains computer names, and I am trying to anonymize them. Here is an example of the DataFrame I am working with:

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'computer_name': [u'LENOVO 09 X32H0GB', u'LENOVO vmhsbpmh613.xyz.biz', u'Dell Inc. PowerEdge R910 XKF2S75', u'HP  ppesfesxb203.corp.123.com', 'IBM SoftLayer 13 L89P4567']})

Here is what is required to anonymize it:

  1. Take the last space-separated token, i.e. everything after the first SPACE counted from the RIGHT. For example, for "LENOVO vmhsbpmh613.xyz.biz" it would be "vmhsbpmh613.xyz.biz".

  2. In that token, e.g. "vmhsbpmh613.xyz.biz", remove everything from the first dot (.) onward, which gives "vmhsbpmh613". If there is no dot, keep the token as it is. Note that the dot rule must apply only to this last token; otherwise, in an example like "Dell Inc. PowerEdge R910 XKF2S75", everything after the dot in "Dell Inc." would be removed.

  3. Lastly, replace the first 3 characters with xxx, giving for example xxxsbpmh613.

Here is how the output should look:

df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'computer_name': [u'LENOVO 09 xxxH0GB', u'LENOVO xxxsbpmh613', u'Dell Inc. PowerEdge R910 xxx2S75', u'HP  xxxsfesxb203', 'IBM SoftLayer 13 xxxP4567']})

I hope I was able to articulate the requirement clearly. Thanks.


Solution

  • Series.str.replace

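    # regex=True is needed in pandas >= 2.0, where str.replace treats the pattern as a literal by default
    df['computer_name'].str.replace(r'\S{3}(\S+?)(?:\.\S+|$)', r'xxx\1', regex=True)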
    

    0                   LENOVO 09 xxxH0GB
    1                  LENOVO xxxsbpmh613
    2    Dell Inc. PowerEdge R910 xxx2S75
    3                    HP  xxxsfesxb203
    4           IBM SoftLayer 13 xxxP4567
    Name: computer_name, dtype: object
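
As a quick sanity check, the same pattern can be run through the standard `re` module against the sample names from the question. This is only a sketch: the names `PATTERN`, `names`, `expected` and `result` are illustrative and not part of the original answer.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r'\S{3}(\S+?)(?:\.\S+|$)')

# Sample inputs and desired outputs, copied from the question.
names = [
    'LENOVO 09 X32H0GB',
    'LENOVO vmhsbpmh613.xyz.biz',
    'Dell Inc. PowerEdge R910 XKF2S75',
    'HP  ppesfesxb203.corp.123.com',
    'IBM SoftLayer 13 L89P4567',
]
expected = [
    'LENOVO 09 xxxH0GB',
    'LENOVO xxxsbpmh613',
    'Dell Inc. PowerEdge R910 xxx2S75',
    'HP  xxxsfesxb203',
    'IBM SoftLayer 13 xxxP4567',
]

# Replace every match with 'xxx' followed by the captured group.
result = [PATTERN.sub(r'xxx\1', name) for name in names]
```

Here `result` should compare equal to `expected` for the five sample rows.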
    

    Regex details

    • \S{3} : Matches any non-whitespace character exactly 3 times.
    • (\S+?) : Capturing group that matches any non-whitespace character between 1 and unlimited times, but as few times as possible (lazy match)
    • (?: : Beginning of non-capturing group
    • \. : Matches a literal . character
    • \S+ : Matches any non-whitespace character one or more times (greedy match)
    • | : Alternation; matches either the expression before it or the one after it
    • $ : Asserts position at the end of the string
    • ) : End of non-capturing group
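
For comparison, the three steps from the question can also be written out without a regex. The following is a minimal sketch, not the answer's method; the function name `anonymize` is hypothetical, and it assumes names are separated by plain spaces with no trailing whitespace.

```python
def anonymize(name: str) -> str:
    """Mask the last space-separated token of a computer name."""
    # Step 1: split off the last token (everything after the last space).
    # If there is no space, head and sep are empty and last is the whole name.
    head, sep, last = name.rpartition(' ')
    # Step 2: within that token only, drop everything from the first dot on.
    host = last.split('.', 1)[0]
    # Step 3: replace the first three characters with 'xxx'.
    # Tokens shorter than three characters become just 'xxx'.
    return head + sep + 'xxx' + host[3:]
```

On a DataFrame this could be applied with `df['computer_name'].map(anonymize)`. Because it only ever touches the last token, it leaves earlier words alone, whereas the regex may also rewrite an earlier token of four or more characters followed by a dot and more text (for example a hypothetical `abcd.ef`).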
