Search code examples
pythonpython-3.xpandastext-miningvin

Text mining a large list of Notes for Vehicle Identification Number (VIN#) with Python


I have a large data set of Insurance Claims data with 2 columns. One column is a claim identifier. The other is a large string of notes that go with the claim.

My goal is to text mine the Claims Notes for a specific VIN number. Typically a VIN# is in a 17 digit format. See Below: https://www.autocheck.com/vehiclehistory/autocheck/en/vinbasics

However, with my data, some issues arise. Sometimes only the last 6 digits were input for a VIN#. I basically need a way to process my data and grab anything that looks like a 17 digit VIN Number and return it to that row of data. I am using Python 3 and am a rookie text miner but have some basic experience using regular expressions.

I am trying to create a function in python in which I can lambda apply it to the column of notes.

Attempt so far:

C_Notes['VIN#s'] = C_Notes['ClaimsNotes'].str.findall(r'[0-9]{1}[0-9a-zA-Z]{16}') 

I am trying to mimic the format of the VIN in the link I provided.

So something that looks for a string with following qualities:

EDIT: Changed code snippet. This code example works if I make some toy examples of VINs with made up text but I am not having any success iterating through my pandas column. Each row entry has a large paragraph of text I want the function to go through each row at a time.

Thank you.


Solution

  • But with which VIN system exactly you are dealing?

    Wikipedia has article describing 17-digit VIN number, describing three different systems: ISO 3779, European and North American.

    Apparently there is not general formal rule dictating what (onlyletter/onlydigit/letterordigit) occupies which position.

    First 3 characters depend on manufacturer country/region and first of them is digit for Oceania and both Americas but letter for everyone else.

    For North American 9th, 13th, 14th, 15th, 16th and 17th positions are always digits and letters I,O,Q are never used.

    Taking in account above considerations following pattern could be used:

    [0-9][0-9A-Za-z^IiOoQq]{7}[0-9][0-9A-Za-z^IiOoQq]{3}[0-9]{5}
    

    Which consist of digits ([0-9]) and letterordigits but which are not prohibited characters. ^ inside [] mean that blacklist of characters will follow.

    As more general note I advise against guessing regular expression based on limited subset of legal strings.