I'm trying to build a Python machine learning model that can predict the column names of unseen columns, based on previous datasets. As a simplified example, a training dataframe can look like:
Currency | Security Number |
---|---|
USD | 000402625 |
CAD | 001477825 |
USD | 200398025 |
USD | 000403458 |
JPY | 099402464 |
EUR | 458592625 |
The model would learn to distinguish currencies from security numbers, so that feeding it this test dataframe:
X | Y |
---|---|
CAD | 500235025 |
CAD | 200394855 |
EUR | 999398025 |
EUR | 234890578 |
USD | 980758345 |
JPY | 123754890 |
would identify column X as Currency and column Y as Security Number.
I've done some research but couldn't find anything that allows predictions based on full column data. Any help would be appreciated.
Since all possible currencies are known, you can get 100% accuracy on that column by simply checking its values against a known list instead of making a prediction with a model.
More generally, you can put all your data into one big table (an Excel sheet, for example) in which each row holds a value and its label. Then you shuffle the rows so their order is random, and train a classifier on the whole thing.
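A minimal sketch of that lookup check, in plain Python. The currency set below is only an assumed subset of the ISO 4217 codes, and `is_currency_column` and its `threshold` parameter are hypothetical names chosen for illustration:

```python
# Assumed subset of ISO 4217 currency codes; load the full list in practice.
KNOWN_CURRENCIES = {"USD", "CAD", "JPY", "EUR", "GBP", "CHF", "AUD"}


def is_currency_column(values, threshold=0.9):
    """Return True if at least `threshold` of the values are known currency codes."""
    cleaned = [str(v).strip().upper() for v in values if v is not None]
    if not cleaned:
        return False
    matches = sum(v in KNOWN_CURRENCIES for v in cleaned)
    return matches / len(cleaned) >= threshold


# Test columns from the question, stored as strings.
test_columns = {
    "X": ["CAD", "CAD", "EUR", "EUR", "USD", "JPY"],
    "Y": ["500235025", "200394855", "999398025",
          "234890578", "980758345", "123754890"],
}

currency_columns = [name for name, col in test_columns.items()
                    if is_currency_column(col)]
print(currency_columns)
```

The threshold is there so a few typos or empty cells do not disqualify an otherwise obvious currency column.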
Value | Label |
---|---|
USD | Currency |
001477825 | Security Number |
000403458 | Security Number |
EUR | Currency |
If you add enough data, the model should be able to predict that "BLA" is a currency and that 349834989 is a security number. Neither value is actually correct, but the predictions should be close enough to what you need :) This is what happens if you use machine learning :)
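Here is a rough sketch of the value/label idea, assuming scikit-learn is installed. No particular model is required for this, so the character n-gram naive Bayes classifier is just one reasonable choice, and `predict_column_name` is a hypothetical helper that labels a whole column by majority vote over its cells:

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training columns from the question, kept as strings so leading zeros survive.
train_columns = {
    "Currency": ["USD", "CAD", "USD", "USD", "JPY", "EUR"],
    "Security Number": ["000402625", "001477825", "200398025",
                        "000403458", "099402464", "458592625"],
}

# Flatten into one (value, label) pair per cell, like the Value | Label table.
values, labels = [], []
for label, column in train_columns.items():
    for value in column:
        values.append(value)
        labels.append(label)

# Character 1- and 2-grams let the model pick up on "letters vs. digits".
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    MultinomialNB(),
)
model.fit(values, labels)


def predict_column_name(column_values):
    """Classify every cell, then return the most common predicted label."""
    cell_predictions = model.predict([str(v) for v in column_values]).tolist()
    return Counter(cell_predictions).most_common(1)[0][0]


test_columns = {
    "X": ["CAD", "CAD", "EUR", "EUR", "USD", "JPY"],
    "Y": ["500235025", "200394855", "999398025",
          "234890578", "980758345", "123754890"],
}
predicted = {name: predict_column_name(col) for name, col in test_columns.items()}
print(predicted)
```

Voting over the whole column means a few misclassified cells do not change the column's label. With only twelve training rows this is a toy, so treat the predictions as illustrative.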
BUT
You will run into problems if you have several columns that all contain numbers. In that case, the numbers in each column need to follow a pattern that the model can associate with that column, and that might simply not be the case.