python · python-3.x · metadata · pipeline · feature-engineering

Does metadata of a data frame help build features for ML algorithms?


Recently I was given a task by a potential employer to do the following:

- transfer a data set to S3
- create metadata for the data set
- create a feature for the data set in Spark

Now, this is a trainee position, and I am new to data engineering concepts, so I am having trouble understanding how, or even if, metadata is used to create a feature.

I have gone through numerous sites on feature engineering and metadata, but none of them really indicate whether metadata is directly used to build a feature.

What I have gathered so far is that when you build a feature, you extract certain columns from a given data set and put that information into a feature vector for the ML algorithm to learn from. So it seems to me you could build a feature directly from the data set and not be concerned with the metadata.
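To make my understanding concrete, here is a toy sketch of that direct approach (the car-price schema and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical car-price data set -- the real schema is not specified.
df = pd.DataFrame({
    "model": ["vw golf", "vw golf", "vw fox"],
    "year":  [2018, 2019, 2018],
    "price": [20000, 18500, 15000],
})

# Build the feature matrix and target directly from selected columns:
# no metadata involved, just column selection on the data set itself.
X = df[["year"]].to_numpy()   # feature: model year
y = df["price"].to_numpy()    # target: price for that year
```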

However, I am wondering if it is common to use metadata to search for given information across multiple datasets to build the feature, i.e. you look in the metadata file for certain criteria that fit the feature you are building, then load the data referenced by the metadata and build the feature from there to train the model.

As an example, say I have multiple files for certain car models from a manufacturer (VW Golf, VW Fox, etc.), each containing the year and the price of the car for that year, and I would like the ML algorithm to predict the future depreciation of the car, or the depreciation of the newest model of that car for years to come. Instead of going directly through all the datasets, you would check the metadata (tags, if that is the correct wording) for certain attributes that fit the feature, and then use those tags to load the data in from the specific data sets to train the model.
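The metadata-driven lookup I am imagining would look something like this (the file names and metadata fields are hypothetical, just to show the idea of selecting datasets via metadata before loading any data):

```python
# Hypothetical metadata records, one per data file stored e.g. on S3.
metadata = [
    {"file": "golf_prices.csv", "model": "vw golf", "columns": ["year", "price"]},
    {"file": "fox_prices.csv",  "model": "vw fox",  "columns": ["year", "price"]},
    {"file": "sales_log.csv",   "model": None,      "columns": ["date", "region"]},
]

def files_for_feature(meta, required_columns):
    """Return files whose metadata says they contain the columns needed
    to build the feature -- the actual data is only loaded afterwards."""
    needed = set(required_columns)
    return [m["file"] for m in meta if needed.issubset(m["columns"])]

# Only the datasets that can supply a depreciation feature are selected;
# the third file is skipped without ever being opened.
candidates = files_for_feature(metadata, ["year", "price"])
```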

I could very well be off base here, or the example I gave above may be completely wrong, but if anyone could explain how metadata can be used to build features (if it can at all), that would be appreciated, or even share links to data engineering websites that explain it. Over the last day or two of researching, I have found that there is more material on data science than on data engineering itself, and most data engineering info comes from blogs, so I feel like there is pre-existing knowledge I am supposed to have when reading them.

P.S. Though this is not a coding question, I have used the python tag as it seems most data engineers use Python.


Solution

  • I'll give a synopsis on this. Here we need to consider two questions: 1) Do we have features that directly relate to building the ML model? 2) Are we facing data scarcity? Always ask what the problem statement suggests about generating features. There are many ways to generate features from a given dataset: dimensionality-reduction techniques like PCA, truncated SVD, and t-SNE create new features from the given features, and feature-engineering techniques like Fourier features, trigonometric features, etc. do the same. Beyond that, metadata, such as the type of a feature, its size, or the time when it was extracted, can also help in creating features for building ML models, but this depends on how feature engineering was performed on the data corpus of the respective problem.
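To illustrate the "new features from given features" point above, here is a minimal PCA sketch in plain NumPy (the data is random and the dimensions are arbitrary; in practice you would use a library implementation such as scikit-learn's):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 original features

Xc = X - X.mean(axis=0)                # centre the data
# Eigen-decomposition of the covariance matrix gives the principal axes.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # sort axes by explained variance
components = eigvecs[:, order[:2]]     # keep the top 2 principal axes

X_new = Xc @ components                # 2 new features derived from the 3 originals
```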