
How many features are needed to extract for PCA using python tsfresh


I'm working on a project using data from a gas sensor array (15 sensors). When I sample a gas, the instrument produces a time series of the resistance measured across each sensor. Since I wanted to do exploratory data analysis, to see whether this instrument can separate different samples, I extracted features from the time-series curves using the Python library tsfresh.

Tsfresh generates a lot of features for every time series (around 800, I think) using different methods (steady state, Fourier transform, etc.).

My questions are:

Is it OK if I use PCA for dimensionality reduction on all these features, or should I select a subset of features prior to the analysis?

How can this large number of features affect the results on the PCA plot?

Here is the exact function I used:

from tsfresh import extract_features

extracted_features = extract_features(joined_dfs_final, column_id="id", column_sort="time")
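For context, extract_features expects its input in long format: one row per (sample, time point), with the sensor readings as value columns. The frame below is a hypothetical sketch of what joined_dfs_final might look like, using the column names from the call above:

```python
import pandas as pd

# Hypothetical sketch of the long-format input for tsfresh's extract_features:
# "id" identifies each measured sample, "time" gives the sort order within it,
# and the remaining columns are the per-sensor resistance readings.
joined_dfs_final = pd.DataFrame({
    "id":       [0, 0, 0, 1, 1, 1],
    "time":     [0, 1, 2, 0, 1, 2],
    "sensor_1": [1.0, 1.1, 1.3, 2.0, 2.2, 2.1],
    "sensor_2": [0.5, 0.4, 0.6, 0.9, 1.0, 0.8],
})
print(list(joined_dfs_final.columns))
```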

Solution

  • Summary: just try PCA and see if it helps you. There are many good guides; if you are new to any data or computing method, I always recommend looking for a guide on Machine Learning Mastery - here is the guide to Principal Component Analysis

    Body: There are two solid ways of thinking about this question.

    Philosophy 1) Deliberate naivety: the methods are powerful, so just apply them and see if they give you a result that helps you.

    Philosophy 2) Thoughtful work: think through the relevance and assumptions of each method and its suitability for your use case, then apply only the valuable methods, or tailor the settings to your use case.

    It is very easy to think that Philosophy 2 is the best way, but the big success of data science is that many of its methods work fairly well even when their assumptions are a poor fit for the data at hand.

    Directly answering your questions:

    Is it OK if I use PCA for dimensionality reduction on all these features, or should I select a subset of features prior to the analysis?

    PCA is designed to reduce the number of features. Around 800 is a large number of features for a human, but not necessarily large for datasets more generally; for comparison, OpenAI's language models use vocabularies of roughly 100,000 tokens to encode English text.
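    As a concrete sketch (the data and names here are hypothetical stand-ins for the tsfresh output), the usual recipe is to clean out NaN/infinite columns, standardise, and fit PCA; the explained variance ratio then tells you how much of the data's structure the first few components retain:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical stand-in for the tsfresh output: 30 samples x 200 features,
# driven by a handful of underlying factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 4))
extracted_features = pd.DataFrame(
    latent @ rng.normal(size=(4, 200)) + 0.05 * rng.normal(size=(30, 200))
)

# tsfresh output often contains NaN or infinite columns; drop them first.
X = extracted_features.replace([np.inf, -np.inf], np.nan).dropna(axis=1)

# PCA is scale-sensitive, so standardise every feature before fitting.
pca = PCA(n_components=2)
components = pca.fit_transform(StandardScaler().fit_transform(X))

print(components.shape)                     # 2-D coordinates, one row per sample
print(pca.explained_variance_ratio_.sum())  # share of variance kept by 2 components
```

    If the first two or three components already capture most of the variance, a 2-D or 3-D scatter of `components` is a reasonable way to look for separation between samples.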

    How can this large number of features affect the results on the PCA plot?

    Getting PCA to work on your data is easier than understanding the effects. If you can reduce your large number of dimensions to 2 or 3, then you can visualise the output with a graph. With more than 3 output dimensions, a PCA scatter plot will merely show you 2 of them at a time; the human brain is not built to picture more than 3 dimensions. There are, however, many measures for examining the quality of your output model. If you are doing unsupervised learning such as clustering, you might use the silhouette score to tell you how good your clusters are.
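    For example, scikit-learn's silhouette_score rates how well-separated a clustering is. This sketch uses made-up 2-D coordinates standing in for PCA output:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical 2-D PCA output: two tight, well-separated groups of samples.
rng = np.random.default_rng(2)
group_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
components = np.vstack([group_a, group_b])

# Cluster the PCA coordinates, then score the clustering.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(components)

# Silhouette ranges from -1 (poor) to +1 (dense, well-separated clusters).
score = silhouette_score(components, labels)
print(round(score, 2))
```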

    You might be trying to classify the output into a limited number of real-world scenarios, e.g. using your gas sensor to figure out which of 32 weather categories is currently happening, or you might be predicting a single-value metric such as the cleanliness of the air output by an air purifier. There are scores for these scenarios too.
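    In the classification case, the simplest such score is held-out accuracy. A sketch, again on hypothetical PCA-reduced features with made-up gas labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hypothetical PCA-reduced sensor features for two known gases (labels 0 and 1).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.5, size=(25, 2)),
               rng.normal(4.0, 0.5, size=(25, 2))])
y = np.array([0] * 25 + [1] * 25)

# Hold out part of the data so the score reflects unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

    For a single-value metric (the air-purifier case), you would fit a regressor instead and report something like mean absolute error or R².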