Feature Generation: What It Is and How to Do It

Generating high quality features that can improve your ML model performance.

In our everyday life we are faced with decisions. One reason we struggle to take a decision is that, most of the time, it involves more than one objective. For instance, buying a car isn't just about buying the best car; it is about buying a car that you can afford, that is the right size and colour for you, that doesn't consume too much fuel, that is environmentally friendly, and so on. But each time you find a car that fulfils some of these criteria, it seems to fall short on the others.

Feature Generation

Before we get into the details, let's review what a feature is. A feature (or column) represents a measurable piece of data such as name, age or gender. It is the basic building block of a dataset. The quality of a feature can vary significantly and has a large effect on model performance. We can improve the quality of a dataset's features in the pre-processing stage using processes like Feature Generation and Feature Selection.

Feature Generation (also known as feature construction, feature extraction or feature engineering) is the process of transforming features into new features that better relate to the target. This can involve mapping a feature into a new feature using a function like log, or creating a new feature from one or multiple features using multiplication or addition.
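As a minimal sketch of the two kinds of operation described above, the snippet below maps one feature through a log and builds another feature by multiplying two columns. The column names (`income`, `width`, `height`) and all values are made up for illustration.

```python
import math

# Hypothetical toy dataset: each row is a dict of raw features.
rows = [
    {"income": 20_000.0, "width": 2.0, "height": 3.0},
    {"income": 50_000.0, "width": 4.0, "height": 1.5},
    {"income": 1_000_000.0, "width": 1.0, "height": 10.0},
]

def generate_features(row):
    """Return a copy of the row with two generated features added."""
    new_row = dict(row)
    # Map one feature to a new one with a function (log compresses large values).
    new_row["log_income"] = math.log(row["income"])
    # Combine two features into a new one with multiplication.
    new_row["area"] = row["width"] * row["height"]
    return new_row

augmented = [generate_features(r) for r in rows]
for r in augmented:
    print(r)
```

The original columns are kept alongside the generated ones, so a later selection step can decide which of them are worth keeping.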

Figure 1. Feature Generation | Image by Jonas Meier

Feature Generation can improve model performance when there is a feature interaction. Two or more features interact if their combined effect differs from (is greater or less than) the sum of their individual effects. It is possible to build interactions from three or more features, but this tends to result in diminishing returns.
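To make the definition of an interaction concrete, here is a small sketch that fits no model at all. It compares the effect of changing two features together with the sum of the effects of changing each one alone. The two target functions and the numbers are illustrative assumptions.

```python
def additive_target(width, height):
    # No interaction: each feature contributes independently.
    return 3.0 * width + 2.0 * height

def interacting_target(width, height):
    # Interaction: the effect of width depends on the value of height.
    return width * height

def interaction_strength(f, a0, a1, b0, b1):
    """Combined effect minus the sum of the individual effects.

    Zero means the two features do not interact over this change.
    """
    combined = f(a1, b1) - f(a0, b0)
    effect_a = f(a1, b0) - f(a0, b0)
    effect_b = f(a0, b1) - f(a0, b0)
    return combined - (effect_a + effect_b)

no_interaction = interaction_strength(additive_target, 1.0, 2.0, 1.0, 3.0)
with_interaction = interaction_strength(interacting_target, 1.0, 2.0, 1.0, 3.0)
print(no_interaction)    # 0.0
print(with_interaction)  # 2.0
```

A generated feature such as `width * height` hands this interaction to the model directly, so even a purely additive model can make use of it.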

Feature Generation is often overlooked as it is assumed that the model will learn any relevant relationships between features to predict the target variable. However, the generation of new flexible features is important as it allows us to use less complex models that are faster to run and easier to understand and maintain.

Feature Selection

In fact, not all features generated are relevant. Moreover, too many features may adversely affect the model performance. This is because as the number of features increases, it becomes more difficult for the model to learn mappings between features and target (this is known as the curse of dimensionality). Thus it is important to select the most useful features through Feature Selection, which we will further introduce in our next blog.

Figure 2. Difference between feature selection and feature extraction | Image by Abhishek Singh

Examples of Feature Generation techniques

A transformation is a mapping that is used to transform a feature into a new feature. The right transformation depends on the type and structure of the data, the size of the data and the goal. This can involve transforming a single feature into a new feature using standard operators such as log, square, power, exponential and reciprocal, or combining features using addition, division, multiplication and so on.
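One way to see how the right transformation depends on the structure of the data is to measure a feature's skewness before and after a log transform. This is a common heuristic brought in here for illustration, not a rule taken from this article. The `incomes` values are hypothetical.

```python
import math

def skewness(values):
    """Population skewness: the third standardised moment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    return sum(((v - mean) / std) ** 3 for v in values) / n

# Hypothetical right-skewed feature; all values are positive, which log requires.
incomes = [18_000, 22_000, 25_000, 30_000, 41_000, 60_000, 250_000, 1_200_000]
log_incomes = [math.log(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
print(f"skew before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

For this made-up sample the log-transformed feature is less skewed than the raw one. A feature that is already roughly symmetric would not be expected to benefit in the same way.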

Often the relationship between dependent and independent variables is assumed to be linear, but this is not always the case. Some feature combinations cannot be represented by a linear system. In such cases, a new feature can be created from a polynomial combination of the numeric features in a dataset. New features can also be created using trigonometric combinations.
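The polynomial and trigonometric combinations mentioned above can be sketched in plain Python as follows. The helper names `polynomial_features` and `trigonometric_features` are hypothetical and written for this example; they are not taken from any particular library.

```python
import math
from itertools import combinations_with_replacement

def polynomial_features(row, degree=2):
    """All products of the input features up to the given degree.

    `row` maps feature names to numeric values; the returned dict maps
    generated names such as "x*y" to their values.
    """
    names = sorted(row)
    out = {}
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            value = 1.0
            for name in combo:
                value *= row[name]
            out["*".join(combo)] = value
    return out

def trigonometric_features(row):
    """Sine and cosine of every input feature."""
    out = {}
    for name, value in row.items():
        out[f"sin({name})"] = math.sin(value)
        out[f"cos({name})"] = math.cos(value)
    return out

row = {"x": 2.0, "y": 3.0}
poly = polynomial_features(row, degree=2)
trig = trigonometric_features(row)
print(poly)  # {'x': 2.0, 'y': 3.0, 'x*x': 4.0, 'x*y': 6.0, 'y*y': 9.0}
print(sorted(trig))
```

Note how quickly the number of generated features grows with the degree and the number of inputs, which is one reason generated features are usually filtered afterwards.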

Manual vs Automated feature generation

Feature Generation has traditionally been an ad-hoc manual process that depends on domain knowledge, intuition, data exploration and creativity. However, this process is dataset-dependent, time-consuming, tedious and subjective, and it does not scale. Automated Feature Generation generates features automatically using a framework; these features can then be filtered using Feature Selection to avoid feature explosion. Several popular open source libraries exist for automated feature engineering.
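The generate-then-filter loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any specific library or product works: the operator set is fixed and small, the filter is a simple absolute Pearson correlation with the target, and the data is made up so that the target equals `width * height`.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def generate_candidates(columns):
    """Apply a fixed set of unary and binary operators to every column."""
    candidates = dict(columns)
    for name, values in columns.items():
        candidates[f"{name}^2"] = [v * v for v in values]
        if all(v > 0 for v in values):  # log is only defined for positive values
            candidates[f"log({name})"] = [math.log(v) for v in values]
    for a, b in combinations(columns, 2):
        candidates[f"{a}*{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
        candidates[f"{a}+{b}"] = [x + y for x, y in zip(columns[a], columns[b])]
    return candidates

def select_top(candidates, target, k):
    """Keep the k candidates most correlated (in absolute value) with the target."""
    ranked = sorted(
        candidates,
        key=lambda name: abs(pearson(candidates[name], target)),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical data: the target is exactly width * height.
columns = {
    "width": [1.0, 2.0, 3.0, 4.0, 5.0],
    "height": [5.0, 1.0, 4.0, 2.0, 3.0],
}
target = [w * h for w, h in zip(columns["width"], columns["height"])]

candidates = generate_candidates(columns)
best = select_top(candidates, target, k=3)
print(len(candidates), best)
```

Two input columns already produce eight candidates here, and the count grows rapidly with more columns and operators, which is the feature explosion that the selection step is meant to contain.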

Optimise feature generation with EvoML

However, open source libraries may not provide the customisation you need for your unique data science projects. With EvoML, you can customise the automated feature generation process and get better features, and better model results, faster.

EvoML can automatically transform, select and generate the most suitable features depending on the characteristics of the dataset. Our data scientists have integrated all the common feature generation methods as well as each method’s best practices (what types of dataset the method is most suitable for) into EvoML. Given a dataset, EvoML automatically tries different combinations of feature generation methods and selects the best ones. Furthermore, EvoML gives users the flexibility to choose which methods they prefer, so that they can easily customise it to their needs.

Figure 3. Data feature analysis on EvoML platform

Conclusion

Feature Generation involves creating new features which can improve model accuracy. In addition to manual processes, there are several frameworks that can be used to automatically generate new features that expose the underlying problem space. These features can subsequently be filtered using Feature Selection, to ensure that only a subset of the most important features is used. This process effectively reduces model complexity and improves model accuracy as well as interpretability.

About the Author

Dr Manal Adham | TurinTech Research Team

Passionate about bridging the link between research in AI and real-world problems. I love to travel around the world and collect different experiences.