Consider “apparent temperature” measures like the heat index and the wind chill. These quantities attempt to measure the perceived temperature to humans based on air temperature, humidity, and wind speed, things we can measure directly. You could think of an apparent temperature as the result of a kind of feature engineering, an attempt to make the observed data more relevant to what we actually care about: how it feels outside!
Feature ratio = the number of new features created relative to the number of features in the original dataset.
Technical note: What we’re calling uncertainty is measured using a quantity from information theory known as “entropy”. The entropy of a variable means roughly: “how many yes-or-no questions you would need to describe an occurrence of that variable, on average.” The more questions you have to ask, the more uncertain you must be about the variable. Mutual information is how many questions you expect the feature to answer about the target.
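As a rough, minimal sketch of how mutual information can be computed in practice (the DataFrame and the "target" column name below are hypothetical placeholders, and scikit-learn's estimator is just one way to do it):

```python
# A minimal sketch: rank numeric features by mutual information with the target.
# `df` and the column name "target" are hypothetical placeholders.
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def mi_scores(df: pd.DataFrame, target: str) -> pd.Series:
    X = df.drop(columns=[target]).select_dtypes("number")  # numeric features only
    y = df[target]
    scores = mutual_info_regression(X, y, random_state=0)
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)

# Usage: mi_scores(df, "target") returns features ordered from most to least informative.
```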
Tips on Discovering New Features
- Understand the features. Refer to your dataset’s data documentation, if available.
- Research the problem domain to acquire domain knowledge. If your problem is predicting house prices, do some research on real estate, for instance. Wikipedia can be a good starting point, but books and journal articles will often have the best information.
- Study previous work. Solution write-ups from past Kaggle competitions are a great resource.
- Use data visualization. Visualization can reveal pathologies in the distribution of a feature or complicated relationships that could be simplified. Be sure to visualize your dataset as you work through the feature engineering process.
Tips on Creating Features
It’s good to keep in mind your model’s own strengths and weaknesses when creating features. Here are some guidelines:
- Linear models learn sums and differences naturally, but can’t learn anything more complex.
- Ratios seem to be difficult for most models to learn. Ratio combinations often lead to some easy performance gains (see the sketch after this list).
- Linear models and neural nets generally do better with normalized features. Neural nets especially need features scaled to values not too far from 0. Tree-based models (like random forests and XGBoost) can sometimes benefit from normalization, but usually much less so.
- Tree models can learn to approximate almost any combination of features, but when a combination is especially important they can still benefit from having it explicitly created, especially when data is limited.
- Counts are especially helpful for tree models, since these models don’t have a natural way of aggregating information across many features at once.
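As a minimal sketch of a ratio feature and a count feature in pandas (the column names are hypothetical placeholders, not from any particular dataset):

```python
# A minimal sketch of a ratio feature and a count feature.
# Column names are hypothetical placeholders.
import pandas as pd

df = pd.DataFrame({
    "LivingArea": [1500, 2400, 1800],
    "LotArea":    [6000, 9600, 5400],
    "HasPorch":   [1, 0, 1],
    "HasDeck":    [0, 1, 1],
    "HasPool":    [0, 0, 1],
})

# Ratio: how much of the lot the living area takes up.
df["LivingToLotRatio"] = df["LivingArea"] / df["LotArea"]

# Count: how many outdoor amenities the property has.
df["OutdoorFeatures"] = df[["HasPorch", "HasDeck", "HasPool"]].sum(axis=1)
```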
Clustering acts like a traditional "binning" or "discretization" transform. On multiple features, it's like "multi-dimensional binning" (sometimes called vector quantization).
- max_iter = the maximum number of iterations the algorithm runs for a single fit (how long the centroids are allowed to keep moving)
- n_clusters = how many clusters (centroids) to create
- n_init = how many times the algorithm is run with different initial centroids; the best result is kept

How should we choose these values? There is no single best or one-size-fits-all setting; it all depends on the algorithm we’re using and what we’re trying to predict. The best approach is hyperparameter tuning (cross-validation).
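As a minimal sketch of using k-means cluster labels as a feature with scikit-learn (the column names and parameter values are illustrative choices, not recommendations):

```python
# A minimal sketch: k-means cluster labels as a categorical feature.
# Column names and parameter values are illustrative only.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "Latitude":  [34.1, 36.5, 33.9, 37.7],
    "Longitude": [-118.3, -121.9, -118.4, -122.4],
})

kmeans = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=0)
df["Cluster"] = kmeans.fit_predict(df[["Latitude", "Longitude"]]).astype(str)
```

Treating the labels as strings makes it clear to downstream steps that "Cluster" is a categorical feature rather than a number.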
Deciding whether the data needs to be rescaled or not depends on domain knowledge of the data; with experience, the call often becomes instinctive.
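When rescaling is warranted, here is a minimal sketch with scikit-learn (StandardScaler is just one common choice; min-max scaling is another):

```python
# A minimal sketch of standardizing numeric features (zero mean, unit variance).
# Whether to rescale at all is the domain-knowledge call discussed above.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Area": [1500.0, 2400.0, 1800.0], "Rooms": [3, 5, 4]})
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df),
    columns=df.columns,
    index=df.index,
)
```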
We can also use the PCA components themselves as features, and PCA is useful for other tasks as well, for example dimensionality reduction, anomaly detection, and more.
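A minimal sketch of creating principal-component features with scikit-learn (the column names are hypothetical; the features are standardized first because PCA is sensitive to scale):

```python
# A minimal sketch: PCA components as new features.
# Column names are hypothetical placeholders.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Horsepower":   [130.0, 165.0, 150.0, 95.0],
    "Weight":       [3504.0, 3693.0, 3436.0, 2372.0],
    "Displacement": [307.0, 350.0, 318.0, 113.0],
})

X = StandardScaler().fit_transform(df)   # standardize before PCA
pca = PCA()
components = pca.fit_transform(X)
pc_df = pd.DataFrame(
    components, columns=[f"PC{i+1}" for i in range(components.shape[1])]
)

# pca.explained_variance_ratio_ tells you how much variance each component captures.
```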
encoding = weight * in_category + (1 - weight) * overall
where weight = n / (n + m). Here n is the number of times the category appears in the data, and m is a smoothing factor that we choose ourselves. Larger values of m put more weight on the overall estimate. When choosing a value for m, consider how noisy you expect the categories to be.
For example, if the chevrolet category gets a weight of 0.6, an in-category average of 6000.00, and an overall average of 13285.03, its encoding would be:

chevrolet = 0.6 * 6000.00 + 0.4 * 13285.03
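A minimal sketch of this smoothed (“m-estimate”) target encoding written directly in pandas, following the formulas above (the data and column names are hypothetical; in practice, fit the encoding on data held out from the model's training split to avoid leakage):

```python
# A minimal sketch of smoothed target encoding:
# blend each category's mean with the overall mean using weight = n / (n + m).
# The data and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "make":  ["chevrolet", "chevrolet", "bmw", "bmw", "bmw", "toyota"],
    "price": [6000.0, 6200.0, 26000.0, 30000.0, 28000.0, 9000.0],
})

m = 2.0                                  # smoothing factor we choose ourselves
overall = df["price"].mean()             # overall estimate
stats = df.groupby("make")["price"].agg(["mean", "count"])
weight = stats["count"] / (stats["count"] + m)
encoding = weight * stats["mean"] + (1 - weight) * overall

df["make_encoded"] = df["make"].map(encoding)
```

If you'd rather not hand-roll the formula, the category_encoders library provides an MEstimateEncoder that implements the same idea.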