MUR044 The Five Types and Four Measures of Feature Engineering (Ver 0.1)

Table of Contents

The Five Types and Four Measures of Feature Engineering

Day 44 of the 紫式晦澀 daily article series

Preface

  1. Today is day 42 of 2022, week 6 of the year, the second Friday of February. Today we add to our knowledge of feature engineering.

  2. Today's material comes mainly from:

Code: FEB (Feature Engineering Bookcamp)

| Feature engineering task | Core idea | Practical techniques |
| --- | --- | --- |
| Feature Improvement | Making existing features more usable through mathematical transformations | Imputation (filling in missing data); scaling; normalization |
| Feature Construction | Augmenting the dataset by creating net-new interpretable features from existing interpretable features | Multiplying/dividing features together; joining with a new dataset |
| Feature Selection | Choosing the best subset of features from an existing set of features | Hypothesis testing; Recursive Feature Elimination |
| Feature Extraction | Relying on algorithms to automatically create new, sometimes uninterpretable features, usually by making parametric assumptions about the data | Principal Component Analysis; Singular Value Decomposition |
| Feature Learning | Automatically generating a brand-new set of features, usually by extracting structure and learning representations from raw unstructured data | Generative Adversarial Networks; autoencoders; Restricted Boltzmann Machines |
The four feature engineering measures:

  • Machine Learning Metrics
  • Interpretability
  • Fairness and Bias
  • ML complexity and speed

The Five Feature Engineering Tasks

Structured data: do feature engineering with mathematical transformations.
Unstructured data: do feature engineering with deep learning.

Task 1: Feature Improvement

FEB645 Feature improvement: transform structured features via imputation, standardization, and normalization

  1. Feature improvement: augment existing structured features through transformations
  2. Rule of thumb: apply mathematical transformations to numerical features
  3. Improvement pipeline: imputation (filling in missing values) → standardization → normalization

Feature improvement techniques deal with augmenting existing structured features through various transformations (figure 2.8). This generally takes the form of applying transformations to numerical features. Common improvement steps are imputing (filling in) missing data values, standardization, and normalization. We will start to dive into these and other feature improvement techniques in the first case study.
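A minimal sketch of the standardization and normalization steps named above, using scikit-learn; the library choice and the toy matrix are my own, not the book's:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A tiny toy feature matrix (two numerical features); values are invented.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: rescale each column to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max scaling): squeeze each column into [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0))  # each column now has mean ~0
print(X_norm.min(axis=0), X_norm.max(axis=0))
```

Both transformations leave the number of features unchanged; they only make the existing ones more usable, which is the defining trait of feature improvement.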


  • Bayesian methods can perform a great many kinds of imputation, with various forms of interpretability and useful statistical properties.

FEB646 Imputing missing values: nearest neighbors for nominal/ordinal data; summary statistics for interval/ratio data.

  1. The appropriate type of feature improvement depends on the data's level of measurement.
  2. For example, a feature with missing values can be repaired by imputation.
  3. Nominal and ordinal imputation: use the nearest-neighbor algorithm to predict the missing value from the other features
  4. Interval and ratio imputation: use a Pythagorean mean or the median to predict the missing value
  5. When there are many outliers, substitute the median for the arithmetic mean

Going back to our levels of data, the type of feature improvement we are allowed to perform depends on the level of data that the feature in question lives in. For example, let’s say we are dealing with a feature that has missing values in the dataset. If we are dealing with data at the nominal or ordinal level then we can impute - fill in - missing values by using the most common value (the mode) of that feature or by using the nearest neighbor algorithm to “predict” the missing value based on other features. If the feature lives in the interval or ratio level, then we can impute using one of our Pythagorean means or perhaps using the median. In general, if our data has a lot of outliers, we would rather use the median (or the geometric/harmonic mean if appropriate) and we would use the arithmetic mean if our data didn’t have as many outliers.


  • Summary statistics: impute all kinds of numerical data
  • Nearest-neighbor algorithm: impute nominal and ordinal data
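The level-appropriate imputation above can be sketched with scikit-learn's SimpleImputer; the columns and values below are hypothetical, chosen only to illustrate the two cases:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "blood_type": ["A", "B", np.nan, "A", "A"],       # nominal level
    "temperature": [36.5, np.nan, 37.0, 36.8, 40.2],  # ratio level
})

# Nominal/ordinal level: impute with the most common value (the mode).
# (sklearn's KNNImputer is the nearest-neighbor alternative, predicting
# the missing value from the other, numerical features.)
mode_imputer = SimpleImputer(strategy="most_frequent")
df["blood_type"] = mode_imputer.fit_transform(df[["blood_type"]]).ravel()

# Interval/ratio level: with an outlier (40.2) present, prefer the
# median over the arithmetic mean.
median_imputer = SimpleImputer(strategy="median")
df["temperature"] = median_imputer.fit_transform(df[["temperature"]]).ravel()

print(df)
```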

FEB647 When to use feature improvement: missing values and outliers both cause problems for ML models.

  1. Missing values in a feature make it unusable by an ML model
  2. Outliers in a feature degrade an ML model's performance

We want to perform feature improvement when:

  • Features that we wish to use are unusable by an ML model (they have missing values, for example)
  • Features have outrageous outliers that may affect the performance of our ML model

  • Privacy is also a kind of "feature defect" that degrades ML model performance; it comes down to how we handle it.

Task 2: Feature Construction

FEB681 Feature construction: transform existing features, join new data sources

  1. Directly construct new features
  2. Either by transforming existing features
  3. Or by joining in new data sources

Feature construction is all about manually creating new features by directly transforming existing features or joining the original data with data from a new source (figure 2.9).


  • Very much an engineering exercise!

FEB628 Feature construction gains more signal for the ML task

  1. The original dataset does not carry enough signal for the ML task
  2. Transformed features can provide more information
  3. Qualitative variables can be mapped into quantitative features

We want to perform feature construction when:

  • Our original dataset does not have enough signal in it to perform our ML task
  • A transformed version of one feature has more signal than its original counterpart (we will see an example of this in our healthcare case study)
  • We need to map qualitative variables into quantitative features

  • We can really get more "signal" just by transforming features? I wonder how this is actually done.
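One concrete way the transformed-feature idea plays out, sketched in pandas; the health-style columns below are invented for illustration, not taken from the book:

```python
import pandas as pd

# Hypothetical data; column names are invented for illustration.
df = pd.DataFrame({
    "weight_kg": [70.0, 55.0, 90.0],
    "height_m": [1.75, 1.60, 1.80],
    "smoker": ["never", "daily", "sometimes"],
})

# Construct a new interpretable feature by dividing existing ones:
# a BMI-style ratio can carry more signal than weight or height alone.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Map a qualitative (ordinal) variable onto a quantitative scale.
smoking_level = {"never": 0, "sometimes": 1, "daily": 2}
df["smoker_level"] = df["smoker"].map(smoking_level)

print(df[["bmi", "smoker_level"]])
```

Both new columns are still human-interpretable, which is what distinguishes feature construction from feature extraction and learning.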

FEB629 Feature construction relies on domain knowledge to produce meaningful features that strengthen the dataset's signal for the ML task.

  • Feature construction relies on domain knowledge
  • With a deep understanding of the concrete problem, we can craft meaningful features, add signal, and complete the ML task.

Feature construction is often laborious and time-consuming as it is the type of feature engineering that demands the most domain knowledge. It is virtually impossible to hand-craft features without a deep understanding of the underlying problem domain.

Task 3: Feature Selection

FEB624 Feature selection picks the best features, lowering model dimensionality and reducing confounding features to avoid performance loss.

  1. Not all features help the ML task
  2. Feature selection: from the existing feature set, pick the best features, reducing both the number of features and the correlation among them.
  3. When features are correlated, "confounding features" appear and degrade ML performance.

Not all features are equally useful in an ML task. Feature selection involves picking and choosing the best features from an existing set of features to reduce both the total number of features that the model needs to learn from as well as the chance that we encounter a case where features are dependent on one another (figure 2.11). If the latter occurs, we are faced with possibly confounding features in our model which often leads to poorer overall performance.


  • Makes sense! A large set of strongly correlated features is hard to interpret.
  • But aren't the features created by neural networks intuitively strongly correlated with one another?
  • How strong does the correlation have to be before it becomes a problem?

FEB625 Feature selection avoids the curse of dimensionality and dependent features, and speeds up training

  1. Curse of dimensionality: too many columns relative to the available rows
  2. Dependent features: violate the feature-independence assumption of many ML models
  3. Faster training: lower dimensionality shortens the time needed to train the ML model

We want to perform feature selection when:

  • We are face to face with the curse of dimensionality and we have too many columns to properly represent the number of observations in our dataset
  • Features exhibit dependence amongst each other. If features are dependent on one another, then we are violating a common assumption in ML that our features are independent
  • The speed of our ML model is important. Reducing the number of features our ML model has to look at generally reduces complexity and increases the speed of the overall pipeline

  • Well argued!
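The two techniques named in the summary table (hypothesis testing and Recursive Feature Elimination) can be sketched with scikit-learn; the iris dataset is a stand-in for a real problem:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 150 rows, 4 features

# Hypothesis testing: keep the k features whose ANOVA F-statistic
# shows the strongest relationship to the target.
kbest = SelectKBest(score_func=f_classif, k=2)
X_kbest = kbest.fit_transform(X, y)

# Recursive Feature Elimination: repeatedly fit a model and drop the
# weakest feature until only n_features_to_select remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

print(X_kbest.shape, X_rfe.shape)
```

Either way, the surviving columns are a subset of the originals, so interpretability is preserved.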

Task 4: Feature Extraction

FEB


Feature extraction automatically creates new features based on making assumptions about the underlying shape of the data. Examples of this include applying linear algebra techniques to perform Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). We will cover these concepts in our NLP case study. The key here is that any algorithm that fits under feature extraction is making an assumption about the data that, if untrue, may render the resulting dataset less useful than its original form.

FEB


A common feature extraction technique involves learning a vocabulary of words and transforming raw text into a vector of word counts where each feature represents a token (usually a word or phrase) and the values represent how often that token appears in the text. This multi-hot-encoding of text is often referred to as a bag-of-words model and has many advantages including ease of implementation and yielding interpretable features (figure 2.12). We will be comparing this classic NLP technique to its more modern deep-learning-based feature learning model cousins in our NLP case study.

FEB


We want to perform feature extraction when:

  • We can make certain assumptions about our data and rely on fast mathematical transformations to discover new features (we will dive into these assumptions in future case studies)
  • We are working with unstructured data such as text, images, and videos
  • Like in feature selection, when we are dealing with too many features to be useful, feature extraction can help us reduce our overall dimensionality
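The PCA and SVD techniques named above can be sketched with scikit-learn; the iris dataset is a stand-in, and the comment spells out the parametric assumption the book warns about:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD

X, _ = load_iris(return_X_y=True)  # (150, 4)

# PCA assumes the directions of highest variance carry the signal;
# if that assumption is wrong, the new features may be less useful
# than the originals.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# TruncatedSVD works directly on (possibly sparse) matrices such as
# bag-of-words counts, without centering the data first.
svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X)

print(X_pca.shape, pca.explained_variance_ratio_.sum())
```

Unlike feature selection, the two output columns are linear combinations of all four inputs, so they no longer map back to individual original features.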

Task 5: Feature Learning

FEB


Feature learning–sometimes referred to as representation learning–is similar to feature extraction in that we are attempting to automatically generate a set of features from raw, unstructured data such as text, images, and videos. Feature learning is different however in that it is performed by applying a non-parametric (making no assumption about the shape of the original underlying data) deep learning model with the intention of automatically discovering a latent representation of the original data. Feature learning is an advanced type of feature engineering and we will see examples of this in the NLP and the Image case studies (figure 2.13).

FEB


Feature learning is often considered the alternative to manual feature engineering as it promises to discover features for us instead of us having to do so. Of course, there are downsides to this approach.

  • We need to set up a preliminary learning task to learn our representations, which could require a lot more data
  • The representation that is automatically learned may not be as good as the human-driven features
  • Features that are learned are often uninterpretable, as they are created by the machine with no regard for interpretability

Overall, we want to perform feature learning when:

  • We cannot make certain assumptions about our data like in feature extraction, and we are working with unstructured data such as text, images, and videos
  • Also, like in feature selection, feature learning can help us reduce our overall dimensionality, and it can also expand our dimensionality if necessary
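A small feature-learning sketch using scikit-learn's BernoulliRBM (one of the Restricted Boltzmann Machines listed in the summary table); the random binary data is purely illustrative:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
# Raw binary "pixel" data; no assumption is made about its shape.
X = (rng.random((100, 16)) > 0.5).astype(float)

# The RBM learns a latent representation: each sample is re-expressed
# by the activation probabilities of 4 hidden units.
rbm = BernoulliRBM(n_components=4, learning_rate=0.05,
                   n_iter=20, random_state=0)
H = rbm.fit_transform(X)

print(H.shape)  # 100 samples, 4 learned features
```

The four learned columns have no intrinsic meaning, illustrating the interpretability cost discussed above.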

The Four Feature Engineering Measures

Measure 1: Machine Learning Metrics

FEB


Comparison to baseline machine learning metrics is likely the most straightforward method; this entails looking at model performance before and after applying feature engineering methods to the data. The steps are:

  1. Get a baseline performance of the machine learning model we are planning to use before applying any feature engineering
  2. Perform feature engineering on the data
  3. Get a new performance metric from the machine learning model and compare it to the metrics obtained in the first step, adopting the change only if the ROI surpasses some threshold defined by the data scientist. Note that ROI here should take into account both the delta in model performance and the ease of feature engineering. For example, whether or not paying a third-party data platform to augment our data for a gain of 0.5% accuracy on our validation set is worth it is entirely up to the model stakeholders.
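The three steps above can be sketched as follows; scikit-learn, the breast-cancer dataset, and the choice of standardization as the feature engineering step are all my own assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Step 1: baseline performance before any feature engineering.
baseline = cross_val_score(model, X, y, cv=5).mean()

# Step 2: apply a feature engineering step (here: standardization).
engineered = make_pipeline(StandardScaler(), model)

# Step 3: new performance metric, compared against the baseline;
# whether the delta justifies the effort is up to the stakeholders.
improved = cross_val_score(engineered, X, y, cv=5).mean()

print(f"baseline={baseline:.3f}, engineered={improved:.3f}")
```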

FEB


METRICS GALORE: Supervised metrics such as precision and recall (which we will cover in the first case study) are just a few of many metrics that we can use to gauge how well the model is doing. We can also rely on unsupervised metrics like the Davies-Bouldin Index for clustering but this will not be the focus of any of our case studies in this book.

Measure 2: Interpretability

FEB


Data scientists and other model stakeholders should care deeply about pipeline interpretability as it can impact both business and engineering decisions. Interpretability can be defined as how well we can ask our model “why” it made the decision it did and tie that decision back to individual features or groups of features that were most relevant in making the model’s decision.

FEB


Imagine we are a data scientist building an ML model that predicts a user's probability of being a spamming bot. We can build a model using features like the speed of clicks, etc. When our model is in production, we run the risk of seeing some false positives and kicking people off of our site when our model thinks they are bots. In order to be transparent with them, we would want our model to have a level of interpretability so that we could diagnose which features the model thinks are the most important in making this prediction and redesign the model if necessary. The choice of feature engineering procedure can greatly increase or severely hinder our ability to explain how or why the model is performing the way it is. Feature improvement, construction, and selection will often help us gain insight into model performance, while feature learning and feature extraction techniques will often lessen the transparency of the machine learning pipeline.

Measure 3: Fairness and Bias

FEB


Models must be evaluated against fairness criteria to make sure that they are not generating predictions based on biases inherent in the data. This is especially true in domains of high impact to individuals, such as financial loan-granting systems, recognition algorithms, fraud detection, and academic performance prediction. In the same 2020 data science survey, over half of respondents said they had implemented or were planning to implement a solution to make models more explainable (interpretable), while only 38% of respondents said the same about fairness and bias mitigation. AI and machine learning models are prone to exploiting biases found in data and scaling them up to a degree that can become harmful to those the data is biased against. Proper feature engineering can expose certain biases and help reduce them at model training time.

Measure 4: ML Complexity and Speed

FEB


Often an afterthought, machine learning pipeline complexity, size, and speed can sometimes make or break a deployment. As we mentioned before, sometimes data scientists will turn to large learning algorithms like neural networks or ensemble models in lieu of proper feature engineering in the hopes that the model will “figure it out” for itself. These models have the downside of being large in memory and being slow to train and sometimes slow to predict. Most data scientists have at least one story about how after weeks of data wrangling, model training, and intense evaluation it was revealed that the model wasn’t able to generate predictions fast enough or was taking up too much memory to be considered “production-ready”. Techniques such as dimension reduction (a school of feature engineering under feature extraction and feature learning) can play a big part here. By reducing the size of the data, we can expect a reduction in the size of our models and an improvement in model speed.

Postscript

2022.02.11. 紫蕊, West Lafayette, Indiana, USA.


Version Date Summary
0.1 2022-02-11 Summarized the first three tasks! More to be filled in later.

Copyright

CC BY-NC-ND 4.0

Comments