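For small datasets and traditional machine learning, scikit-learn is an excellent framework. Its greatest strength is a convenient, uniform API that turns data preprocessing into a few simple calls.
This post collects the data-processing code I use most often in the framework, with short explanations, and doubles as a review for myself.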
Standardization rescales each column of the data to a mean of 0 and a standard deviation of 1.
from sklearn.preprocessing import scale
# Standardize each column; data is a NumPy array
col_standardized = scale(data)
Create a function to do the preprocessing:
def standardize_data(data):
    scaled_data = scale(data)
    return scaled_data
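As a quick sanity check of the mean-0, standard-deviation-1 claim, here is a minimal sketch. The array `data` is made up for illustration and is not from the original post.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical toy data: 4 samples, 3 features on very different scales
data = np.array([[1.0, 10.0, 100.0],
                 [2.0, 20.0, 300.0],
                 [3.0, 30.0, 200.0],
                 [4.0, 40.0, 400.0]])

scaled = scale(data)  # each column is standardized independently

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)  # population standard deviation (ddof=0)

print(col_means.round(6))  # every column mean is (numerically) zero
print(col_stds.round(6))   # every column standard deviation is one
```

When the same scaling has to be reapplied to unseen data later, the `StandardScaler` class from the same module offers the `fit` / `transform` interface instead of a one-shot function.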
Min-max scaling rescales the data into a given range.
from sklearn.preprocessing import MinMaxScaler
default_scaler = MinMaxScaler() # the default range is [0,1]
transformed = default_scaler.fit_transform(data)
custom_scaler = MinMaxScaler(feature_range=(-2, 3))
transformed = custom_scaler.fit_transform(data)
fit and transform can also be called separately.
from sklearn.preprocessing import MinMaxScaler
default_scaler = MinMaxScaler() # the default range is [0,1]
# fit and transform in one step, on the same data
transformed = default_scaler.fit_transform(new_data)
default_scaler = MinMaxScaler() # new instance
default_scaler.fit(data) # learn the per-column min and max from data
transformed = default_scaler.transform(new_data) # apply them to new_data
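To see what separating the two calls changes, here is a small sketch with made-up arrays (`data` and `new_data` here are hypothetical values, not the ones used in the post). Because the scaler keeps the minimum and maximum it learned from `data`, values in `new_data` that fall outside that range end up outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data used only for fitting
data = np.array([[0.0], [5.0], [10.0]])
# Hypothetical new data; 20.0 lies outside the fitted range
new_data = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
scaler.fit(data)                          # learns min=0 and max=10 from data
transformed = scaler.transform(new_data)  # reuses those statistics

print(scaler.data_min_, scaler.data_max_)
print(transformed)  # 2.5 maps to 0.25, 20.0 maps to 2.0
```

This is the usual reason for keeping the calls apart: fit on the training data once, then apply the same transformation to validation or test data.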
Create a function to do the preprocessing:
def ranged_data(data, value_range):
    min_max_scaler = MinMaxScaler(feature_range=value_range)
    scaled_data = min_max_scaler.fit_transform(data)
    return scaled_data
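Robust scaling is a technique from data processing and statistical analysis that adjusts the scale of the data while limiting the influence of outliers, which makes the data better suited to modeling or analysis.
A dataset often contains outliers or extreme values that can distort a statistical analysis or a model. Robust scaling is meant to reduce that effect so that the resulting model is less sensitive to outliers.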
from sklearn.preprocessing import RobustScaler
robust_scaler = RobustScaler()
transformed = robust_scaler.fit_transform(data)
Create a function to do the preprocessing:
def robust_scaling(data):
    robust_scaler = RobustScaler()
    scaled_data = robust_scaler.fit_transform(data)
    return scaled_data
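With its default settings, `RobustScaler` subtracts the column median and divides by the interquartile range (75th minus 25th percentile). The sketch below reproduces that by hand on made-up data with one obvious outlier. The array and the manual formula are illustrative assumptions layered on the defaults.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-feature data; 100.0 is a deliberate outlier
data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit_transform(data)

# Manual version of the default behaviour: (x - median) / IQR
median = np.median(data, axis=0)
q1, q3 = np.percentile(data, [25, 75], axis=0)
manual = (data - median) / (q3 - q1)

print(robust.ravel())
print(manual.ravel())
```

Because the median and the quartiles barely move when a single extreme value is added, the four ordinary points keep a sensible spread, whereas a mean-and-variance based scaler would squeeze them together.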
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
transformed = normalizer.fit_transform(data)
Create a function to do the preprocessing:
def normalize_data(data):
    normalizer = Normalizer()
    norm_data = normalizer.fit_transform(data)
    return norm_data
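Unlike the scalers above, `Normalizer` works on rows rather than columns: with the default `norm='l2'`, each sample is divided by its own Euclidean length. A minimal sketch with made-up data shows the effect.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: each row is one sample
data = np.array([[3.0, 4.0],
                 [1.0, 0.0],
                 [6.0, 8.0]])

norm_data = Normalizer().fit_transform(data)  # default norm is 'l2'

# After normalization every row has unit Euclidean length
row_norms = np.linalg.norm(norm_data, axis=1)

print(norm_data)
print(row_norms)
```

Rows that point in the same direction, such as the first and third here, become identical after normalization, since only the direction of each sample is kept.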
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer()
transformed = imp_mean.fit_transform(data)
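By default, missing values are filled with the column mean. The other options are the median, the most frequent value, and a constant, all selected through the strategy parameter.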
from sklearn.impute import SimpleImputer
imp_median = SimpleImputer(strategy='median')
transformed = imp_median.fit_transform(data)
imp_frequent = SimpleImputer(strategy='most_frequent')
transformed = imp_frequent.fit_transform(data)
from sklearn.impute import SimpleImputer
imp_constant = SimpleImputer(strategy='constant',
                             fill_value=-1)
transformed = imp_constant.fit_transform(data)
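The strategies are easiest to compare side by side on a small array that actually contains missing values. The data below is made up for illustration, with `np.nan` marking the gaps.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
data = np.array([[1.0, 2.0],
                 [np.nan, 4.0],
                 [2.0, np.nan],
                 [9.0, 6.0]])

mean_filled = SimpleImputer().fit_transform(data)  # strategy='mean' by default
median_filled = SimpleImputer(strategy='median').fit_transform(data)
const_filled = SimpleImputer(strategy='constant',
                             fill_value=-1).fit_transform(data)

# Column 0 has observed values 1, 2, 9: mean 4, median 2
print(mean_filled[1, 0], median_filled[1, 0], const_filled[1, 0])
```

The statistics are computed per column from the observed values only, so the outlier 9.0 pulls the mean fill up to 4.0 while the median fill stays at 2.0.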
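Imputation is not limited to these four strategies.
More advanced methods exist, such as k-Nearest Neighbors (fills missing values based on kNN similarity scores) and MICE (Multiple Imputation by Chained Equations, which runs several chained imputations and assumes the missing values are randomly distributed across the observations).
In most industry settings these advanced methods are unnecessary, because the data has either been fully cleaned already or has very few missing values. They can be useful for open-source datasets, which tend to be less complete.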
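For reference, scikit-learn ships a `KNNImputer`, and an experimental `IterativeImputer` that has to be enabled through an explicit import. The sketch below uses `IterativeImputer` as a stand-in for a MICE-style approach, which is an assumption on my part rather than something the post prescribes. The data and the parameter choices (`n_neighbors=2`, `random_state=0`) are made up for illustration.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

# Hypothetical data with two missing entries
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, np.nan, 5.0],
                 [3.0, 6.0, 7.0],
                 [4.0, 8.0, np.nan]])

# Fill each gap from the 2 most similar rows
knn_filled = KNNImputer(n_neighbors=2).fit_transform(data)

# Model each feature with missing values as a function of the others
iter_filled = IterativeImputer(random_state=0).fit_transform(data)

print(knn_filled)
print(iter_filled)
```

Both imputers leave the observed entries untouched and only replace the `np.nan` positions. On a dataset this small `IterativeImputer` may emit a convergence warning, which is harmless for the purpose of this sketch.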
from sklearn.decomposition import PCA
pca_obj = PCA() # n_components unset: all components are kept, i.e. min(n_samples, n_features)
pc = pca_obj.fit_transform(data).round(3)
pca_obj = PCA(n_components=3)
pc = pca_obj.fit_transform(data).round(3)
pca_obj = PCA(n_components=2)
pc = pca_obj.fit_transform(data).round(3)
Create a function to do the preprocessing:
def pca_data(data, n_components):
    pca_obj = PCA(n_components=n_components)
    component_data = pca_obj.fit_transform(data)
    return component_data
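Two things are worth checking when choosing `n_components`: how many components PCA keeps when the argument is left unset, and how much variance a reduced projection retains. The sketch below uses random arrays (made up for illustration) and the fitted attributes `n_components_` and `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
wide = rng.normal(size=(4, 5))    # hypothetical data: fewer rows than columns
tall = rng.normal(size=(50, 5))   # hypothetical data: more rows than columns

# With n_components unset, PCA keeps min(n_samples, n_features) components
wide_pca = PCA().fit(wide)
tall_pca = PCA().fit(tall)
print(wide_pca.n_components_, tall_pca.n_components_)

# Keep only 2 components and inspect how much variance they explain
pca_2 = PCA(n_components=2)
component_data = pca_2.fit_transform(tall)
print(component_data.shape)
print(pca_2.explained_variance_ratio_)
```

The sum of `explained_variance_ratio_` gives a rough sense of how much information the reduced representation keeps. On pure random noise like this it will be modest, while on correlated real data the first few components usually account for far more.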
Split the dimension-reduced data by class label:
def get_label_info(component_data, labels,
                   label, label_names):
    # Return the name of one class and the rows that belong to it
    label_name = label_names[label]
    label_data = component_data[labels == label]
    return (label_name, label_data)
def separate_data(component_data, labels,
                  label_names):
    # Build one (label_name, label_data) tuple per class
    separated_data = []
    for label in range(len(label_names)):
        separated_data.append(get_label_info(component_data, labels, label, label_names))
    return separated_data
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
bc = load_breast_cancer()
pca_obj = PCA(n_components=2)
component_data = pca_obj.fit_transform(bc.data)
labels = bc.target
label_names = bc.target_names
# Using the completed separate_data function
separated_data = separate_data(component_data,
                               labels, label_names)
# Plotting the data
import matplotlib.pyplot as plt
for label_name, label_data in separated_data:
    col1 = label_data[:, 0] # 1st column (1st pr. comp.)
    col2 = label_data[:, 1] # 2nd column (2nd pr. comp.)
    plt.scatter(col1, col2, label=label_name) # scatterplot
plt.legend() # adds legend to plot
plt.title('Breast Cancer Dataset PCA Plot')
plt.show()