Applications of Clustering#

Using Clustering for Preprocessing#

"""
simple MNIST-like datasets containing 1797 grayscale 8*8 images
"""
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits)
from sklearn.linear_model import LogisticRegression

# vanilla model: logistic regression on the raw pixel values
log_reg = LogisticRegression()
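
To get a point of comparison, we can fit this baseline and check its test accuracy (a minimal sketch; the exact score depends on the random split):

log_reg.fit(X_train, y_train)
# raw pixel features may raise a ConvergenceWarning; increasing max_iter helps
print("baseline accuracy:", log_reg.score(X_test, y_test))
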
"""
Instead of vanilla model
We can create a pipeline that will first cluster the training set into 50 clusters, 
then replace the images with their distances to these 50 clusters.
"""
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50)),
    ("log_reg", LogisticRegression()),
])
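
Fitting the pipeline calls KMeans.fit_transform on the training images, so the classifier is trained on 50-dimensional distance vectors instead of raw pixels (a minimal sketch):

pipeline.fit(X_train, y_train)
print("pipeline accuracy:", pipeline.score(X_test, y_test))
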
from sklearn.model_selection import GridSearchCV

# search for the optimal number of clusters
param_grid = dict(kmeans__n_clusters=range(2, 100, 2))
# verbose=2: log each fit as it runs
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
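
Running the search fits one pipeline per candidate value and fold: with 49 candidates and cv=3 that is 147 fits, so it can take a while (a minimal sketch):

grid_clf.fit(X_train, y_train)
print(grid_clf.best_params_)  # the best number of clusters found
# refit=True (the default) retrains the best pipeline on the full training set
print("test accuracy:", grid_clf.score(X_test, y_test))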

Using Clustering for Semi-Supervised Learning#

Semi-supervised learning applies when we have plenty of unlabeled instances and very few labeled ones.

For example, suppose we can afford to label only 50 instances. Instead of choosing them at random, we can:

1. Cluster the training set into 50 clusters.

2. For each cluster, find the image closest to the centroid, and label only these 50 representative images (a sketch follows this list).
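
Below is a minimal sketch of both steps. It assumes we "label" the 50 representatives by looking them up in y_train; in a real project these labels would come from a human annotator, and all variable names here are illustrative:

import numpy as np

k = 50
kmeans = KMeans(n_clusters=k)
# fit_transform returns each instance's distance to every centroid
X_digits_dist = kmeans.fit_transform(X_train)
# index of the training image closest to each centroid
representative_idx = np.argmin(X_digits_dist, axis=0)
X_representative = X_train[representative_idx]
y_representative = y_train[representative_idx]  # the 50 "manual" labels
# train on just these 50 labeled instances
log_reg_repr = LogisticRegression()
log_reg_repr.fit(X_representative, y_representative)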

Even better, we can propagate each representative's label to the 20% of instances in its cluster that are closest to the centroid!
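
A minimal sketch of this partial propagation, reusing kmeans, X_digits_dist, y_representative, and k from the sketch above (again, illustrative names, not a fixed API):

# give every instance the label of its cluster's representative
y_train_propagated = y_representative[kmeans.labels_]

percentile_closest = 20
# distance from each instance to its own cluster's centroid
X_cluster_dist = X_digits_dist[np.arange(len(X_train)), kmeans.labels_]
keep = np.zeros(len(X_train), dtype=bool)
for i in range(k):
    in_cluster = (kmeans.labels_ == i)
    # keep only the ~20% of each cluster that is closest to its centroid
    cutoff = np.percentile(X_cluster_dist[in_cluster], percentile_closest)
    keep |= in_cluster & (X_cluster_dist <= cutoff)

log_reg_prop = LogisticRegression()
log_reg_prop.fit(X_train[keep], y_train_propagated[keep])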