[Light Machine Learning] Session 12. Sequential Feature Selection Algorithms and Using Random Forest Feature Importance

2020. 2. 16. 00:56

In this session we look at how to perform sequential feature selection and how to use the feature importance values of a random forest. Let's get started!

1. Sequential feature selection algorithms

 Session 11 introduced ways to reduce model complexity. Another option, especially useful for models that do not support regularization, is dimensionality reduction via feature selection. Dimensionality reduction techniques fall into two main categories: feature selection and feature extraction.

 Feature selection picks a subset of the original features, whereas feature extraction builds new features from information derived from the original ones. The goal of feature selection is to automatically choose the subset of features most relevant to the problem. This session covers one family of feature selection methods, the sequential feature selection algorithms. These are greedy search algorithms that reduce an initial d-dimensional feature space to a k-dimensional subspace, where k < d.

 The classic sequential feature selection algorithm is sequential backward selection (SBS). SBS reduces dimensionality by shrinking the initial feature space to a subspace of the original features.

 SBS removes features from the full feature set one at a time until the remaining feature subspace contains the desired number of features. To decide which feature to remove at each step, we need a criterion function J. A simple choice for J is the model's performance after a feature is removed compared with before, so at each step we remove the feature whose removal gives the highest J, that is, the one whose removal costs the least performance. The algorithm can be summarized in four steps.

  1. Initialize the algorithm with k = d, where d is the dimensionality of the full feature space X_d.
  2. Determine the feature x' that maximizes the criterion: x' = argmax J(X_k - x), where x is in X_k.
  3. Remove the feature x' from the feature set: X_(k-1) = X_k - x', then k = k - 1.
  4. Terminate if k equals the desired number of features; otherwise go back to step 2.
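
 The four steps above can be sketched in plain Python without any model training. This is only an illustration: the function name sbs, the feature names, and the toy criterion (a fixed "usefulness" per feature, summed over the subset) are made up here, and a real criterion J would fit and score a model instead.

```python
from itertools import combinations


def sbs(features, k_target, criterion):
    """Greedy sequential backward selection over a tuple of feature names."""
    current = tuple(features)          # step 1: start with all d features
    history = [(current, criterion(current))]
    while len(current) > k_target:     # step 4: stop once k features remain
        # step 2: score every subset obtained by dropping exactly one feature
        candidates = list(combinations(current, len(current) - 1))
        scores = [criterion(c) for c in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        # step 3: keep the best-scoring subset, i.e. remove one feature
        current = candidates[best]
        history.append((current, scores[best]))
    return current, history


# Hypothetical criterion: each feature has a fixed usefulness and a subset's
# score is the sum of its members (a real J would train and score a model).
usefulness = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.05}


def toy_criterion(subset):
    return sum(usefulness[f] for f in subset)


selected, history = sbs(("a", "b", "c", "d"), k_target=2,
                        criterion=toy_criterion)
print(selected)  # ('a', 'c')
```

 With this toy criterion the least useful feature is dropped first ("d", then "b"), which is exactly the "remove the feature that costs the least" rule.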

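 At the time of writing, the SBS algorithm was not implemented in scikit-learn (since version 0.24, scikit-learn has provided sklearn.feature_selection.SequentialFeatureSelector, which supports backward selection). A direct implementation in Python looks like this.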

from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class SBS():
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=self.test_size,
                             random_state=self.random_state)

        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, 
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores = []
            subsets = []

            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train, 
                                         X_test, y_test, p)
                scores.append(score)
                subsets.append(p)

            # Keep the subset with the highest score and record it.
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1

            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]

        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score

 

 Here the k_features parameter is the desired number of features k. By default, accuracy_score is used to evaluate the model on each feature subset. Inside the while loop of the fit method, the feature subsets generated by itertools.combinations are evaluated and reduced until the subset has the desired dimensionality. In each iteration, the accuracy of the best subset on the internally created X_test is appended to the self.scores_ list; we use these scores later to evaluate the results. The column indices of the final feature subset are assigned to self.indices_, which the transform method uses to return a new array containing only the selected features.
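
 To make the role of itertools.combinations concrete, here is a small sketch of what one pass of the loop generates, plus a count of how many model fits the greedy search needs. The helper sbs_evaluations is hypothetical, and d = 13 is an assumption based on the Wine dataset having 13 features.

```python
from itertools import combinations

# Dropping one feature from 4 indices yields 4 candidate subsets of size 3.
indices = (0, 1, 2, 3)
candidates = list(combinations(indices, r=len(indices) - 1))
print(candidates)  # [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]


def sbs_evaluations(d, k):
    """Number of model fits SBS performs going from d down to k features."""
    # One fit for the full set, then m fits at every size m = d, ..., k + 1.
    return 1 + sum(range(k + 1, d + 1))


# Assumption: d = 13 features reduced to k = 1.
greedy = sbs_evaluations(13, 1)
exhaustive = 2 ** 13 - 1   # every non-empty subset of 13 features
print(greedy, exhaustive)  # 91 8191
```

 The greedy search is far cheaper than trying every subset, at the price of not guaranteeing the globally best subset.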

 Now let's see it in action with scikit-learn's KNN classifier.

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

# Select features.
# X_train_std and y_train are the standardized training data,
# assumed to be prepared as in Session 11.
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)

# Plot the performance of the feature subsets.
k_feat = [len(k) for k in sbs.subsets_]

plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.02])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.tight_layout()
plt.show()

 

 Although SBS splits the dataset into training and test sets inside fit, the code above still feeds it only the X_train data. The test set that the SBS fit method creates internally is therefore also called the validation set. For this to work, the training data and the original test data must already have been separated beforehand, so that the original test set never takes part in feature selection.

 Because SBS stores the score of the best feature subset at each step, running this code shows the KNN classifier's accuracy computed on the validation set.

 As the plot produced by the code above shows, the KNN classifier's accuracy on the validation set improves as the number of features is reduced. Next, let's evaluate the KNN classifier on the original test set.

knn.fit(X_train_std, y_train)
print('Training accuracy:', knn.score(X_train_std, y_train))
print('Test accuracy:', knn.score(X_test_std, y_test))

 The model reaches roughly 97% accuracy on the training set and roughly 96% on the test set. Now let's check the performance with the three selected features.

# Column indices of the three-feature subset found by SBS.
k3 = list(sbs.subsets_[k_feat.index(3)])
knn.fit(X_train_std[:, k3], y_train)
print('Training accuracy:', knn.score(X_train_std[:, k3], y_train))
print('Test accuracy:', knn.score(X_test_std[:, k3], y_test))

 Even though we used fewer than a quarter of the original features, the test accuracy can hardly be said to have dropped much. This suggests that these three features carry nearly as much discriminatory information as the original dataset.

 The Wine dataset is small to begin with, so the results are strongly affected by how the data is split into training and test sets and then again into training and validation sets.
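
 A quick back-of-the-envelope calculation shows why the splits matter so much. The sketch below assumes the 178 samples of the Wine dataset and a 30% outer test split (the outer split ratio is not stated in this post, so treat it as an assumption); split_sizes is a hypothetical helper that rounds the held-out part up.

```python
import math


def split_sizes(n_samples, held_out_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * held_out_fraction)
    return n_samples - n_held_out, n_held_out


# Assumptions: 178 Wine samples and a 30% outer test split.
n_train, n_test = split_sizes(178, 0.30)
# SBS then holds out test_size=0.25 of the training data for validation.
n_fit, n_val = split_sizes(n_train, 0.25)

print(n_train, n_test)  # 124 54
print(n_fit, n_val)     # 93 31
print(1 / n_val)        # accuracy change caused by one validation sample
```

 Under these assumptions a single misclassified validation sample shifts the validation accuracy by about three percentage points, so small differences between feature subsets should not be over-interpreted.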

 The takeaway is that reducing the number of features did not improve the KNN model's performance, but it did shrink the data, which gave us a simpler model.

 

2. Using random forest feature importance

 Remember the random forest that appeared briefly when an earlier session introduced ensembles? With a random forest, we can measure feature importance as the average impurity decrease computed from all the decision trees in the forest. In scikit-learn, after fitting a RandomForestClassifier, the importance values are available in the feature_importances_ attribute.
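
 As a simplified illustration of "impurity decrease", the sketch below computes the Gini impurity drop for one hypothetical split. The class counts are made up, and this is not scikit-learn's exact computation, which additionally accumulates such decreases over every node that uses a feature and averages them over all trees.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted


# Hypothetical two-class node with 40 samples per class, split on a feature.
decrease = impurity_decrease(parent=[40, 40], left=[30, 10], right=[10, 30])
print(decrease)  # 0.125
```

 A feature that produces large decreases like this in many trees ends up with a high importance value.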

from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=500,
                                random_state=1)

forest.fit(X_train, y_train)
importances = forest.feature_importances_

indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))

plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), 
        importances[indices],
        align='center')

plt.xticks(range(X_train.shape[1]), 
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()

 Running the code above prints each feature's rank by relative importance and plots the ranking as a bar chart. The importance values are normalized so that they sum to 1. Across the 500 decision trees, the most discriminative features are the top-ranked ones, from proline down to alcohol. Two of the top-ranked features in this plot are also among the three features chosen by the SBS algorithm implemented above.
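
 The normalization and ranking described above can be reproduced by hand. In this minimal sketch the feature names and raw scores are invented for illustration; only the idea (divide by the total, then sort in descending order) mirrors the code above.

```python
# Hypothetical raw importance scores; the names and numbers are made up.
raw = {"f0": 6.0, "f1": 1.0, "f2": 3.0}

# Normalize so that the importances sum to 1.
total = sum(raw.values())
normalized = {name: value / total for name, value in raw.items()}

# Descending order, the same idea as np.argsort(importances)[::-1].
ranking = sorted(normalized, key=normalized.get, reverse=True)

for rank, name in enumerate(ranking, start=1):
    print("%2d) %-*s %f" % (rank, 30, name, normalized[name]))
```

 Because the values are relative, an importance of 0.6 only means that the feature accounts for 60% of the total impurity decrease, not that it is "60% predictive".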

 If two or more features are highly correlated, a random forest may rank one of them very highly while failing to fully capture the information in the others. If you care only about predictive performance rather than interpreting the importance values, you can ignore this, but it is still worth knowing.

 scikit-learn's SelectFromModel selects features based on a user-specified threshold after the model has been fitted. It is useful when a RandomForestClassifier serves as a feature selector in a step of a Pipeline, which appears in a later session. The code below sets the threshold to 0.1, which reduces the dataset to the five most important features.

from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_selected = sfm.transform(X_train)
print('Number of features that meet this threshold criterion:', X_selected.shape[1])
# Number of features that meet this threshold criterion: 5

for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))
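
 Conceptually, the threshold-based selection above boils down to a simple filter. The sketch below uses hypothetical importance values (not the ones from the forest trained above) and keeps every feature whose importance is at least the threshold.

```python
# Hypothetical importances that sum to 1; not the values from the forest above.
importances_demo = {"f0": 0.35, "f1": 0.05, "f2": 0.25, "f3": 0.15, "f4": 0.20}
threshold = 0.1

# Keep a feature when its importance is at least the threshold.
selected_demo = [name for name, value in importances_demo.items()
                 if value >= threshold]
print(selected_demo)  # ['f0', 'f2', 'f3', 'f4']
```

 Raising the threshold keeps fewer features, so in practice it is a knob worth tuning rather than a fixed constant.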


So far we have covered sequential feature selection algorithms and random forest feature importance. These were mostly techniques for reducing the dimensionality or the size of the data, and the next three sessions will also deal with data compression through dimensionality reduction. I'll be back with PCA in the next session. See you then!
