Building Your First Machine Learning Model - Predicting Iris Species

In [1]:

import warnings
warnings.filterwarnings('ignore')

import pandas as pd

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [2]:

# Load the dataset
iris = load_iris()
iris_data = iris.data
iris_label = iris.target

print('iris target values : ', iris_label)
print('iris target names : ', iris.target_names)

iris target values :  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
iris target names :  ['setosa' 'versicolor' 'virginica']

In [3]:

iris_df = pd.DataFrame(data = iris_data, columns = iris.feature_names)
iris_df['label'] = iris.target
iris_df.head()

Out[3]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	label
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0
3	4.6	3.1	1.5	0.2	0
4	5.0	3.6	1.4	0.2	0

In [4]:

# Split into training data and test data

X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_label,
                                                    test_size = 0.2, random_state = 11)

In [5]:

# Train and predict with a decision tree, one of the machine learning classification algorithms

# Create the model
dt_clf = DecisionTreeClassifier(random_state = 11)

# Train the model
dt_clf.fit(X_train, y_train)

Out[5]:

DecisionTreeClassifier(random_state=11)


In [6]:

# Return predictions for the test dataset from the trained model

pred = dt_clf.predict(X_test)

y_test[:3] # labels of the test data
pred[:3] # predicted labels

Out[6]:

array([2, 2, 2])

Out[6]:

array([2, 2, 1])

In [7]:

# Measure accuracy

from sklearn.metrics import accuracy_score

print('Prediction accuracy : {0:.4f}'.format(accuracy_score(y_test, pred)))

Prediction accuracy : 0.9333

In [8]:

# Try predicting on new data

dt_clf.predict([[10.0, 10.0, 10.0, 10.0]])

Out[8]:

array([2])
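
The prediction above is an integer class code. As a minimal sketch (retraining the same model so the block runs on its own; the variable names here are illustrative, not from the original notebook), the code can be mapped back to a species name with `iris.target_names`, and `predict_proba()` shows how the tree scores each class. Note that the sample [10, 10, 10, 10] lies far outside the range of the training data, so the returned class should not be read as a meaningful prediction.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11)

clf = DecisionTreeClassifier(random_state=11)
clf.fit(X_train, y_train)

new_sample = [[10.0, 10.0, 10.0, 10.0]]   # same out-of-range sample as above
pred_code = clf.predict(new_sample)        # integer class code, shape (1,)
pred_name = iris.target_names[pred_code]   # look up the species name
proba = clf.predict_proba(new_sample)      # per-class probabilities, shape (1, 3)

print(pred_code, pred_name)
print(proba)
```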

Getting Familiar with the scikit-learn Framework

In [9]:

iris_data = load_iris()

type(iris_data)

Out[9]:

sklearn.utils._bunch.Bunch

In [10]:

keys = iris_data.keys()

print('Keys of the iris dataset', keys)

Keys of the iris dataset dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [11]:

print('type of feature_names : ', type(iris_data.feature_names))
print('length of feature_names : ' , len(iris_data.feature_names))
print(iris_data.feature_names)

print(' ')

print('type of target_names : ', type(iris_data.target_names))
print('length of target_names : ' , len(iris_data.target_names))
print(iris_data.target_names)

print(' ')

print('type of data : ', type(iris_data.data))
print('length of data : ' , len(iris_data.data))
print(iris_data.data[:3])

print(' ')

print('type of target : ', type(iris_data.target))
print('length of target : ' , len(iris_data.target))
print(iris_data.target[:3])

type of feature_names :  <class 'list'>
length of feature_names :  4
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
 
type of target_names :  <class 'numpy.ndarray'>
length of target_names :  3
['setosa' 'versicolor' 'virginica']
 
type of data :  <class 'numpy.ndarray'>
length of data :  150
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]
 
type of target :  <class 'numpy.ndarray'>
length of target :  150
[0 0 0]

Introduction to the Model Selection Module

Splitting Training/Test Datasets - train_test_split()

In [12]:

iris = load_iris() # data
dt_clf = DecisionTreeClassifier() # define the model

train_data = iris.data # feature data for training
train_label = iris.target # target data for training
dt_clf.fit(train_data, train_label) # train the model

# Predict on the training dataset
pred = dt_clf.predict(train_data) # predict on the same features used for training
pred_sc = accuracy_score(train_label, pred) # compute accuracy against the training targets

print('Prediction accuracy : ', pred_sc)

Out[12]:

DecisionTreeClassifier()

Prediction accuracy :  1.0

=> The accuracy is 1.0 because the model predicts on the very data it was trained on, so this score says nothing about performance on unseen data.

In [13]:

# Use train_test_split() to split the iris dataset into 30% test data and 70% training data

iris = load_iris()  # data
dt_clf = DecisionTreeClassifier() # define the model

# Split into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size = 0.3, random_state = 121)

dt_clf.fit(X_train, y_train) # train the model (training features and targets)
pred = dt_clf.predict(X_test) # predict from the features of the test dataset
pred_sc = accuracy_score(y_test, pred) # compute accuracy against the test dataset's targets

print('Prediction accuracy : {0:.4f} '.format(pred_sc))

Out[13]:

DecisionTreeClassifier()

Prediction accuracy : 0.9556
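
A plain random split does not guarantee that each class keeps the same share in the training and test sets. As a small sketch of one option not used in the original notebook, the `stratify` argument of `train_test_split()` keeps the class ratio identical in both parts. With 50 samples per class and a 30% test split, each class ends up with 35 training and 15 test samples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=iris.target asks for the same class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=121,
    stratify=iris.target)

train_counts = np.bincount(y_train)   # samples per class in the training set
test_counts = np.bincount(y_test)     # samples per class in the test set
print(train_counts, test_counts)
```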

Cross-Validation

K-fold cross-validation

In [33]:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

iris = load_iris()
features = iris.data
label = iris.target
dt = DecisionTreeClassifier(random_state = 156)

# Create a KFold object with 5 folds and a list to hold the accuracy of each fold
kfold = KFold(n_splits = 5)
cv_accuracy = []
print('Iris dataset size : ', features.shape[0])

Iris dataset size :  150

In [38]:

n_iter = 0

for train_index, test_index in kfold.split(features):
    # Select the training and validation data with the indices returned by kfold.split()
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]
    
    dt.fit(X_train, y_train)
    pred = dt.predict(X_test)
    n_iter += 1
    
    accuracy = np.round(accuracy_score(y_test, pred), 4)
    train_size = X_train.shape[0]
    test_size = X_test.shape[0]
    print('\n#{0} cross-validation accuracy : {1}, training data size : {2}, validation data size : {3}'.format(n_iter, accuracy, train_size, test_size))
    print('\n#{0} validation set indices : {1}'.format(n_iter, test_index))
    cv_accuracy.append(accuracy)
    
print('\n## Mean validation accuracy : ', np.mean(cv_accuracy))

#1 cross-validation accuracy : 1.0, training data size : 120, validation data size : 30

#1 validation set indices : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]

#2 cross-validation accuracy : 0.9667, training data size : 120, validation data size : 30

#2 validation set indices : [30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
 54 55 56 57 58 59]

#3 cross-validation accuracy : 0.8667, training data size : 120, validation data size : 30

#3 validation set indices : [60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
 84 85 86 87 88 89]

#4 cross-validation accuracy : 0.9333, training data size : 120, validation data size : 30

#4 validation set indices : [ 90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119]

#5 cross-validation accuracy : 0.7333, training data size : 120, validation data size : 30

#5 validation set indices : [120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
 138 139 140 141 142 143 144 145 146 147 148 149]

## Mean validation accuracy :  0.9

Stratified K-fold

In [39]:

iris = load_iris()
iris_df = pd.DataFrame(data = iris.data, columns = iris.feature_names)
iris_df['label'] = iris.target
iris_df['label'].value_counts()

Out[39]:

label
0    50
1    50
2    50
Name: count, dtype: int64

In [41]:

kfold = KFold(n_splits = 3)
n_iter = 0

for train_index, test_index in kfold.split(iris_df):
    n_iter += 1
    label_train = iris_df['label'].iloc[train_index]
    label_test = iris_df['label'].iloc[test_index]
    print('\n## Cross-validation : {0}'.format(n_iter))
    print('Training label distribution : \n', label_train.value_counts())
    print('Validation label distribution : \n', label_test.value_counts())

## Cross-validation : 1
Training label distribution : 
 label
1    50
2    50
Name: count, dtype: int64
Validation label distribution : 
 label
0    50
Name: count, dtype: int64

## Cross-validation : 2
Training label distribution : 
 label
0    50
2    50
Name: count, dtype: int64
Validation label distribution : 
 label
1    50
Name: count, dtype: int64

## Cross-validation : 3
Training label distribution : 
 label
0    50
1    50
Name: count, dtype: int64
Validation label distribution : 
 label
2    50
Name: count, dtype: int64

In [43]:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits = 3)
n_iter = 0

# Unlike KFold, split() also needs the labels so that it can preserve the class distribution
for train_index, test_index in skf.split(iris_df, iris_df['label']):
    n_iter += 1
    label_train = iris_df['label'].iloc[train_index]
    label_test = iris_df['label'].iloc[test_index]
    print('\n## Cross-validation : {0}'.format(n_iter))
    print('Training label distribution : \n', label_train.value_counts())
    print('Validation label distribution : \n', label_test.value_counts())

## Cross-validation : 1
Training label distribution : 
 label
2    34
0    33
1    33
Name: count, dtype: int64
Validation label distribution : 
 label
0    17
1    17
2    16
Name: count, dtype: int64

## Cross-validation : 2
Training label distribution : 
 label
1    34
0    33
2    33
Name: count, dtype: int64
Validation label distribution : 
 label
0    17
2    17
1    16
Name: count, dtype: int64

## Cross-validation : 3
Training label distribution : 
 label
0    34
1    33
2    33
Name: count, dtype: int64
Validation label distribution : 
 label
1    17
2    17
0    16
Name: count, dtype: int64

In [44]:

dt = DecisionTreeClassifier(random_state = 156)

skfold = StratifiedKFold(n_splits = 3)
n_iter = 0
cv_accuracy = []

for train_index, test_index in skfold.split(features, label):
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]
    
    dt.fit(X_train, y_train)
    pred = dt.predict(X_test)
    n_iter += 1
    
    accuracy = np.round(accuracy_score(y_test, pred), 4)
    train_size = X_train.shape[0]
    test_size = X_test.shape[0]
    print('\n#{0} cross-validation accuracy : {1}, training data size : {2}, validation data size : {3}'.format(n_iter, accuracy, train_size, test_size))
    print('\n#{0} validation set indices : {1}'.format(n_iter, test_index))
    cv_accuracy.append(accuracy)

print('\n## Accuracy per fold : ', np.round(cv_accuracy, 4))
print('\n## Mean validation accuracy : ', np.mean(cv_accuracy))

#1 cross-validation accuracy : 0.98, training data size : 100, validation data size : 50

#1 validation set indices : [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  50
  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66 100 101
 102 103 104 105 106 107 108 109 110 111 112 113 114 115]

#2 cross-validation accuracy : 0.94, training data size : 100, validation data size : 50

#2 validation set indices : [ 17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  67
  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82 116 117 118
 119 120 121 122 123 124 125 126 127 128 129 130 131 132]

#3 cross-validation accuracy : 0.98, training data size : 100, validation data size : 50

#3 validation set indices : [ 34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  83  84
  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 133 134 135
 136 137 138 139 140 141 142 143 144 145 146 147 148 149]

## Accuracy per fold :  [0.98 0.94 0.98]

## Mean validation accuracy :  0.9666666666666667

Simpler cross-validation - cross_val_score()

In [48]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, cross_validate

iris_data = load_iris()
dt = DecisionTreeClassifier(random_state = 156)

data = iris_data.data
label = iris_data.target

# For a classifier, an integer cv uses stratified K-fold splits internally
scores = cross_val_score(dt, data, label, scoring = 'accuracy', cv = 3)
print('Accuracy per fold : ', np.round(scores, 4))
print('Mean validation accuracy : ', np.round(np.mean(scores), 4))

Accuracy per fold :  [0.98 0.94 0.98]
Mean validation accuracy :  0.9667
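
The cell above imports `cross_validate` but only uses `cross_val_score()`. As a hedged sketch of how the two differ, `cross_validate()` accepts several metrics at once and returns a dict that also holds fit and score times. The choice of `f1_macro` as a second metric and `return_train_score=True` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
dt = DecisionTreeClassifier(random_state=156)

# With a list of metrics, the result keys are 'test_<metric>' and 'train_<metric>'
results = cross_validate(dt, iris.data, iris.target,
                         scoring=['accuracy', 'f1_macro'],
                         cv=3, return_train_score=True)

for key in sorted(results):
    print(key, np.round(results[key], 4))
```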

GridSearchCV - Cross-Validation and Hyperparameter Tuning in One Step

In [52]:

from sklearn.model_selection import GridSearchCV

iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, test_size = 0.2, random_state = 121)

dt = DecisionTreeClassifier()

parameters = {'max_depth' : [1, 2, 3],
              'min_samples_split' : [2, 3]}

In [53]:

grid_dtree = GridSearchCV(dt, param_grid = parameters, cv = 3, refit = True)

grid_dtree.fit(X_train, y_train)

scores_df = pd.DataFrame(grid_dtree.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score']]

Out[53]:

GridSearchCV(cv=3, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]})


Out[53]:

	params	mean_test_score	rank_test_score	split0_test_score	split1_test_score	split2_test_score
0	{'max_depth': 1, 'min_samples_split': 2}	0.700000	5	0.700	0.7	0.70
1	{'max_depth': 1, 'min_samples_split': 3}	0.700000	5	0.700	0.7	0.70
2	{'max_depth': 2, 'min_samples_split': 2}	0.958333	3	0.925	1.0	0.95
3	{'max_depth': 2, 'min_samples_split': 3}	0.958333	3	0.925	1.0	0.95
4	{'max_depth': 3, 'min_samples_split': 2}	0.975000	1	0.975	1.0	0.95
5	{'max_depth': 3, 'min_samples_split': 3}	0.975000	1	0.975	1.0	0.95

In [55]:

print('GridSearchCV best parameters : ', grid_dtree.best_params_)
print('GridSearchCV best accuracy : {0:.4f}'.format(grid_dtree.best_score_))

GridSearchCV best parameters :  {'max_depth': 3, 'min_samples_split': 2}
GridSearchCV best accuracy : 0.9750

In [57]:

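# Because refit = True, best_estimator_ has already been retrained on the full training set with the best parameters
estimator = grid_dtree.best_estimator_

pred = estimator.predict(X_test)
print('Test dataset accuracy : {0:.4f}'.format(accuracy_score(y_test, pred)))

Test dataset accuracy : 0.9667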
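
To make clearer what GridSearchCV automates, here is a rough sketch of the same search written by hand: loop over every parameter combination with `ParameterGrid`, score each one with 3-fold cross-validation, keep the best, and refit it on the full training set. This is an illustration of the idea, not how GridSearchCV is implemented internally. `random_state=0` on the tree is an assumption added for reproducibility, since the original cell uses an unseeded tree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=121)

param_grid = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}

best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):            # 3 x 2 = 6 combinations
    model = DecisionTreeClassifier(random_state=0, **params)  # seed is an assumption
    score = np.mean(cross_val_score(model, X_train, y_train, cv=3))
    if score > best_score:                           # the first best combination wins ties
        best_score, best_params = score, params

# Equivalent of refit=True: retrain on the whole training set with the best parameters
final_model = DecisionTreeClassifier(random_state=0, **best_params).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(best_params, round(best_score, 4))
print('test accuracy:', round(test_accuracy, 4))
```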

Data Preprocessing

Data Encoding

Label encoding

Converts categorical features into numeric codes

In [14]:

from sklearn.preprocessing import LabelEncoder

# Items: TV, refrigerator, microwave, computer, fan, mixer, mixer
items = ['TV', '냉장고', '전자레인지', '컴퓨터', '선풍기', '믹서', '믹서']

# Create a LabelEncoder object, then perform label encoding with fit() and transform()
encoder = LabelEncoder()
encoder.fit(items)

labels = encoder.transform(items)
print('Encoded values : ', labels)
print('Encoder classes : ', encoder.classes_)
print('Decoded original values : ', encoder.inverse_transform([4, 5, 2, 0, 1, 1, 3, 3])) # decode the encoded values back with inverse_transform()

Out[14]:

LabelEncoder()

Encoded values :  [0 1 4 5 3 2 2]
Encoder classes :  ['TV' '냉장고' '믹서' '선풍기' '전자레인지' '컴퓨터']
Decoded original values :  ['전자레인지' '컴퓨터' '믹서' 'TV' '냉장고' '냉장고' '선풍기' '선풍기']

One-hot encoding

Adds a new column for each distinct value of the feature, marking 1 only in the column that matches the row's value and 0 in all other columns

In [15]:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Items: TV, refrigerator, microwave, computer, fan, fan, mixer, mixer
items = ['TV', '냉장고', '전자레인지', '컴퓨터', '선풍기', '선풍기', '믹서', '믹서']

items = np.array(items).reshape(-1, 1) # convert to a 2-D ndarray
print(items)

oh_encoder = OneHotEncoder()
oh_encoder.fit(items)
oh_labels = oh_encoder.transform(items)

# OneHotEncoder returns a sparse matrix, so convert it to a dense array with toarray()
print('One-hot encoded data')
print(oh_labels.toarray())
print('\nShape of the one-hot encoded data')
print(oh_labels.shape)

[['TV']
 ['냉장고']
 ['전자레인지']
 ['컴퓨터']
 ['선풍기']
 ['선풍기']
 ['믹서']
 ['믹서']]

Out[15]:

OneHotEncoder()

One-hot encoded data
[[1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]]

Shape of the one-hot encoded data
(8, 6)

In [16]:

# Using get_dummies()

df = pd.DataFrame({'item' : ['TV', '냉장고', '전자레인지', '컴퓨터', '선풍기', '선풍기', '믹서', '믹서']})

pd.get_dummies(df)

Out[16]:

	item_TV	item_냉장고	item_믹서	item_선풍기	item_전자레인지	item_컴퓨터
0	True	False	False	False	False	False
1	False	True	False	False	False	False
2	False	False	False	False	True	False
3	False	False	False	False	False	True
4	False	False	False	True	False	False
5	False	False	False	True	False	False
6	False	False	True	False	False	False
7	False	False	True	False	False	False
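
The output above holds True/False values, which is what recent pandas versions return by default. If 0/1 integers are preferred, for instance to match the OneHotEncoder output, one option is to pass the `dtype` argument of `get_dummies()`. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'item': ['TV', '냉장고', '전자레인지', '컴퓨터',
                            '선풍기', '선풍기', '믹서', '믹서']})

# dtype=int yields 0/1 integer columns instead of booleans
dummies = pd.get_dummies(df, dtype=int)
print(dummies)
```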

Feature Scaling and Normalization

StandardScaler

In [18]:

from sklearn.datasets import load_iris

iris = load_iris()
iris_data = iris.data
iris_df = pd.DataFrame(data = iris_data, columns = iris.feature_names)

print('Mean of each feature')
print(iris_df.mean())
print('\nVariance of each feature')
print(iris_df.var())

Mean of each feature
sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
dtype: float64

Variance of each feature
sepal length (cm)    0.685694
sepal width (cm)     0.189979
petal length (cm)    3.116278
petal width (cm)     0.581006
dtype: float64

In [19]:

from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()
scaler.fit(iris_df)
iris_scaled = scaler.transform(iris_df)

# transform() returns the scaled dataset as a NumPy ndarray, so convert it back to a DataFrame
iris_df_scaled = pd.DataFrame(data = iris_scaled, columns = iris.feature_names)

print('Mean of each feature')
print(iris_df_scaled.mean())
print('\nVariance of each feature')
print(iris_df_scaled.var())

Out[19]:

StandardScaler()

Mean of each feature
sepal length (cm)   -1.690315e-15
sepal width (cm)    -1.842970e-15
petal length (cm)   -1.698641e-15
petal width (cm)    -1.409243e-15
dtype: float64

Variance of each feature
sepal length (cm)    1.006711
sepal width (cm)     1.006711
petal length (cm)    1.006711
petal width (cm)     1.006711
dtype: float64
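
The variances above are 1.006711 rather than exactly 1 because pandas `var()` computes the sample variance (ddof=1), while StandardScaler standardizes with the population standard deviation (ddof=0). With n = 150 samples, the ratio is n / (n - 1) = 150 / 149 ≈ 1.006711. A short check with NumPy:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

n = X.shape[0]
pop_var = X_scaled.var(axis=0, ddof=0)      # population variance, what the scaler normalizes
sample_var = X_scaled.var(axis=0, ddof=1)   # sample variance, what pandas var() reports

print(pop_var)
print(sample_var)
print(n / (n - 1))
```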

MinMaxScaler

In [20]:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(iris_df)
iris_scaled = scaler.transform(iris_df)

iris_df_scaled = pd.DataFrame(data = iris_scaled, columns = iris.feature_names)

print('Minimum of each feature')
print(iris_df_scaled.min())
print('Maximum of each feature')
print(iris_df_scaled.max())

Out[20]:

MinMaxScaler()

Minimum of each feature
sepal length (cm)    0.0
sepal width (cm)     0.0
petal length (cm)    0.0
petal width (cm)     0.0
dtype: float64
Maximum of each feature
sepal length (cm)    1.0
sepal width (cm)     1.0
petal length (cm)    1.0
petal width (cm)     1.0
dtype: float64
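
With its default feature_range of (0, 1), MinMaxScaler applies (x - min) / (max - min) to each column. The sketch below computes that formula by hand and compares it with the scaler's output; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Per-column minimum and maximum, as learned by fit()
col_min, col_max = X.min(axis=0), X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```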

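Caveats When Scaling Training and Test Data

After calling fit() and transform() on the training dataset with a scaler object, do not call fit() again on the test dataset; call only transform()

=> The test dataset is transformed using the result of fit() on the training dataset

The problem that arises when fit() is applied to the test data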

In [26]:

from sklearn.preprocessing import MinMaxScaler

# Training data 0 ~ 10, test data 0 ~ 5
train_array = np.arange(0, 11).reshape(-1, 1)
test_array = np.arange(0, 6).reshape(-1, 1)

scaler = MinMaxScaler()
scaler.fit(train_array) # min : 0, max : 10
train_scaled = scaler.transform(train_array) # scales train_array by 1/10

print('Original train_array data : ', np.round(train_array.reshape(-1), 2))
print('Scaled train_array data : ', np.round(train_scaled.reshape(-1), 2))

Out[26]:

MinMaxScaler()

Original train_array data :  [ 0  1  2  3  4  5  6  7  8  9 10]
Scaled train_array data :  [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]

In [28]:

scaler.fit(test_array) # refits on the test data: min : 0, max : 5
test_scaled = scaler.transform(test_array) # the original value 5 is now mapped to 1

print('Original test_array data : ', np.round(test_array.reshape(-1), 2))
print('Scaled test_array data : ', np.round(test_scaled.reshape(-1), 2))

Out[28]:

MinMaxScaler()

Original test_array data :  [0 1 2 3 4 5]
Scaled test_array data :  [0.  0.2 0.4 0.6 0.8 1. ]

=> The training and test data are now scaled inconsistently: the value 5 maps to 0.5 in the training data but to 1.0 in the test data

In [32]:

# Transform the test data with the transform() of the MinMaxScaler object that was fit() on the training dataset

scaler = MinMaxScaler()
scaler.fit(train_array)
train_scaled = scaler.transform(train_array)

print('Original train_array data : ', np.round(train_array.reshape(-1), 2))
print('Scaled train_array data : ', np.round(train_scaled.reshape(-1), 2))

test_scaled = scaler.transform(test_array)

print('\nOriginal test_array data : ', np.round(test_array.reshape(-1), 2))
print('Scaled test_array data : ', np.round(test_scaled.reshape(-1), 2))

Out[32]:

MinMaxScaler()

Original train_array data :  [ 0  1  2  3  4  5  6  7  8  9 10]
Scaled train_array data :  [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]

Original test_array data :  [0 1 2 3 4 5]
Scaled test_array data :  [0.  0.1 0.2 0.3 0.4 0.5]
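
One way to avoid fitting the scaler on test data by accident is to bundle the scaler and the model in a scikit-learn pipeline. This is a sketch of an alternative that the original notebook does not use: `fit()` on the pipeline fits the scaler on the training data only, and `score()` or `predict()` applies only `transform()` to the test data. The split and seed values below reuse those from the first example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11)

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=11))
pipe.fit(X_train, y_train)              # scaler is fit on X_train only
accuracy = pipe.score(X_test, y_test)   # X_test is only transformed, never fit

scaler = pipe.named_steps['minmaxscaler']
print(accuracy)
print(scaler.data_min_, scaler.data_max_)   # statistics learned from the training data
```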
