Required libraries¶

In [68]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
import matplotlib.ticker as ticker
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Binarizer

Loading the data¶

In [4]:

df = pd.read_csv('./data/diabetes.csv')

df.head(3)

Out[4]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	33.6	0.627	50	1
1	1	85	66	29	26.6	0.351	31	0
2	8	183	64	0	23.3	0.672	32	1

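'Pregnancies' : number of pregnancies

'Glucose' : glucose tolerance test result

'BloodPressure' : blood pressure

'SkinThickness' : triceps skinfold thickness (subcutaneous fat at the back of the upper arm)

'Insulin' : serum insulin

'BMI' : body mass index

'DiabetesPedigreeFunction' : diabetes pedigree function (a weighting of family history of diabetes)

'Age' : age

'Outcome' : class label (1 = Positive, 0 = Negative)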

Inspecting the data¶

In [6]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

In [8]:

df['Outcome'].value_counts()

Out[8]:

Outcome
0    500
1    268
Name: count, dtype: int64

=> Of the 768 records, 268 are Positive and 500 are Negative

Modeling¶

Splitting the data¶

In [12]:

df.columns

Out[12]:

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [16]:

# Extract the feature data set X and the label data set y

feature = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target = ['Outcome']

X = df[feature]
y = df[target] # one-column DataFrame; this is what triggers the DataConversionWarning when fitting

# Outcome has relatively more Negative rows, so stratify is used to keep the class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 156, stratify = y)
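
The effect of stratify can be checked on labels alone. The following is a minimal sketch, assuming synthetic labels with the same class counts as Outcome (500 Negative, 268 Positive); X_demo and y_demo are placeholder names that are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the Outcome column (assumption for illustration)
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature; its values do not matter here

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=156, stratify=y_demo
)

# With stratify, the share of Positive labels is nearly identical in every split
overall_ratio = y_demo.mean()
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(f"positive ratio -> overall: {overall_ratio:.3f}, train: {train_ratio:.3f}, test: {test_ratio:.3f}")
```

Without stratify the split is purely random, so the Positive share in a test set of this size can drift by a few percentage points.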

Model training, prediction, and evaluation¶

In [25]:

# Function that prints the evaluation metrics
def get_clf_eval(y_test, pred, pred_proba):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    auc = roc_auc_score(y_test, pred_proba) # AUC is computed from the probabilities, not from the 0/1 predictions

    print('Confusion matrix')
    print(confusion)
    print('Accuracy : {0:.4f}, Precision : {1:.4f}, Recall : {2:.4f}, F1 : {3:.4f}, AUC : {4:.4f}'.format(accuracy, precision, recall, f1, auc))

In [26]:

# Train, predict, and evaluate with logistic regression

lr = LogisticRegression(solver = 'liblinear') 
lr.fit(X_train, y_train)
pred = lr.predict(X_test) # predicted class labels
pred_proba = lr.predict_proba(X_test)[:, 1] # predicted probability of class 1

get_clf_eval(y_test, pred, pred_proba) # evaluation metrics function

C:\Users\mit005\anaconda3\Lib\site-packages\sklearn\utils\validation.py:1184: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

Out[26]:

LogisticRegression(solver='liblinear')


Confusion matrix
[[87 13]
 [22 32]]
Accuracy : 0.7727, Precision : 0.7111, Recall : 0.5926, F1 : 0.6465, AUC : 0.8083

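=> About 65% of the data is Negative, so accuracy alone says little about how well Positive cases are found; recall is treated as the important metric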
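
As a check on how the printed metrics relate to the confusion matrix, the values can be recomputed by hand. This is a small sketch using only the four counts from the matrix above, read in scikit-learn's layout [[TN, FP], [FN, TP]].

```python
# Counts taken from the baseline confusion matrix, laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 87, 13, 22, 32

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted Positives that are truly Positive
recall = tp / (tp + fn)                      # share of true Positives that the model finds
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
# prints: accuracy=0.7727, precision=0.7111, recall=0.5926, f1=0.6465
```

Recall of 0.5926 means that 22 of the 54 Positive cases in the test set are missed at the default threshold.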

Plotting precision and recall against the threshold

In [29]:

def precision_recall_curve_plot(y_test, pred_proba_c1):
    
    # Get the threshold ndarray and the precision / recall ndarrays that correspond to those thresholds
    precision, recalls, thresholds = precision_recall_curve(y_test, pred_proba_c1)
    
    # Plot the threshold on the x-axis and precision / recall on the y-axis; precision is drawn as a dashed line
    plt.figure(figsize = (8, 6))
    threshold_boundary = thresholds.shape[0] # precision and recalls have one more element than thresholds
    plt.plot(thresholds, precision[0:threshold_boundary], linestyle = '--', label = 'precision') # precision
    plt.plot(thresholds, recalls[0:threshold_boundary], label = 'recall') # recall
    
    # Change the x-axis (threshold) ticks to steps of 0.1
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1), 2))
    
    # Set the x-axis and y-axis labels, the legend, and the grid
    plt.xlabel('Threshold value')
    plt.ylabel('Precision and Recall value')
    plt.legend()
    plt.grid()

In [30]:

# Use the precision-recall curve to see how precision and recall change with the threshold

pred_proba_c1 = lr.predict_proba(X_test)[:, 1]
precision_recall_curve_plot(y_test, pred_proba_c1)

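=> Lowering the threshold to about 0.42 ~ 0.45 brings precision and recall roughly into balance

=> However, both values are then only around 0.7, which is low

=> So the data is cleaned up first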
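
The balance point can also be located numerically instead of being read off the plot. Below is a minimal sketch, assuming synthetic labels and scores (y_true and scores are made up for illustration and are not the notebook's predictions); it picks the threshold where the gap between precision and recall is smallest.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic labels and scores, for illustration only
rng = np.random.default_rng(0)
y_true = np.array([0] * 100 + [1] * 54)
scores = np.where(y_true == 1,
                  rng.normal(0.6, 0.2, y_true.size),
                  rng.normal(0.3, 0.2, y_true.size))
scores = np.clip(scores, 0.0, 1.0)

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions and recalls carry one extra trailing point with no threshold, so drop it
gap = np.abs(precisions[:-1] - recalls[:-1])
best = int(np.argmin(gap))
balanced_threshold = thresholds[best]

print(f"threshold={balanced_threshold:.3f}, precision={precisions[best]:.3f}, recall={recalls[best]:.3f}")
```

With the notebook's own data, y_test and pred_proba_c1 would take the place of y_true and scores.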

Data preprocessing¶

In [31]:

df.describe()

Out[31]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

=> Many features have a minimum of 0, a value that is not plausible for measurements such as Glucose, BloodPressure, or BMI

In [34]:

# Take a closer look at the 'Glucose' data

plt.hist(df['Glucose'], bins = 100);

=> Some values are 0

=> Check what percentage of all records these make up

In [41]:

df[df['Glucose'] == 0]['Glucose'].count()

Out[41]:

5

In [42]:

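# List of features that contain 0 values

zero_feature_lst = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Total number of records
tot_cnt = df['Glucose'].count()

# For each feature, count the records whose value is 0 and compute their ratio
# (the loop variable is named zero_feature so that it does not overwrite the feature list defined earlier)
for zero_feature in zero_feature_lst:
    zero_cnt = df[df[zero_feature] == 0][zero_feature].count()
    print('Number of zero values in {}: {}, ratio: {:.2f}%'.format(zero_feature, zero_cnt, zero_cnt / tot_cnt * 100))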

Number of zero values in Glucose: 5, ratio: 0.65%
Number of zero values in BloodPressure: 35, ratio: 4.56%
Number of zero values in SkinThickness: 227, ratio: 29.56%
Number of zero values in Insulin: 374, ratio: 48.70%
Number of zero values in BMI: 11, ratio: 1.43%
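
The same counts can be obtained without an explicit loop. This is a small sketch on a made-up toy frame (toy is a hypothetical name, and the rows are not taken from diabetes.csv), relying on the fact that comparing a DataFrame with 0 yields a boolean frame.

```python
import pandas as pd

# Toy data for illustration; not rows from the actual dataset
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

is_zero = toy == 0                 # boolean frame: True where the value is 0
zero_counts = is_zero.sum()        # True counts as 1, so sum gives the number of zeros per column
zero_ratio = is_zero.mean() * 100  # mean of booleans gives the share of zeros, here in percent

print(pd.DataFrame({"zero_count": zero_counts, "zero_ratio_pct": zero_ratio}))
```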

=> SkinThickness (29.56%) and Insulin (48.70%) contain by far the most zeros

=> That is too many rows to drop, so the zeros are replaced with the mean

In [45]:

# Replace the 0 values with the mean of each feature

mean_zero_features = df[zero_feature_lst].mean() # mean of each feature (the 0 values are still included in this mean)
mean_zero_features

# Replace 0 with the mean
df[zero_feature_lst] = df[zero_feature_lst].replace(0, mean_zero_features)

Out[45]:

Glucose          120.894531
BloodPressure     69.105469
SkinThickness     20.536458
Insulin           79.799479
BMI               31.992578
dtype: float64
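
The means shown above are computed with the zeros still in place, which pulls them down (strongly so for Insulin, where almost half the values are 0). One possible alternative, sketched here on a made-up toy frame and not what this notebook does, is to mark zeros as missing first so that the mean is taken over the observed values only.

```python
import numpy as np
import pandas as pd

# Toy data for illustration; not rows from the actual dataset
toy = pd.DataFrame({
    "Glucose": [100.0, 0.0, 140.0],
    "Insulin": [0.0, 0.0, 90.0],
})

mean_with_zeros = toy.mean()              # zeros drag the mean down

as_missing = toy.replace(0, np.nan)       # treat 0 as a missing measurement
mean_without_zeros = as_missing.mean()    # NaN is skipped by default
imputed = as_missing.fillna(mean_without_zeros)  # fill each column with its own mean

print(mean_with_zeros)
print(mean_without_zeros)
print(imputed)
```

Switching to this variant would change the imputed values, so the metrics reported in this post would no longer be reproduced exactly.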

Scaling¶

In [62]:

feature = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
target = ['Outcome']

X = df[feature]
y = df[target]

# Apply scaling to the feature data set (note that the scaler is fitted on all rows, before the train/test split)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 156, stratify = y)
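
Fitting the scaler on all rows before splitting lets the test rows influence the mean and standard deviation used for scaling. A common alternative is to split first and fit on the training part only. The following is a minimal sketch of that order, assuming synthetic data (X_demo, y_demo, and scaler_demo are placeholder names); it is not the procedure used for the results in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, for illustration only
rng = np.random.default_rng(156)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = np.array([0] * 130 + [1] * 70)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=156, stratify=y_demo
)

# ... then learn the scaling statistics from the training rows only
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The scaled test set will generally not have mean exactly 0 and standard deviation exactly 1, which is expected.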

Modeling after preprocessing¶

In [64]:

# Train, predict, and evaluate with logistic regression

lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)
pred_proba = lr.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, pred, pred_proba) # evaluation metrics function

C:\Users\mit005\anaconda3\Lib\site-packages\sklearn\utils\validation.py:1184: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

Out[64]:

LogisticRegression()


Confusion matrix
[[90 10]
 [21 33]]
Accuracy : 0.7987, Precision : 0.7674, Recall : 0.6111, F1 : 0.6804, AUC : 0.8433

=> After preprocessing, recall rose from 0.5926 to 0.6111

=> Next, tune the model

Model tuning¶

In [91]:

# Function that prints the evaluation metrics for each threshold

def get_eval_by_thresholds(y_test, pred_proba_c1, thresholds): # pred_proba_c1 is the predicted probability of class 1, shaped (n_samples, 1)
    
    # Iterate over the values in the thresholds list and evaluate at each one
    for custom_threshold in thresholds:
        binarizer = Binarizer(threshold = custom_threshold) # values greater than the threshold become 1, all others 0
        custom_predict = binarizer.fit_transform(pred_proba_c1) # predictions as 0 / 1
        print('\nThreshold : ', custom_threshold)
        get_clf_eval(y_test, custom_predict, pred_proba_c1.ravel()) # use the function argument rather than the global pred_proba

In [92]:

# Print the evaluation metrics for each threshold

thresholds = [0.3, 0.33, 0.36, 0.42, 0.45, 0.48, 0.5]
pred_proba = lr.predict_proba(X_test)
get_eval_by_thresholds(y_test, pred_proba[:, 1].reshape(-1, 1), thresholds)

Threshold :  0.3
Confusion matrix
[[67 33]
 [11 43]]
Accuracy : 0.7143, Precision : 0.5658, Recall : 0.7963, F1 : 0.6615, AUC : 0.8433

Threshold :  0.33
Confusion matrix
[[72 28]
 [12 42]]
Accuracy : 0.7403, Precision : 0.6000, Recall : 0.7778, F1 : 0.6774, AUC : 0.8433

Threshold :  0.36
Confusion matrix
[[76 24]
 [15 39]]
Accuracy : 0.7468, Precision : 0.6190, Recall : 0.7222, F1 : 0.6667, AUC : 0.8433

Threshold :  0.42
Confusion matrix
[[84 16]
 [18 36]]
Accuracy : 0.7792, Precision : 0.6923, Recall : 0.6667, F1 : 0.6792, AUC : 0.8433

Threshold :  0.45
Confusion matrix
[[85 15]
 [18 36]]
Accuracy : 0.7857, Precision : 0.7059, Recall : 0.6667, F1 : 0.6857, AUC : 0.8433

Threshold :  0.48
Confusion matrix
[[88 12]
 [19 35]]
Accuracy : 0.7987, Precision : 0.7447, Recall : 0.6481, F1 : 0.6931, AUC : 0.8433

Threshold :  0.5
Confusion matrix
[[90 10]
 [21 33]]
Accuracy : 0.7987, Precision : 0.7674, Recall : 0.6111, F1 : 0.6804, AUC : 0.8433

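=> A threshold of 0.48 gives the best trade-off here: the highest F1 (0.6931), with the same accuracy (0.7987) as the default 0.5 and a higher recall

=> Lower the threshold to 0.48 and predict again with the logistic regression model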
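
The comparison across thresholds can be reproduced from the confusion matrices alone. This is a small sketch that copies the seven matrices printed above and ranks the thresholds by F1; f1_from_counts is a helper defined only for this illustration.

```python
# (TN, FP, FN, TP) per threshold, copied from the confusion matrices printed above
results = {
    0.30: (67, 33, 11, 43),
    0.33: (72, 28, 12, 42),
    0.36: (76, 24, 15, 39),
    0.42: (84, 16, 18, 36),
    0.45: (85, 15, 18, 36),
    0.48: (88, 12, 19, 35),
    0.50: (90, 10, 21, 33),
}

def f1_from_counts(tn, fp, fn, tp):
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

f1_by_threshold = {th: f1_from_counts(*counts) for th, counts in results.items()}
best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)

for th, f1 in f1_by_threshold.items():
    print(f"threshold={th:.2f}  F1={f1:.4f}")
print("best threshold by F1:", best_threshold)
```

Ranking by F1 is one possible criterion; if missing a Positive case is considered much costlier than a false alarm, a lower threshold with higher recall may be preferable.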

In [83]:

# Create a Binarizer with the threshold set to 0.48

binarizer = Binarizer(threshold = 0.48)
pred_th_048 = binarizer.fit_transform(pred_proba[:, 1].reshape(-1, 1))

get_clf_eval(y_test, pred_th_048, pred_proba[:, 1]) # evaluation metrics function

Confusion matrix
[[88 12]
 [19 35]]
Accuracy : 0.7987, Precision : 0.7447, Recall : 0.6481, F1 : 0.6931, AUC : 0.8433
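
Binarizer maps values strictly greater than the threshold to 1 and everything else to 0, so the same predictions can be written as a plain NumPy comparison. A minimal sketch with made-up probabilities (proba_demo is a hypothetical array, not the model's output):

```python
import numpy as np

# Made-up class-1 probabilities, for illustration only
proba_demo = np.array([0.10, 0.48, 0.49, 0.50, 0.90])

# Strictly greater than the threshold -> 1, otherwise 0
pred_at_050 = (proba_demo > 0.50).astype(int)
pred_at_048 = (proba_demo > 0.48).astype(int)

print(pred_at_050)
print(pred_at_048)
```

Lowering the threshold from 0.50 to 0.48 flips only the samples whose probability lies in between, which is why just a few cells of the confusion matrix change.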

Conclusion¶

Baseline model, with no preprocessing

Accuracy : 0.7727, Precision : 0.7111, Recall : 0.5926, F1 : 0.6465, AUC : 0.8083

Model after data preprocessing

Accuracy : 0.7987, Precision : 0.7674, Recall : 0.6111, F1 : 0.6804, AUC : 0.8433

Model after tuning (threshold 0.48)

Accuracy : 0.7987, Precision : 0.7447, Recall : 0.6481, F1 : 0.6931, AUC : 0.8433
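
To make the three results easier to compare, they can be placed in one table. This is a small sketch that re-enters the numbers reported above; the row labels are chosen here for illustration.

```python
import pandas as pd

# Metrics copied from the three results reported above
summary = pd.DataFrame(
    {
        "accuracy": [0.7727, 0.7987, 0.7987],
        "precision": [0.7111, 0.7674, 0.7447],
        "recall": [0.5926, 0.6111, 0.6481],
        "f1": [0.6465, 0.6804, 0.6931],
        "auc": [0.8083, 0.8433, 0.8433],
    },
    index=["baseline", "preprocessed", "threshold_0.48"],
)

# Change of each metric from the baseline to the final setting
gain = summary.loc["threshold_0.48"] - summary.loc["baseline"]

print(summary)
print(gain.round(4))
```

AUC is identical for the last two rows because it is computed from the predicted probabilities, which do not change when only the decision threshold is moved.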

Pima Indians Diabetes Prediction