Dimensionality Reduction - PCA

yuurimingg 2024. 2. 5. 17:20

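Dimensionality Reduction¶

In general, dimensionality reduction is divided into feature selection and feature extraction.

Feature selection: as the name suggests, remove unnecessary features that depend strongly on other features and keep only the main features that best represent the data.

Feature extraction: compress the existing features into a smaller set of important low-dimensional features. This is not simple compression; the features are mapped to another space that describes them more compactly.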

PCA¶

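PCA (Principal Component Analysis) is the most representative dimensionality reduction technique.

It reduces dimensionality by using the correlations among several variables to extract the principal components that represent them.

PCA is carried out in the following steps:

Build the covariance matrix of the input data set
Compute the eigenvectors and eigenvalues of the covariance matrix
Extract the K eigenvectors with the largest eigenvalues
Transform the input data using the eigenvectors extracted in order of largest eigenvalue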
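
The four steps above can be sketched directly with NumPy. This is a minimal illustration, not how scikit-learn implements PCA internally (it uses an SVD-based solver). The names `X_manual`, `components`, and `k` are chosen here for illustration. Because the sign of each eigenvector is arbitrary, the comparison with scikit-learn uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardized iris features (mean 0 per column, so the data is already centered)
X = StandardScaler().fit_transform(load_iris().data)

# Step 1: covariance matrix of the features
cov = np.cov(X, rowvar=False)

# Step 2: eigenvalues and eigenvectors (eigh is meant for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: keep the K eigenvectors with the largest eigenvalues
k = 2
order = np.argsort(eigvals)[::-1][:k]
components = eigvecs[:, order]            # shape (4, k)

# Step 4: project the data onto those eigenvectors
X_manual = X @ components                 # shape (150, k)

# Share of total variance carried by each kept component
manual_ratio = eigvals[order] / eigvals.sum()
print('explained variance ratio:', manual_ratio)

# Compare with scikit-learn, ignoring the arbitrary sign of each component
X_sklearn = PCA(n_components=k).fit_transform(X)
print('matches sklearn up to sign:', np.allclose(np.abs(X_manual), np.abs(X_sklearn)))
```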

Using the iris data, compress the 4 features into 2 PCA dimensions and check how the original data set and the compressed data set differ.

In [2]:

from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt

iris = load_iris()

# Convert the NumPy data set to a pandas DataFrame
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
df = pd.DataFrame(iris.data, columns = columns)
df['target'] = iris.target

df.head()

Out[2]:

	sepal_length	sepal_width	petal_length	petal_width
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

In [6]:

iris.keys()

Out[6]:

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [3]:

# Check how the original iris data is distributed by species (sepal_length, sepal_width)
# setosa is a triangle, versicolor a square, virginica a circle
markers = ['^', 's', 'o']

# The target value is 0 for setosa, 1 for versicolor, 2 for virginica
# Draw a scatter plot with a different marker for each target

for i, marker in enumerate(markers):
    x_axis_data = df[df['target'] == i]['sepal_length']
    y_axis_data = df[df['target'] == i]['sepal_width']
    plt.scatter(x_axis_data, y_axis_data, marker = marker, label = iris.target_names[i])
    
plt.legend()
plt.xlabel('sepal_length')
plt.ylabel('sepal_width')

Out[3]:

Text(0, 0.5, 'sepal_width')

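Next, compress the 4 features into 2 with PCA and, as in the previous example, visualize the iris species distribution in two dimensions using the 2 PCA features.

Before applying PCA to the iris data set, bring the individual features to the same scale, because PCA is sensitive to the scale of each feature.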

In [7]:

from sklearn.preprocessing import StandardScaler

# Standardize every feature except the target to zero mean and unit variance with StandardScaler
iris_scaled = StandardScaler().fit_transform(df.iloc[:, :-1])
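
To see why scaling comes first, the sketch below fits PCA on the raw and on the standardized iris features and compares the explained variance ratios. It is only an illustration; the names `X_raw`, `X_std`, `ratio_raw`, and `ratio_std` are introduced here and do not appear in the rest of the post. Without scaling, the feature with the largest raw variance pulls the first component toward itself.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data
X_std = StandardScaler().fit_transform(X_raw)

# Per-feature variance before and after standardization
print('raw feature variances    :', X_raw.var(axis=0).round(2))
print('scaled feature variances :', X_std.var(axis=0).round(2))

# Explained variance ratio of the first two components in each case
ratio_raw = PCA(n_components=2).fit(X_raw).explained_variance_ratio_
ratio_std = PCA(n_components=2).fit(X_std).explained_variance_ratio_
print('PCA on raw data    :', ratio_raw)
print('PCA on scaled data :', ratio_std)
```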

In [8]:

from sklearn.decomposition import PCA

pca = PCA(n_components = 2)

# Call fit() and transform() to get the PCA-transformed data
pca.fit(iris_scaled) # fit on the scaled iris data set
iris_pca = pca.transform(iris_scaled) # transform the scaled iris data set with PCA
print(iris_pca.shape)

(150, 2)

iris_pca holds the PCA-transformed data set as a 150 x 2 NumPy array.

Convert it to a DataFrame and check the values.

In [11]:

# Name the columns of the PCA-transformed data pca_component_1 and pca_component_2
pca_columns = ['pca_component_1', 'pca_component_2']
iris_df_pca = pd.DataFrame(iris_pca, columns = pca_columns)
iris_df_pca['target'] = iris.target
iris_df_pca.head()

Out[11]:

	pca_component_1	pca_component_2
0	-2.264703	0.480027
1	-2.080961	-0.674134
2	-2.364229	-0.341908
3	-2.299384	-0.597395
4	-2.389842	0.646835

In [13]:

# setosa is a triangle, versicolor a square, virginica a circle
markers = ['^', 's', 'o']

# Scatter plot with pca_component_1 on the x-axis and pca_component_2 on the y-axis
for i, marker in enumerate(markers):
    x_axis_data = iris_df_pca[iris_df_pca['target'] == i]['pca_component_1']
    y_axis_data = iris_df_pca[iris_df_pca['target'] == i]['pca_component_2']
    plt.scatter(x_axis_data, y_axis_data, marker = marker, label = iris.target_names[i])
    
plt.legend()
plt.xlabel('pca_component_1')
plt.ylabel('pca_component_2')

Out[13]:

Text(0, 0.5, 'pca_component_2')

In [14]:

# Check how much of the original data's variance each PCA component captures

pca.explained_variance_ratio_

Out[14]:

array([0.72962445, 0.22850762])
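
The two ratios above sum to roughly 0.958, so two components keep about 96% of the variance. The sketch below shows one way to inspect the cumulative ratio and to let scikit-learn pick the number of components for a target share of variance. Passing a float between 0 and 1 as `n_components` is a documented option of `PCA`; the 0.95 threshold and the names `pca_full`, `cumulative`, and `pca_95` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit with all components to see the full variance breakdown
pca_full = PCA().fit(iris_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print('cumulative explained variance:', cumulative)

# A float in (0, 1) asks for the fewest components whose
# cumulative explained variance exceeds that share
pca_95 = PCA(n_components=0.95).fit(iris_scaled)
print('components needed for 95%:', pca_95.n_components_)
```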

The estimator is RandomForestClassifier, and cross_val_score() is used to compare accuracy over 3 cross-validation folds.

In [15]:

# Apply random forest to the original iris data

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

rf = RandomForestClassifier(random_state = 156)
scores = cross_val_score(rf, iris.data, iris.target, scoring = 'accuracy', cv = 3)
print('Original data cross-validation accuracy per fold : ', scores)
print('Original data mean accuracy : ', np.mean(scores))

Original data cross-validation accuracy per fold :  [0.98 0.94 0.96]
Original data mean accuracy :  0.96

In [16]:

# Apply random forest to the data set PCA-transformed from 4 dimensions to 2

pca_x = iris_df_pca[['pca_component_1', 'pca_component_2']]
scores_pca = cross_val_score(rf, pca_x, iris.target, scoring = 'accuracy', cv = 3)
print('PCA-transformed data cross-validation accuracy per fold : ', scores_pca)
print('PCA-transformed data mean accuracy : ', np.mean(scores_pca))

PCA-transformed data cross-validation accuracy per fold :  [0.88 0.88 0.88]
PCA-transformed data mean accuracy :  0.88

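Compared with the original data set, prediction performance usually drops after a PCA transform, and the size of the drop depends on how many dimensions are kept. Here mean accuracy fell from 0.96 to 0.88 when the 4 features were reduced to 2 components.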
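
In the cells above, StandardScaler and PCA are fit on all 150 rows before cross_val_score() splits them, so each validation fold has already influenced the transform. A hedged variation, not the approach used in this post, wraps the steps in a scikit-learn pipeline so that scaling and PCA are refit on the training part of every fold. The names `pipe` and `scores_pipe` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Scaling, PCA, and the classifier are fit together inside each training fold
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    RandomForestClassifier(random_state=156),
)

scores_pipe = cross_val_score(pipe, iris.data, iris.target, scoring='accuracy', cv=3)
print('Pipeline accuracy per fold : ', scores_pipe)
print('Pipeline mean accuracy : ', np.mean(scores_pipe))
```

On a data set as small and clean as iris the difference is likely to be minor, but the pipeline form keeps the evaluation honest when the transform has many parameters to learn.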

Next, transform a data set with more features into a small number of PCA components and compare how prediction is affected on the transformed data.

The data set used is a credit card customer data set.

Credit Card Customer Data Set¶

In [22]:

# !pip install "xlrd>=2.0.1"

In [23]:

# Load the data sheet into a DataFrame
# header = 1 skips the meaningless first row; iloc drops the existing ID column

import pandas as pd

df = pd.read_excel('./data/pca_credit_card.xls', header = 1, sheet_name = 'Data').iloc[0:,1:]
print(df.shape)
df.head()

(30000, 24)

Out[23]:

	LIMIT_BAL	SEX	EDUCATION	MARRIAGE	AGE	PAY_0	PAY_2	PAY_3	PAY_4	PAY_5	...	BILL_AMT4	BILL_AMT5	BILL_AMT6	PAY_AMT1	PAY_AMT2	PAY_AMT3	PAY_AMT4	PAY_AMT5	PAY_AMT6	default payment next month
0	20000	2	2	1	24	2	2	-1	-1	-2	...	0	0	0	0	689	0	0	0	0	1
1	120000	2	2	2	26	-1	2	0	0	0	...	3272	3455	3261	0	1000	1000	1000	0	2000	1
2	90000	2	2	2	34	0	0	0	0	0	...	14331	14948	15549	1518	1500	1000	1000	1000	5000	0
3	50000	2	2	1	37	0	0	0	0	0	...	28314	28959	29547	2000	2019	1200	1100	1069	1000	0
4	50000	1	2	1	57	-1	0	-1	0	0	...	20940	19146	19131	2000	36681	10000	9000	689	679	0

5 rows × 24 columns

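The 'default payment next month' column is the target. It indicates whether the customer defaults next month: 1 for default, 0 for normal payment.

PAY_0 is followed directly by PAY_2, so rename PAY_0 to PAY_1. The target column is also renamed to 'default'.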

In [25]:

df.rename(columns = {'PAY_0' : 'PAY_1', 'default payment next month' : 'default'}, inplace = True)
y_target = df['default']
X_features = df.drop('default', axis = 1)

In [26]:

# Compute the correlation between features and plot it

import seaborn as sns
import matplotlib.pyplot as plt

corr = X_features.corr()
plt.figure(figsize = (14, 14))
sns.heatmap(corr, annot = True, fmt = '.1g')

Out[26]:

<Axes: >

The six features BILL_AMT1 to BILL_AMT6 are very highly correlated with each other, mostly at 0.9 or above.

The features PAY_1 to PAY_6 are also highly correlated, though less so.
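
Reading a large annotated heatmap by eye can be tedious, so the sketch below lists feature pairs whose absolute correlation reaches a threshold. Since the credit card file is not bundled here, it runs on a small synthetic DataFrame. The column names (`bill_a`, `bill_b`, `bill_c`, `other`), the helper `high_corr_pairs`, and the 0.9 threshold are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Synthetic stand-in: three columns sharing one signal plus one independent column
demo = pd.DataFrame({
    'bill_a': base + rng.normal(scale=0.1, size=500),
    'bill_b': base + rng.normal(scale=0.1, size=500),
    'bill_c': base + rng.normal(scale=0.1, size=500),
    'other': rng.normal(size=500),
})

def high_corr_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute correlation is at least threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison also discards any NaN entries left from the masked cells
    return pairs[pairs >= threshold].sort_values(ascending=False)

result = high_corr_pairs(demo)
print(result)
```

The same helper could be called on `X_features` from the cells above to list the correlated groups without scanning the plot.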

Apply PCA to the six features BILL_AMT1 to BILL_AMT6, reduce them to 2 components, and check the variance explained by each component with the explained_variance_ratio_ attribute.

In [36]:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Build the six feature names BILL_AMT1 to BILL_AMT6
cols_bill = ['BILL_AMT'+str(i) for i in range(1, 7)]
print('Target feature names : ', cols_bill)

# Create a PCA object with 2 components and call fit() to compute explained_variance_ratio_
scaler = StandardScaler()
df_cols_scaled = scaler.fit_transform(X_features[cols_bill])
pca = PCA(n_components = 2)
pca.fit(df_cols_scaled)
print('Explained variance ratio per PCA component : ', pca.explained_variance_ratio_)

Target feature names :  ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
Explained variance ratio per PCA component :  [0.90555253 0.0509867 ]

Just 2 PCA components explain about 95.7% of the variance of the 6 features (0.906 + 0.051), and the first component alone explains about 90.6%.
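
The link between high correlation and a dominant first component can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not a reproduction of the credit card result: it builds six columns that share one signal and six columns with no shared signal, then compares their explained variance ratios. All names and the noise scale of 0.3 are chosen here for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
shared = rng.normal(size=(n, 1))

# Six columns that mostly repeat one shared signal (stand-in for correlated features)
correlated = shared + rng.normal(scale=0.3, size=(n, 6))
# Six columns with no shared signal
independent = rng.normal(size=(n, 6))

ratios = {}
for name, data in [('correlated', correlated), ('independent', independent)]:
    scaled = StandardScaler().fit_transform(data)
    ratios[name] = PCA(n_components=2).fit(scaled).explained_variance_ratio_
    print(name, ratios[name].round(3), 'sum =', ratios[name].sum().round(3))
```

When the columns are nearly copies of one another, almost all of the variance lies along a single direction, which is why two components are enough for the BILL_AMT group.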

Compare the classification results on the original data set with those on the data set PCA-transformed to 6 components.

In [40]:

# Predict the target (default) on the original data set with random forest using 3 cross-validation folds

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators = 300, random_state = 156)
scores = cross_val_score(rf, X_features, y_target, scoring = 'accuracy', cv = 3)

print('Accuracy per fold with CV = 3 : ', scores)
print('Mean accuracy : {:.4f}'.format(np.mean(scores)))

Accuracy per fold with CV = 3 :  [0.8083 0.8196 0.8232]
Mean accuracy : 0.8170

In [41]:

# Apply the same classification to the data set PCA-transformed to 6 components

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Apply StandardScaler to the original data set first
scaler = StandardScaler()
df_scaled = scaler.fit_transform(X_features)

# Run a PCA transform with 6 components, then classify with cross_val_score()
pca = PCA(n_components = 6)
df_pca = pca.fit_transform(df_scaled)
scores_pca = cross_val_score(rf, df_pca, y_target, scoring = 'accuracy', cv = 3)

print('Accuracy per fold on PCA-transformed data with CV = 3 : ', scores_pca)
print('PCA-transformed data set mean accuracy : {:.4f}'.format(np.mean(scores_pca)))

Accuracy per fold on PCA-transformed data with CV = 3 :  [0.7914 0.7976 0.8028]
PCA-transformed data set mean accuracy : 0.7973

Reducing the 23 original features to 6 PCA components lowers mean accuracy by about 2 percentage points, from 0.8170 to 0.7973.
