Dimensionality Reduction - LDA, SVD, NMF

yuurimingg 2024. 2. 6. 11:28

LDA (Linear Discriminant Analysis)

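LDA reduces dimensionality while preserving as much as possible of the information that separates the individual classes, which makes it well suited to supervised classification.

To find the axes that maximize class separation in the projected space, it maximizes the ratio of between-class variance to within-class variance.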
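The ratio described above can be made concrete with the within-class and between-class scatter matrices. The following is a minimal NumPy sketch, not scikit-learn's implementation: the names `S_W`, `S_B`, and `W` are illustrative, and it assumes the textbook formulation in which the projection axes are the leading eigenvectors of inv(S_W) @ S_B.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
overall_mean = X.mean(axis=0)
n_features = X.shape[1]

S_W = np.zeros((n_features, n_features))  # within-class scatter
S_B = np.zeros((n_features, n_features))  # between-class scatter
for c in np.unique(y):
    X_c = X[y == c]
    mean_c = X_c.mean(axis=0)
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (diff @ diff.T)

# Axes that maximize between-class scatter relative to within-class scatter
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
eigvals, eigvecs = eigvals.real, eigvecs.real  # drop tiny imaginary round-off
order = np.argsort(eigvals)[::-1]

W = eigvecs[:, order[:2]]  # keep the 2 most discriminative axes
X_lda = X @ W
print(X_lda.shape)
print(np.round(eigvals[order], 4))
```

With 3 classes, S_B has rank at most 2, so only two eigenvalues are meaningfully nonzero. That is why the iris data can be reduced to at most 2 LDA components.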

Applying LDA to the Iris Dataset

In [2]:

# Load the iris dataset and standardize it (zero mean, unit variance)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris() # load the data
iris_scaled = StandardScaler().fit_transform(iris.data) # standardize the features

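Transform the iris data with LDA into 2 components.

The transform needs the class labels: note that the target values are passed to the LDA object's fit() method below, because LDA is supervised.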

In [3]:

lda = LinearDiscriminantAnalysis(n_components = 2) # create the LDA object
lda.fit(iris_scaled, iris.target) # fit LDA (scaled data, target labels)
iris_lda = lda.transform(iris_scaled)

print(iris_lda.shape)

(150, 2)
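As a small sketch of why 2 components is the natural choice here: LDA can produce at most min(n_features, n_classes - 1) components, which is 2 for the 3 iris species. The variable names below (`ratio`, `too_many_ok`) are illustrative, and the error behavior assumes a reasonably recent scikit-learn release.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

# Share of the between-class variance captured by each discriminant axis
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, iris.target)
ratio = lda.explained_variance_ratio_
print(ratio)

# Asking for more than n_classes - 1 = 2 components is rejected at fit time
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, iris.target)
    too_many_ok = True
except ValueError as err:
    too_many_ok = False
    print(err)
```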

In [4]:

# Plot the LDA-transformed data on a 2D plane, by species

import pandas as pd
import matplotlib.pyplot as plt

lda_columns = ['lda_component_1', 'lda_component_2']
iris_df_lda = pd.DataFrame(iris_lda, columns = lda_columns)
iris_df_lda['target'] = iris.target

# setosa as triangles, versicolor as squares, virginica as circles

markers = ['^', 's', 'o']

# target values: setosa is 0, versicolor is 1, virginica is 2
# draw a scatter plot with a different marker for each target

for i, marker in enumerate(markers):
    x_axis_data = iris_df_lda[iris_df_lda['target'] == i]['lda_component_1']
    y_axis_data = iris_df_lda[iris_df_lda['target'] == i]['lda_component_2']
    
    plt.scatter(x_axis_data, y_axis_data, marker = marker, label = iris.target_names[i])
    
plt.legend()
plt.xlabel('lda_component_1')
plt.ylabel('lda_component_2')

Out[4]:

Text(0, 0.5, 'lda_component_2')

SVD (Singular Value Decomposition)

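SVD is a matrix factorization technique.

It applies not only to square matrices but also to matrices whose numbers of rows and columns differ.

In general, SVD factors an m X n matrix A as follows:

$ A = U \Sigma V^T $

SVD stands for singular value decomposition. The vectors that make up U and V are called singular vectors, and the singular vectors within each matrix are mutually orthogonal.

$ \Sigma $ is a diagonal matrix: only the entries on its diagonal can be nonzero, and every other entry is 0.

=> When A is m X n, its reduced form factors it into U of size m X p, $\Sigma$ of size p X p, and $V^T$ of size p X n.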
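The shape and orthogonality claims above can be checked numerically. This is a small sketch on an arbitrary 5 X 3 random matrix (the matrix and the variable names are assumptions made for illustration), using `numpy.linalg.svd` with and without `full_matrices`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # non-square: m = 5, n = 3

# Full SVD keeps a square U; the reduced form trims U to p = min(m, n) columns
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=True)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U_full.shape, s_full.shape, Vt_full.shape)  # (5, 5) (3,) (3, 3)
print(U.shape, s.shape, Vt.shape)                 # (5, 3) (3,) (3, 3)

# Singular vectors are orthonormal: U^T U = I and V^T V = I
cols_orthonormal = np.allclose(U.T @ U, np.eye(3))
rows_orthonormal = np.allclose(Vt @ Vt.T, np.eye(3))

# Multiplying the factors back together restores A
restored = np.allclose(U @ np.diag(s) @ Vt, A)
print(cols_orthonormal, rows_orthonormal, restored)
```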

In [6]:

# Import NumPy's SVD function

import numpy as np
from numpy.linalg import svd

# Create a random 4X4 matrix a
np.random.seed(121)
a = np.random.randn(4, 4)
print(np.round(a, 3))

[[-0.212 -0.285 -0.574 -0.44 ]
 [-0.33   1.184  1.615  0.367]
 [-0.014  0.63   1.71  -1.327]
 [ 0.402 -0.191  1.404 -1.969]]

Apply SVD to the matrix a to obtain U, Sigma, and Vt.

In [9]:

U, Sigma, Vt = svd(a)
print(U.shape, Sigma.shape, Vt.shape)
print('\nU matrix : \n', np.round(U, 3))
print('\nSigma matrix : \n', np.round(Sigma, 3)) # only the diagonal entries are nonzero, so svd returns just those values as a 1D array
print('\nVt matrix : \n', np.round(Vt, 3))

(4, 4) (4,) (4, 4)

U matrix : 
 [[-0.079 -0.318  0.867  0.376]
 [ 0.383  0.787  0.12   0.469]
 [ 0.656  0.022  0.357 -0.664]
 [ 0.645 -0.529 -0.328  0.444]]

Sigma matrix : 
 [3.423 2.023 0.463 0.079]

Vt matrix : 
 [[ 0.041  0.224  0.786 -0.574]
 [-0.2    0.562  0.37   0.712]
 [-0.778  0.395 -0.333 -0.357]
 [-0.593 -0.692  0.366  0.189]]

In [14]:

# Restore the original matrix from the factors U, Sigma, and Vt
# svd returned only the diagonal values of Sigma as a 1D array, so first
# rebuild the full diagonal matrix (zeros off the diagonal), then multiply

Sigma_mat = np.diag(Sigma)
print(np.round(Sigma_mat, 3))

print()

a_ = np.dot(np.dot(U, Sigma_mat), Vt) # matrix product
print(np.round(a_, 3))

[[3.423 0.    0.    0.   ]
 [0.    2.023 0.    0.   ]
 [0.    0.    0.463 0.   ]
 [0.    0.    0.    0.079]]

[[-0.212 -0.285 -0.574 -0.44 ]
 [-0.33   1.184  1.615  0.367]
 [-0.014  0.63   1.71  -1.327]
 [ 0.402 -0.191  1.404 -1.969]]

Next, check how the Sigma values change when the rows of the dataset depend on one another, and how that makes dimensionality reduction possible.

In [15]:

a[2] = a[0] + a[1]
a[3] = a[0]
print(np.round(a, 3))

[[-0.212 -0.285 -0.574 -0.44 ]
 [-0.33   1.184  1.615  0.367]
 [-0.542  0.899  1.041 -0.073]
 [-0.212 -0.285 -0.574 -0.44 ]]

In [16]:

# Run SVD again and check the Sigma values

U, Sigma, Vt = svd(a)
print(U.shape, Sigma.shape, Vt.shape)
print('\nSigma Value : \n', np.round(Sigma, 3))

(4, 4) (4,) (4, 4)

Sigma Value : 
 [2.663 0.807 0.    0.   ]
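The two zero singular values reflect the rank of the matrix: rows 3 and 4 were built from rows 1 and 2, so only 2 rows are linearly independent. As a small sketch (the tolerance formula mirrors the default that `numpy.linalg.matrix_rank` documents; the names `tol` and `n_nonzero` are illustrative), the count of nonzero singular values can be compared with the rank directly.

```python
import numpy as np

np.random.seed(121)
a = np.random.randn(4, 4)
a[2] = a[0] + a[1]  # row 3 is a linear combination of rows 1 and 2
a[3] = a[0]         # row 4 duplicates row 1

sigma = np.linalg.svd(a, compute_uv=False)

# Treat singular values below a floating-point tolerance as zero
tol = sigma.max() * max(a.shape) * np.finfo(a.dtype).eps
n_nonzero = int(np.sum(sigma > tol))

rank = np.linalg.matrix_rank(a)
print(n_nonzero, rank)
```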

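This time, instead of using all of U, Sigma, and Vt, restore the matrix after dropping the parts of U, Sigma, and Vt that correspond to the zero singular values.

-> Only the first 2 values of Sigma are nonzero, so take only the first two columns of U and only the first two rows of Vt.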

In [18]:

# U is multiplied by Sigma, so keep only the first 2 columns of U, matching the first 2 rows of Sigma

U_ = U[:, :2]
Sigma_ = np.diag(Sigma[:2])

# For the transposed matrix Vt, keep only the first 2 rows
Vt_ = Vt[:2]
print(U_.shape, Sigma_.shape, Vt_.shape)

print()

# Multiply U_, Sigma_, and Vt_ to restore the original matrix
a_ = np.dot(np.dot(U_, Sigma_), Vt_)
print(np.round(a_, 3))

(4, 2) (2, 2) (2, 4)

[[-0.212 -0.285 -0.574 -0.44 ]
 [-0.33   1.184  1.615  0.367]
 [-0.542  0.899  1.041 -0.073]
 [-0.212 -0.285 -0.574 -0.44 ]]

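Factoring a matrix with Truncated SVD

Truncated SVD keeps only the largest few diagonal entries of $ \Sigma $, that is, the top singular values, and factors the matrix with those.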

In [21]:

import numpy as np
from scipy.sparse.linalg import svds
from scipy.linalg import svd

# Print the original matrix and check the shapes of U, Sigma, Vt from a regular SVD
np.random.seed(121)
matrix = np.random.random((6, 6))
print('Original matrix : \n', matrix)
U, Sigma, Vt = svd(matrix, full_matrices = False)
print('\n Factor shapes : ', U.shape, Sigma.shape, Vt.shape)


# Run Truncated SVD, keeping 4 singular values
num_components = 4
U_tr, Sigma_tr, Vt_tr = svds(matrix, k = num_components)
print('\nTruncated SVD factor shapes : ', U_tr.shape, Sigma_tr.shape, Vt_tr.shape)
print('\nTruncated SVD Sigma values : ', Sigma_tr)
matrix_tr = np.dot(np.dot(U_tr, np.diag(Sigma_tr)), Vt_tr) # output of TruncatedSVD

print('\nMatrix restored from the Truncated SVD factors : \n', matrix_tr)

Original matrix : 
 [[0.11133083 0.21076757 0.23296249 0.15194456 0.83017814 0.40791941]
 [0.5557906  0.74552394 0.24849976 0.9686594  0.95268418 0.48984885]
 [0.01829731 0.85760612 0.40493829 0.62247394 0.29537149 0.92958852]
 [0.4056155  0.56730065 0.24575605 0.22573721 0.03827786 0.58098021]
 [0.82925331 0.77326256 0.94693849 0.73632338 0.67328275 0.74517176]
 [0.51161442 0.46920965 0.6439515  0.82081228 0.14548493 0.01806415]]

 Factor shapes :  (6, 6) (6,) (6, 6)

Truncated SVD factor shapes :  (6, 4) (4,) (4, 6)

Truncated SVD Sigma values :  [0.55463089 0.83865238 0.88116505 3.2535007 ]

Matrix restored from the Truncated SVD factors : 
 [[0.19222941 0.21792946 0.15951023 0.14084013 0.81641405 0.42533093]
 [0.44874275 0.72204422 0.34594106 0.99148577 0.96866325 0.4754868 ]
 [0.12656662 0.88860729 0.30625735 0.59517439 0.28036734 0.93961948]
 [0.23989012 0.51026588 0.39697353 0.27308905 0.05971563 0.57156395]
 [0.83806144 0.78847467 0.93868685 0.72673231 0.6740867  0.73812389]
 [0.59726589 0.47953891 0.56613544 0.80746028 0.13135039 0.03479656]]

=> Restoring the matrix from the Truncated SVD factors does not reproduce it exactly; the result is only an approximation.
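How close the approximation is can be quantified. As a hedged sketch: for a rank-k truncation, the Frobenius norm of the reconstruction error should equal the square root of the sum of the squared singular values that were discarded (the Eckart-Young result). The code below checks this on the same 6 X 6 matrix. It sorts the `svds` output explicitly, since the Sigma values printed above came back in ascending order; the names `error` and `expected` are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import svds

np.random.seed(121)
matrix = np.random.random((6, 6))
k = 4

# Truncated SVD, reordered so the largest singular value comes first
U_tr, s_tr, Vt_tr = svds(matrix, k=k)
order = np.argsort(s_tr)[::-1]
U_tr, s_tr, Vt_tr = U_tr[:, order], s_tr[order], Vt_tr[order]

# Frobenius norm of the reconstruction error
matrix_tr = U_tr @ np.diag(s_tr) @ Vt_tr
error = np.linalg.norm(matrix - matrix_tr)

# Size of the singular values that were dropped (the 5th and 6th)
s_full = np.linalg.svd(matrix, compute_uv=False)
expected = np.sqrt(np.sum(s_full[k:] ** 2))
print(error, expected)
```

The smaller the discarded singular values, the closer the restored matrix is to the original.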

Transforming with scikit-learn's TruncatedSVD class

Calling fit() and transform() reduces the original data to a small number of principal components (the K components of Truncated SVD).

In [22]:

from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

iris = load_iris()
iris_ftrs = iris.data
# TruncatedSVD transform into 2 principal components
tsvd = TruncatedSVD(n_components = 2)
tsvd.fit(iris_ftrs)
iris_tsvd = tsvd.transform(iris_ftrs)

# Plot the TruncatedSVD-transformed data as a 2D scatter plot, colored by species
plt.scatter(x = iris_tsvd[:, 0], y = iris_tsvd[:, 1], c = iris.target)
plt.xlabel('TruncatedSVD Component1')
plt.ylabel('TruncatedSVD Component2')

Out[22]:

Text(0, 0.5, 'TruncatedSVD Component2')
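To see what the fitted object holds, here is a brief sketch that inspects a few `TruncatedSVD` attributes and checks that `transform()` amounts to projecting the data onto the rows of `components_`. Setting `random_state=0` is an assumption added for reproducibility; the original cell does not set it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD

X = load_iris().data
tsvd = TruncatedSVD(n_components=2, random_state=0)
X_tsvd = tsvd.fit(X).transform(X)

print(tsvd.components_.shape)          # one row per component, one column per feature
print(tsvd.singular_values_)           # the 2 singular values that were kept
print(tsvd.explained_variance_ratio_)  # variance share captured by each component

# transform() projects the data onto the component vectors
projected = X @ tsvd.components_.T
same = np.allclose(X_tsvd, projected)
print(same)
```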

Run TruncatedSVD after standardizing the iris data.

In [24]:

from sklearn.preprocessing import StandardScaler

# Standardize the iris data with StandardScaler
scaler = StandardScaler()
iris_scaled = scaler.fit_transform(iris_ftrs) # iris_ftrs : features of the iris data

# Run the TruncatedSVD transform on the scaled data
tsvd = TruncatedSVD(n_components = 2)
tsvd.fit(iris_scaled)
iris_tsvd = tsvd.transform(iris_scaled)

# Run the PCA transform on the scaled data
pca = PCA(n_components = 2)
pca.fit(iris_scaled)
iris_pca = pca.transform(iris_scaled)

# Show the TruncatedSVD result on the left and the PCA result on the right
fig, (ax1, ax2) = plt.subplots(figsize = (9, 4), ncols = 2)
ax1.scatter(x = iris_tsvd[:, 0], y = iris_tsvd[:, 1], c = iris.target)
ax2.scatter(x = iris_pca[:, 0], y = iris_pca[:, 1], c = iris.target)
ax1.set_title('Truncated SVD Transformed')
ax2.set_title('PCA Transformed')

Out[24]:

Text(0.5, 1.0, 'PCA Transformed')

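Comparing the two transformed datasets and the component vectors (the per-feature weights of each component) directly shows that they are almost identical. StandardScaler centers every feature at 0, and on centered data PCA and TruncatedSVD compute the same decomposition.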

In [25]:

print((iris_pca - iris_tsvd).mean())
print((pca.components_ - tsvd.components_).mean())

2.3422698965565777e-15
3.903127820947816e-18
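A mean of differences can hide errors that cancel out, and the sign of a component may differ between the two classes depending on the library version. The following sketch is therefore a slightly stricter check, offered as an assumption-laden illustration rather than the original author's method: it compares absolute values element by element, and repeats the comparison on the unscaled data to show that the match depends on centering. The helper `project` and `random_state=0` are hypothetical additions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

def project(model, data):
    """Fit the model, transform the data, and drop signs (a component's sign is arbitrary)."""
    return np.abs(model.fit(data).transform(data))

# Centered data: PCA and TruncatedSVD should agree element by element
pca_scaled = project(PCA(n_components=2), X_scaled)
tsvd_scaled = project(TruncatedSVD(n_components=2, random_state=0), X_scaled)
same_on_centered = np.allclose(pca_scaled, tsvd_scaled, atol=1e-6)

# Raw data: PCA centers internally and TruncatedSVD does not, so they differ
pca_raw = project(PCA(n_components=2), X)
tsvd_raw = project(TruncatedSVD(n_components=2, random_state=0), X)
same_on_raw = np.allclose(pca_raw, tsvd_raw, atol=1e-6)

print(same_on_centered, same_on_raw)
```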

NMF (Non-Negative Matrix Factorization)

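Like Truncated SVD, NMF is a variant of low-rank matrix approximation.

When every element of the original matrix is guaranteed to be non-negative, the matrix can be factored more simply into two non-negative matrices: V ~ W X H (~ : approximately equal).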
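The factorization V ~ W X H can be inspected directly. In this sketch on the iris features, W is what `fit_transform()` returns and H is the fitted `components_` attribute. The settings `init='nndsvda'`, `max_iter=1000`, and `random_state=0` are assumptions chosen for a stable, reproducible run and are not taken from the original post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import NMF

V = load_iris().data  # all measurements are non-negative

nmf = NMF(n_components=2, init='nndsvda', max_iter=1000, random_state=0)
W = nmf.fit_transform(V)  # (n_samples, n_components)
H = nmf.components_       # (n_components, n_features)
print(W.shape, H.shape)   # (150, 2) (2, 4)

# Both factors stay non-negative, and W @ H approximates V
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.min() >= 0, H.min() >= 0, rel_error)
```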

Transform the iris data into 2 components with NMF and visualize the result.

In [30]:

from sklearn.decomposition import NMF
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

iris = load_iris()
iris_ftrs = iris.data
nmf = NMF(n_components = 2)
nmf.fit(iris_ftrs)
iris_nmf = nmf.transform(iris_ftrs)

print(iris_nmf[:5])

plt.scatter(x = iris_nmf[:, 0], y = iris_nmf[:, 1], c = iris.target)
plt.xlabel('NMF Component1')
plt.ylabel('NMF Component2')

[[0.41349967 0.10467301]
 [0.36544877 0.14097835]
 [0.37779974 0.10188458]
 [0.34998054 0.14896392]
 [0.41589583 0.0953063 ]]

Out[30]:

Text(0, 0.5, 'NMF Component2')