혼자 공부하는 머신러닝+딥러닝 (Self-Study Machine Learning + Deep Learning) / Chapter 06: Unsupervised Learning

Bay Im 2023. 8. 13. 23:27

Chapter 06-1
Clustering Algorithms

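- Unsupervised learning

Machine learning algorithms used when there is no target. They learn something from the data without a person teaching them.

- Clustering

The task of gathering similar samples into groups. It is one of the representative unsupervised learning tasks.

A group produced by a clustering algorithm is called a cluster.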

- Functions used

subplots(): a function that lays out several plots in a grid, like an array

mean(): a method that computes the mean

abs(): a function that computes the absolute value
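
As a small illustration of how mean() and abs() behave on image-like arrays, here is a minimal sketch. The array `images` and its values are made up for this example and are not part of the fruit dataset; the point is only to show what the `axis` argument selects.

```python
import numpy as np

# Three tiny 2x2 "images" stacked into one array of shape (3, 2, 2).
# These values are invented purely for illustration.
images = np.array([
    [[0, 2], [4, 6]],
    [[1, 3], [5, 7]],
    [[2, 4], [6, 8]],
], dtype=float)

flat = images.reshape(-1, 2 * 2)        # shape (3, 4): one row per image

per_image_mean = flat.mean(axis=1)      # one mean per sample, shape (3,)
per_pixel_mean = flat.mean(axis=0)      # one mean per pixel, shape (4,)

# Absolute difference between every image and the first image.
abs_diff = np.abs(images - images[0])   # shape (3, 2, 2)

print(per_image_mean)
print(per_pixel_mean)
```

With `axis=1` the mean is taken across the pixels of each image, and with `axis=0` it is taken across the images for each pixel position.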

Colab practice code

import numpy as np
import matplotlib.pyplot as plt

fruits = np.load('fruits_300.npy')
print("Data array shape: ", fruits.shape)
print(fruits[0, 0, :])

# Show the first (apple) image
plt.imshow(fruits[0], cmap='gray')
plt.show()
plt.imshow(fruits[0], cmap='gray_r')
plt.show()

# Show a pineapple (index 100) and a banana (index 200) side by side
fig, axs = plt.subplots(1, 2)
axs[0].imshow(fruits[100], cmap='gray_r')
axs[1].imshow(fruits[200], cmap='gray_r')
plt.show()

# Flatten each 100x100 image into a 1-D array of 10,000 pixels
apple = fruits[0:100].reshape(-1, 100*100)
pineapple = fruits[100:200].reshape(-1, 100*100)
banana = fruits[200:300].reshape(-1, 100*100)
print("Apple array shape: ", apple.shape)
print("Mean pixel value of each apple image: ", apple.mean(axis=1))

# Compare the per-image mean pixel values with histograms
plt.hist(np.mean(apple, axis=1), alpha=0.8)
plt.hist(np.mean(pineapple, axis=1), alpha=0.8)
plt.hist(np.mean(banana, axis=1), alpha=0.8)
plt.legend(['apple', 'pineapple', 'banana'])
plt.show()

# Compare the per-pixel mean values with bar charts
fig, axs = plt.subplots(1, 3, figsize=(20, 5))
axs[0].bar(range(10000), np.mean(apple, axis=0))
axs[1].bar(range(10000), np.mean(pineapple, axis=0))
axs[2].bar(range(10000), np.mean(banana, axis=0))
plt.show()

# Reshape the per-pixel means to 100x100 to view them as an average image of each fruit
apple_mean = np.mean(apple, axis=0).reshape(100, 100)
pineapple_mean = np.mean(pineapple, axis=0).reshape(100, 100)
banana_mean = np.mean(banana, axis=0).reshape(100, 100)
fig, axs = plt.subplots(1, 3, figsize=(20, 5))
axs[0].imshow(apple_mean, cmap='gray_r')
axs[1].imshow(pineapple_mean, cmap='gray_r')
axs[2].imshow(banana_mean, cmap='gray_r')
plt.show()

# Mean absolute error between every image and the average apple image
abs_diff = np.abs(fruits - apple_mean)
abs_mean = np.mean(abs_diff, axis=(1, 2))
print("Shape of abs_mean: ", abs_mean.shape)

# Show the 100 images with the smallest error, in ascending order
apple_index = np.argsort(abs_mean)[:100]
fig, axs = plt.subplots(10, 10, figsize=(10, 10))
for i in range(10):
    for j in range(10):
        axs[i, j].imshow(fruits[apple_index[i*10 + j]], cmap="gray_r")
        axs[i, j].axis('off')
plt.show()
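
The selection step above subtracts a single average image from a whole stack of images, which works because of NumPy broadcasting. The following is a minimal sketch of the same idea on synthetic data. The arrays `bright` and `dark` are hypothetical stand-ins for two kinds of fruit and are not the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for fruits: 6 "images" of 4x4 pixels.
# The first 3 are bright (around 200), the last 3 are dark (around 50).
bright = 200 + rng.normal(0, 5, size=(3, 4, 4))
dark = 50 + rng.normal(0, 5, size=(3, 4, 4))
images = np.concatenate([bright, dark])     # shape (6, 4, 4)

bright_mean = bright.mean(axis=0)           # average image, shape (4, 4)

# (6, 4, 4) - (4, 4): the average image is broadcast across all 6 samples.
abs_diff = np.abs(images - bright_mean)
abs_mean = abs_diff.mean(axis=(1, 2))       # one error per image, shape (6,)

# Indices of the 3 images most similar to the average bright image.
closest = np.argsort(abs_mean)[:3]
print(sorted(closest.tolist()))
```

Note that this approach needs the average image of a known group, which is exactly what is unavailable in a real unsupervised setting.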

Chapter 06-2
k-Means

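- k-means algorithm

The k-means clustering algorithm finds the mean values automatically, so they do not have to be computed from known groups as in Chapter 06-1.

Because each mean sits at the center of its cluster, it is called a cluster center or centroid.

Use the KMeans class in the sklearn.cluster module.

Set the number of clusters with the n_clusters parameter.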

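Basic mission
Explain how the k-means algorithm works

The k-means algorithm first picks k cluster centers at random.

It then finds the nearest cluster center for each sample and assigns the sample to that cluster.

Each cluster center is moved to the mean of the samples assigned to it.

The assignment and update steps repeat until the cluster centers no longer change.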
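
The steps above can be sketched in plain NumPy. This is an illustrative version under simple assumptions (random samples as initial centers, Euclidean distance), not the implementation used by scikit-learn's KMeans. The function name `kmeans_sketch` and the synthetic data are made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means loop, for illustration only."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its samples
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs of 2-D points (synthetic data).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(10, 0.5, size=(50, 2))])
centers, labels = kmeans_sketch(X, k=2)
print(centers)
```

Because the initial centers are random, which cluster receives the number 0 or 1 is arbitrary. Only the grouping itself is meaningful.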

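- Elbow method

A representative method for finding a suitable number of clusters.

Inertia is the sum of the squared distances between each sample and its cluster center. Inertia drops as k grows, and the elbow method picks the k at which the drop slows down and the curve bends.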
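
To make the definition of inertia concrete, here is a minimal sketch that computes it by hand for a toy dataset. The samples, labels, and centers are invented for this example. In scikit-learn the same quantity is exposed as the `inertia_` attribute of a fitted KMeans object.

```python
import numpy as np

# Hypothetical toy data: four 1-D samples split into two clusters.
X = np.array([[1.0], [3.0], [10.0], [14.0]])
labels = np.array([0, 0, 1, 1])          # cluster assignment of each sample
centers = np.array([[2.0], [12.0]])      # mean of each cluster

# Squared distance from every sample to the center of its own cluster.
sq_dist = np.sum((X - centers[labels]) ** 2, axis=1)
inertia = sq_dist.sum()
print(inertia)  # 1 + 1 + 4 + 4 = 10.0
```

A smaller inertia means the samples sit closer to their cluster centers, which is why it always decreases when more clusters are allowed.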

Colab practice code

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

fruits = np.load('fruits_300.npy')
fruits_2d = fruits.reshape(-1, 100*100)

# k-means algorithm
km = KMeans(n_clusters=3, random_state=42)
km.fit(fruits_2d)
print("k-means labels: ", km.labels_)
print("Number of samples per cluster: ", np.unique(km.labels_, return_counts=True))

# Helper that draws the images of a cluster
def draw_fruits(arr, ratio=1):
    n = len(arr)
    # 10 images per row, so compute how many rows are needed
    rows = int(np.ceil(n/10))
    # with a single row, use only as many columns as there are images
    cols = n if rows < 2 else 10
    fig, axs = plt.subplots(rows, cols, figsize=(cols*ratio, rows*ratio), squeeze=False)
    for i in range(rows):
        for j in range(cols):
            if i*10 + j < n:
                axs[i, j].imshow(arr[i*10 + j], cmap='gray_r')
            axs[i, j].axis('off')
    plt.show()

draw_fruits(fruits[km.labels_==0])
draw_fruits(fruits[km.labels_==1])
draw_fruits(fruits[km.labels_==2])

# Draw the three cluster centers as images
draw_fruits(km.cluster_centers_.reshape(-1, 100, 100), ratio=3)

# Distances from sample 100 to each cluster center, and its predicted cluster
print(km.transform(fruits_2d[100:101]))
print(km.predict(fruits_2d[100:101]))
draw_fruits(fruits[100:101])
print("Number of iterations run by k-means: ", km.n_iter_)

# Find the best k with the elbow method
inertia = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init='auto', random_state=42)
    km.fit(fruits_2d)
    inertia.append(km.inertia_)
plt.plot(range(2, 7), inertia)
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()

Chapter 06-3
Principal Component Analysis

- Dimensionality reduction

A way to reduce the size of the data and improve model performance by selecting a subset of features that best represents the data.

It is one of the unsupervised learning tasks.

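- Principal component analysis (PCA)

A representative dimensionality reduction algorithm.

Use the PCA class in the sklearn.decomposition module.

Set the number of principal components with the n_components parameter.

PCA finds the directions in the data along which the variance is largest. Variance measures how widely the data is spread out.

In other words, it finds vectors that represent the data well.

Such a vector is called a principal component.

A principal component vector has as many elements as the data has features.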
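
To illustrate what "the direction of largest variance" means, here is a minimal sketch on synthetic 2-D data. It uses the eigenvectors of the covariance matrix, which is one standard way to obtain principal components. It is not claimed to be how scikit-learn's PCA class is implemented, and the data is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data stretched along the direction (1, 1), plus a little noise.
t = rng.normal(0, 3, size=200)
X = np.column_stack([t, t]) + rng.normal(0, 0.3, size=(200, 2))

cov = np.cov(X, rowvar=False)              # 2x2 covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues come back in ascending order

# The eigenvector with the largest eigenvalue is the first principal component.
first_pc = eigvecs[:, -1]
print(first_pc.shape)                      # same length as the number of features
```

The resulting vector points (up to sign) along the diagonal (1, 1), and it has two elements because the data has two features.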

- Explained variance

A value that records how well a principal component represents the variance of the original data.

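- Functions used

transform(): a method that reduces the dimensionality of the data

inverse_transform(): a method that restores the original features

explained_variance_ratio_: an attribute that holds the explained variance ratio of each principal component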
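
The three items above can be mimicked in plain NumPy to show what they compute. This is a sketch under the assumption of ordinary PCA without whitening, using a singular value decomposition of the centered data. The variable names and the synthetic data are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 100 samples with 5 features, where almost all of the
# variation comes from 2 hidden directions.
latent = rng.normal(size=(100, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(0, 0.01, size=(100, 5))

mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:2]                            # keep 2 principal components

X_reduced = (X - mean) @ components.T          # like transform(): (100, 5) -> (100, 2)
X_restored = X_reduced @ components + mean     # like inverse_transform(): back to (100, 5)

# Share of the total variance captured by each component.
ratio = S**2 / np.sum(S**2)
print(X_reduced.shape, X_restored.shape)
print(ratio[:2].sum())
```

Restoring from only two components loses very little here because the remaining components carry almost no variance, which is the same reason the reconstructed fruit images stay recognizable.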

Colab practice code

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.cluster import KMeans

fruits = np.load('fruits_300.npy')
fruits_2d = fruits.reshape(-1, 100*100)

# PCA class
pca = PCA(n_components=50)
pca.fit(fruits_2d)
print("Shape of the principal components: ", pca.components_.shape)

# Helper that draws a set of images
def draw_fruits(arr, ratio=1):
    n = len(arr)
    # 10 images per row, so compute how many rows are needed
    rows = int(np.ceil(n/10))
    # with a single row, use only as many columns as there are images
    cols = n if rows < 2 else 10
    fig, axs = plt.subplots(rows, cols, figsize=(cols*ratio, rows*ratio), squeeze=False)
    for i in range(rows):
        for j in range(cols):
            if i*10 + j < n:
                axs[i, j].imshow(arr[i*10 + j], cmap='gray_r')
            axs[i, j].axis('off')
    plt.show()

draw_fruits(pca.components_.reshape(-1, 100, 100))

# Reduce the dimensionality of the data
print("Shape of the data: ", fruits_2d.shape)
fruits_pca = pca.transform(fruits_2d)
print("Shape of the reduced data: ", fruits_pca.shape)

# Restore the original features
fruits_inverse = pca.inverse_transform(fruits_pca)
print("Shape of the restored data: ", fruits_inverse.shape)
fruits_reconstruct = fruits_inverse.reshape(-1, 100, 100)
for start in [0, 100, 200]:
    draw_fruits(fruits_reconstruct[start:start+100])
    print("\n")

# Check the explained variance ratio
print("Total explained variance ratio: ", np.sum(pca.explained_variance_ratio_))
plt.plot(pca.explained_variance_ratio_)
plt.show()

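# Classify the images with a logistic regression model
lr = LogisticRegression()
# 0 = apple, 1 = pineapple, 2 = banana (100 images each)
target = np.array([0]*100 + [1]*100 + [2]*100)
scores = cross_validate(lr, fruits_2d, target)
print("Logistic regression cross-validation score: ", np.mean(scores['test_score']))
print("Logistic regression training time: ", np.mean(scores['fit_time']))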

# Logistic regression on the PCA-reduced data
scores = cross_validate(lr, fruits_pca, target)
print("Cross-validation score on PCA-reduced data: ", np.mean(scores['test_score']))
print("Training time on PCA-reduced data: ", np.mean(scores['fit_time']))

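# PCA with a variance ratio passed as n_components
# (keeps as many components as needed to explain 50% of the variance)
pca = PCA(n_components=0.5)
pca.fit(fruits_2d)
print("Number of principal components found for the given variance ratio: ", pca.n_components_)
fruits_pca = pca.transform(fruits_2d)
print("Shape of the data transformed with 2 principal components: ", fruits_pca.shape)
scores = cross_validate(lr, fruits_pca, target)
print("Cross-validation score with 2 principal components: ", np.mean(scores['test_score']))
print("Training time with 2 principal components: ", np.mean(scores['fit_time']))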

# Group the images with the k-means algorithm
km = KMeans(n_clusters=3, random_state=42)
km.fit(fruits_pca)
print("k-means labels: ", np.unique(km.labels_, return_counts=True))
for label in range(0, 3):
    draw_fruits(fruits[km.labels_ == label])
    print('\n')
for label in range(0, 3):
    data = fruits_pca[km.labels_ == label]
    plt.scatter(data[:,0], data[:,1])
# Cluster numbers are arbitrary, so check the images drawn above before trusting this label order
plt.legend(['apple', 'banana', 'pineapple'])
plt.show()
