[빅데이터분석기사 Practical Exam] Type 2: Data Analysis
The Big Data Analysis Workflow¶
- Import the required packages
- Load the data
- Explore the data
- Preprocess the data
- Prepare the analysis dataset
- Perform the analysis
- Evaluate performance and visualize
- Predict and save the prediction results
Representative techniques for supervised and unsupervised learning
Supervised learning - classification: Decision Tree (classification), KNN, Support Vector Machine (SVM), Logistic Regression, Random Forest, Artificial Neural Network
Supervised learning - regression: Linear Regression, Multiple Linear Regression, Decision Tree (regression)
Unsupervised learning: Clustering, Association Analysis, Neural Networks
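The walkthrough below covers only the supervised cases (classification and regression). For reference, a minimal sketch of the unsupervised case, assuming scikit-learn's KMeans applied to the same iris measurements used below:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the 150 x 4 numeric iris feature matrix (no labels needed)
X = load_iris().data
# Cluster into 3 groups; n_init restarts and random_state for reproducibility
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one cluster label (0-2) per sample
print(labels[:10])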
Classification¶
Example: classify the iris species (Species) using the Iris dataset
1. Import Required Packages¶
In [1]:
import numpy as np
import pandas as pd
import sklearn
# Package import for train/test dataset splitting
# Package import for the classification model
# Package import for performance evaluation
# (each of these is imported below at its point of use)
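Gathered in one place, the imports those comments refer to (each appears later in this notebook at its point of use) are:

from sklearn.model_selection import train_test_split  # train/test dataset splitting
from sklearn.tree import DecisionTreeClassifier       # classification model
from sklearn.metrics import accuracy_score            # performance evaluation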
2. Load the Data¶
In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/suetudy/BigDataAnalysisEngineer_Certification/main/Iris.csv")
3. Explore the Data¶
In [3]:
df.head()
Out[3]:
|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
In [4]:
df.tail()
Out[4]:
|     | Id  | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|-----|-----|---------------|--------------|---------------|--------------|---------|
| 145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
In [5]:
df.shape
Out[5]:
(150, 6)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             150 non-null    int64
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [7]:
df.describe(include='all').T
Out[7]:
|               | count | unique | top         | freq | mean     | std       | min | 25%   | 50%  | 75%    | max   |
|---------------|-------|--------|-------------|------|----------|-----------|-----|-------|------|--------|-------|
| Id            | 150.0 | NaN | NaN | NaN | 75.5 | 43.445368 | 1.0 | 38.25 | 75.5 | 112.75 | 150.0 |
| SepalLengthCm | 150.0 | NaN | NaN | NaN | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 |
| SepalWidthCm  | 150.0 | NaN | NaN | NaN | 3.054 | 0.433594 | 2.0 | 2.8 | 3.0 | 3.3 | 4.4 |
| PetalLengthCm | 150.0 | NaN | NaN | NaN | 3.758667 | 1.76442 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
| PetalWidthCm  | 150.0 | NaN | NaN | NaN | 1.198667 | 0.763161 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 |
| Species       | 150 | 3 | Iris-setosa | 50 | NaN | NaN | NaN | NaN | NaN | NaN |
In [8]:
# Check for missing values
df.isnull().sum()
Out[8]:
Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
In [9]:
# Check the distribution of the target values
df['Species'].value_counts()
Out[9]:
Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
4. Data Preprocessing¶
- Handle outliers and missing values
- Data encoding, unit conversion, type conversion, normalization, derived-variable creation, etc.
In [10]:
# (For a classification problem, label-encoding the target variable is optional; it does not change the result.)
# Convert the text values in the Species column to 0, 1, 2
df['Species'] = df['Species'].replace({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
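An equivalent alternative to the manual mapping above is scikit-learn's LabelEncoder, which assigns integer codes in sorted order and therefore produces the same 0/1/2 for these three species names; a sketch (use it instead of, not in addition to, the replace call above):

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# fit_transform assigns 0, 1, 2 following the sorted class names
df['Species'] = le.fit_transform(df['Species'])
print(le.classes_)  # ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']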
5. Prepare the Analysis Dataset¶
In [11]:
# Prepare the analysis dataset
# X holds the independent (explanatory) variables, y the dependent (target) variable
target = 'Species'
X = df.drop([target, 'Id'], axis=1)
y = df[target]
In [12]:
# Package import for train/test dataset splitting
from sklearn.model_selection import train_test_split
# Split into training and test datasets (commonly 8:2 or 7:3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
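For classification, the split can also preserve the class proportions in both subsets; a sketch using the stratify parameter (optional here, since the iris classes are perfectly balanced):

# Stratified variant: keeps the 50/50/50 class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)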
In [13]:
# Check the sizes of the split datasets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(120, 4)
(30, 4)
(120,)
(30,)
6. Perform the Analysis¶
In [14]:
## Decision tree (DecisionTreeClassifier)
# Package import for the decision tree classification model
from sklearn.tree import DecisionTreeClassifier
# Create a DecisionTreeClassifier object
model = DecisionTreeClassifier(random_state=0)
# Train the model
model.fit(X_train, y_train)
Out[14]:
DecisionTreeClassifier(random_state=0)
In [15]:
# RandomForest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=3, random_state=0)
rf.fit(X_train, y_train)
Out[15]:
RandomForestClassifier(n_estimators=3, random_state=0)
In [16]:
# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
Out[16]:
KNeighborsClassifier()
In [17]:
# LogisticRegression
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
lg.fit(X_train, y_train)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[17]:
LogisticRegression()
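The ConvergenceWarning printed above can be removed by raising max_iter or by scaling the features first, exactly as the message suggests; a sketch of both options:

# Option 1: allow the lbfgs solver more iterations
lg = LogisticRegression(max_iter=1000)
lg.fit(X_train, y_train)

# Option 2: standardize the features before fitting
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on the training set only
X_test_s = scaler.transform(X_test)
lg_scaled = LogisticRegression().fit(X_train_s, y_train)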
In [18]:
# XGBoost
from xgboost import XGBClassifier
xgb = XGBClassifier(n_estimators=3, random_state=0)
xgb.fit(X_train, y_train)
Out[18]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=3, n_jobs=None,
              num_parallel_tree=None, objective='multi:softprob', ...)
In [19]:
# LightGBM
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(n_estimators=3, random_state=0)
lgbm.fit(X_train, y_train)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000044 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 90
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 4
[LightGBM] [Info] Start training from score -1.123930
[LightGBM] [Info] Start training from score -1.176574
[LightGBM] [Info] Start training from score -1.003302
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
Out[19]:
LGBMClassifier(n_estimators=3, random_state=0)
7. Performance Evaluation and Visualization¶
In [20]:
# Predict on the test dataset with the trained model object
pred = model.predict(X_test)
In [21]:
# Model performance - measure accuracy
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print(acc)
1.0
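The same check can be repeated for every classifier fitted in step 6; a short sketch comparing their test-set accuracies (assumes the model, rf, knn, lg, xgb, and lgbm objects from above):

# Compare test-set accuracy across all fitted classifiers
for name, clf in [('tree', model), ('rf', rf), ('knn', knn),
                  ('logreg', lg), ('xgb', xgb), ('lgbm', lgbm)]:
    print(name, accuracy_score(y_test, clf.predict(X_test)))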
In [22]:
# Model performance - confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)
Out[22]:
array([[11,  0,  0],
       [ 0, 13,  0],
       [ 0,  0,  6]])
In [23]:
# Model performance evaluation - classification metrics
from sklearn.metrics import classification_report
rpt = classification_report(y_test, pred)
print(rpt)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00         6

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
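If the exam asks for probability predictions rather than class labels, predict_proba plus a probability-based metric applies; a sketch using roc_auc_score with multi_class='ovr' for this three-class problem:

# Probability predictions: one column per class, shape (30, 3)
from sklearn.metrics import roc_auc_score
proba = model.predict_proba(X_test)
print(roc_auc_score(y_test, proba, multi_class='ovr'))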
8. Save Prediction Results¶
In [24]:
# Predict on the test dataset with the trained model object
pred = model.predict(X_test)
In [25]:
# Save the prediction results to a CSV file
result = pd.DataFrame({'구분예측': pred})
result.to_csv('classification_result.csv', index=False)
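Before submitting, it is worth reading the file back to confirm the shape and column name match what the question requires:

# Verify the saved file by reading it back
check = pd.read_csv('classification_result.csv')
print(check.shape)  # expected: (30, 1)
print(check.head())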
Regression¶
Example: predict Boston housing prices
1. Import Required Packages¶
In [26]:
import numpy as np
import pandas as pd
import sklearn
2. Load the Data¶
In [27]:
df = pd.read_csv("https://raw.githubusercontent.com/suetudy/BigDataAnalysisEngineer_Certification/main/boston.csv")
3. Explore the Data¶
In [28]:
df.head()
Out[28]:
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
In [29]:
df.tail()
Out[29]:
|     | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|-----|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
In [30]:
df.shape
Out[30]:
(506, 14)
In [31]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
In [32]:
df.describe(include='all').T
Out[32]:
|         | count | mean       | std        | min       | 25%        | 50%       | 75%        | max      |
|---------|-------|------------|------------|-----------|------------|-----------|------------|----------|
| CRIM    | 506.0 | 3.613524   | 8.601545   | 0.00632   | 0.082045   | 0.25651   | 3.677083   | 88.9762  |
| ZN      | 506.0 | 11.363636  | 23.322453  | 0.00000   | 0.000000   | 0.00000   | 12.500000  | 100.0000 |
| INDUS   | 506.0 | 11.136779  | 6.860353   | 0.46000   | 5.190000   | 9.69000   | 18.100000  | 27.7400  |
| CHAS    | 506.0 | 0.069170   | 0.253994   | 0.00000   | 0.000000   | 0.00000   | 0.000000   | 1.0000   |
| NOX     | 506.0 | 0.554695   | 0.115878   | 0.38500   | 0.449000   | 0.53800   | 0.624000   | 0.8710   |
| RM      | 506.0 | 6.284634   | 0.702617   | 3.56100   | 5.885500   | 6.20850   | 6.623500   | 8.7800   |
| AGE     | 506.0 | 68.574901  | 28.148861  | 2.90000   | 45.025000  | 77.50000  | 94.075000  | 100.0000 |
| DIS     | 506.0 | 3.795043   | 2.105710   | 1.12960   | 2.100175   | 3.20745   | 5.188425   | 12.1265  |
| RAD     | 506.0 | 9.549407   | 8.707259   | 1.00000   | 4.000000   | 5.00000   | 24.000000  | 24.0000  |
| TAX     | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.0000 |
| PTRATIO | 506.0 | 18.455534  | 2.164946   | 12.60000  | 17.400000  | 19.05000  | 20.200000  | 22.0000  |
| B       | 506.0 | 356.674032 | 91.294864  | 0.32000   | 375.377500 | 391.44000 | 396.225000 | 396.9000 |
| LSTAT   | 506.0 | 12.653063  | 7.141062   | 1.73000   | 6.950000   | 11.36000  | 16.955000  | 37.9700  |
| MEDV    | 506.0 | 22.532806  | 9.197104   | 5.00000   | 17.025000  | 21.20000  | 25.000000  | 50.0000  |
In [33]:
# Check for missing values
df.isnull().sum()
Out[33]:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64
In [34]:
# Analyze correlations between the variables
corr = df.corr(method="pearson")
print(corr)
             CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
CRIM     1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734
ZN      -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537
INDUS    0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779
CHAS    -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518
NOX      0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470
RM      -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265
AGE      0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000
DIS     -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881
RAD      0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022
TAX      0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456
PTRATIO  0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515
B       -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534
LSTAT    0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339
MEDV    -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955

              DIS       RAD       TAX   PTRATIO         B     LSTAT      MEDV
CRIM    -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305
ZN       0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445
INDUS   -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725
CHAS    -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260
NOX     -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321
RM       0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360
AGE     -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955
DIS      1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929
RAD     -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626
TAX     -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536
PTRATIO -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787
B        0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461
LSTAT   -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663
MEDV     0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000
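A heatmap makes this matrix much easier to scan; a sketch, assuming seaborn and matplotlib are available in the environment (they may not be in the exam, so treat this as optional):

# Visualize the correlation matrix as a heatmap
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()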
4. Data Preprocessing¶
In [35]:
# # If there are missing values, drop every row that contains one
# df = df.dropna(axis=0)
In [36]:
# # Data scaling - numeric columns only
# # Separate the numeric columns
# X_train_num = X_train.select_dtypes(exclude='object')
# X_test_num = X_test.select_dtypes(exclude='object')
# from sklearn.preprocessing import MinMaxScaler
# # Create a MinMax scaler
# scaler = MinMaxScaler()
# # Scale the data: fit on the training set only, then apply the same
# # transform to the test set (refitting on test would leak information)
# X_train_scaled = scaler.fit_transform(X_train_num)
# X_test_scaled = scaler.transform(X_test_num)
# # Wrap the scaled arrays back into DataFrames
# df_train_num = pd.DataFrame(X_train_scaled, columns=X_train_num.columns)
# df_test_num = pd.DataFrame(X_test_scaled, columns=X_test_num.columns)
In [37]:
# # One-hot encoding - categorical columns only
# # Separate the categorical columns
# X_train_obj = X_train.select_dtypes(include='object')
# X_test_obj = X_test.select_dtypes(include='object')
# # One-hot encode train and test separately
# df_train_obj = pd.get_dummies(X_train_obj)
# df_test_obj = pd.get_dummies(X_test_obj)
In [38]:
# # Combine the scaled numeric and one-hot-encoded categorical data
# df_train = pd.concat([df_train_num, df_train_obj], axis=1)
# df_test = pd.concat([df_test_num, df_test_obj], axis=1)
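One caveat with this template: calling get_dummies on train and test separately can produce different column sets when a category value occurs in only one split. A minimal sketch of one way to align them, kept commented out like the template above:

# # Align the one-hot columns of test to those of train:
# # train-only categories become zero-filled columns in test,
# # and test-only categories are dropped
# df_test_obj = df_test_obj.reindex(columns=df_train_obj.columns, fill_value=0)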
5. Prepare the Analysis Dataset¶
In [39]:
target = 'MEDV'
X = df.drop(target, axis=1)
y = df[target]
In [40]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
6. Perform the Analysis¶
In [41]:
## Decision tree (DecisionTreeRegressor)
# Package import for the decision tree regression model
from sklearn.tree import DecisionTreeRegressor
# Create a DecisionTreeRegressor object
model = DecisionTreeRegressor(random_state=0)
# Train the model
model.fit(X_train, y_train)
Out[41]:
DecisionTreeRegressor(random_state=0)
In [42]:
# LinearRegression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
Out[42]:
LinearRegression()
In [43]:
# RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(max_depth=3, random_state=0)
rfr.fit(X_train, y_train)
Out[43]:
RandomForestRegressor(max_depth=3, random_state=0)
In [44]:
# XGBRegressor
from xgboost import XGBRegressor
xgbr = XGBRegressor(random_state=0)
xgbr.fit(X_train, y_train)
Out[44]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=0, ...)
7. Performance Evaluation and Visualization¶
In [45]:
# Predict on the test dataset with the trained model object
pred = model.predict(X_test)
In [46]:
# Model performance evaluation - coefficient of determination (R² score)
from sklearn.metrics import r2_score
score = r2_score(y_test, pred)
print(score)
0.6019035496385025
In [47]:
# Model performance evaluation - MAE (Mean Absolute Error)
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, pred)
print(mae)
3.5107843137254893
In [48]:
# Model performance evaluation - MSE (Mean Squared Error)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, pred)
print(mse)
32.41637254901961
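RMSE, when requested, is just the square root of the MSE computed above; a sketch:

# Model performance evaluation - RMSE (Root Mean Squared Error)
import numpy as np
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(rmse)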
8. Save Prediction Results¶
In [49]:
# Save the prediction results to a CSV file
result = pd.DataFrame({'예측': pred})
result.to_csv('regression_result.csv', index=True)