[빅데이터분석기사 Practical Exam] Type 2: Data Analysis
The Big Data Analysis Workflow¶
- Import the required packages
- Load the data
- Explore the data
- Preprocess the data
- Prepare the analysis dataset
- Perform the analysis
- Evaluate performance and visualize
- Predict and save the prediction results
Representative techniques for supervised and unsupervised learning
Supervised learning - classification: Decision Tree (classification), KNN, Support Vector Machine (SVM), Logistic Regression, Random Forest, Artificial Neural Network
Supervised learning - regression: Linear Regression, Multiple Linear Regression, Decision Tree (regression)
Unsupervised learning: Clustering, Association Analysis, Neural Networks
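The walkthrough below covers only the supervised cases (classification and regression). For reference, a minimal sketch of the unsupervised case, assuming scikit-learn's KMeans applied to the same iris measurements used below:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the 150 x 4 numeric iris feature matrix (no labels needed)
X = load_iris().data
# Cluster into 3 groups; n_init restarts and random_state for reproducibility
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one cluster label (0-2) per sample
print(labels[:10])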
Classification¶
Example: classify the iris species (Species) using the Iris dataset
1. Import Required Packages¶
In [1]:
import numpy as np
import pandas as pd
import sklearn
# Package import for train/test dataset splitting
# Package import for the classification model
# Package import for performance evaluation
# (each of these is imported below at its point of use)
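Gathered in one place, the imports those comments refer to (each appears later in this notebook at its point of use) are:

from sklearn.model_selection import train_test_split  # train/test dataset splitting
from sklearn.tree import DecisionTreeClassifier       # classification model
from sklearn.metrics import accuracy_score            # performance evaluation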
2. Load the Data¶
In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/suetudy/BigDataAnalysisEngineer_Certification/main/Iris.csv")
3. Explore the Data¶
In [3]:
df.head()
Out[3]:
|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
In [4]:
df.tail()
Out[4]:
|     | Id  | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|-----|-----|---------------|--------------|---------------|--------------|---------|
| 145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
In [5]:
df.shape
Out[5]:
(150, 6)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             150 non-null    int64
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
In [7]:
df.describe(include='all').T
Out[7]:
|               | count | unique | top         | freq | mean     | std       | min | 25%   | 50%  | 75%    | max   |
|---------------|-------|--------|-------------|------|----------|-----------|-----|-------|------|--------|-------|
| Id            | 150.0 | NaN | NaN | NaN | 75.5 | 43.445368 | 1.0 | 38.25 | 75.5 | 112.75 | 150.0 |
| SepalLengthCm | 150.0 | NaN | NaN | NaN | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 |
| SepalWidthCm  | 150.0 | NaN | NaN | NaN | 3.054 | 0.433594 | 2.0 | 2.8 | 3.0 | 3.3 | 4.4 |
| PetalLengthCm | 150.0 | NaN | NaN | NaN | 3.758667 | 1.76442 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
| PetalWidthCm  | 150.0 | NaN | NaN | NaN | 1.198667 | 0.763161 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 |
| Species       | 150 | 3 | Iris-setosa | 50 | NaN | NaN | NaN | NaN | NaN | NaN |
In [8]:
# Check for missing values
df.isnull().sum()
Out[8]:
Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
In [9]:
# Check the distribution of the target values
df['Species'].value_counts()
Out[9]:
Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
4. Data Preprocessing¶
- Handle outliers and missing values
- Data encoding, unit conversion, type conversion, normalization, derived-variable creation, etc.
In [10]:
# (For a classification problem, label-encoding the target variable is optional; it does not change the result.)
# Convert the text values in the Species column to 0, 1, 2
df['Species'] = df['Species'].replace({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
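An equivalent alternative to the manual mapping above is scikit-learn's LabelEncoder, which assigns integer codes in sorted order and therefore produces the same 0/1/2 for these three species names; a sketch (use it instead of, not in addition to, the replace call above):

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# fit_transform assigns 0, 1, 2 following the sorted class names
df['Species'] = le.fit_transform(df['Species'])
print(le.classes_)  # ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']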
5. Prepare the Analysis Dataset¶
In [11]:
# Prepare the analysis dataset
# X holds the independent (explanatory) variables, y the dependent (target) variable
target = 'Species'
X = df.drop([target, 'Id'], axis=1)
y = df[target]
In [12]:
# Package import for train/test dataset splitting
from sklearn.model_selection import train_test_split
# Split into training and test datasets (commonly 8:2 or 7:3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
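For classification, the split can also preserve the class proportions in both subsets; a sketch using the stratify parameter (optional here, since the iris classes are perfectly balanced):

# Stratified variant: keeps the 50/50/50 class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)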
In [13]:
# Check the sizes of the split datasets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(120, 4)
(30, 4)
(120,)
(30,)
6. Perform the Analysis¶
In [14]:
## Decision tree (DecisionTreeClassifier)
# Package import for the decision tree classification model
from sklearn.tree import DecisionTreeClassifier
# Create a DecisionTreeClassifier object
model = DecisionTreeClassifier(random_state=0)
# Train the model
model.fit(X_train, y_train)
Out[14]:
DecisionTreeClassifier(random_state=0)
In [15]:
# RandomForest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=3, random_state=0)
rf.fit(X_train, y_train)
Out[15]:
RandomForestClassifier(n_estimators=3, random_state=0)
In [16]:
# KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
Out[16]:
KNeighborsClassifier()
In [17]:
# LogisticRegression
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
lg.fit(X_train, y_train)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[17]:
LogisticRegression()
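The ConvergenceWarning printed above can be removed by raising max_iter or by scaling the features first, exactly as the message suggests; a sketch of both options:

# Option 1: allow the lbfgs solver more iterations
lg = LogisticRegression(max_iter=1000)
lg.fit(X_train, y_train)

# Option 2: standardize the features before fitting
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on the training set only
X_test_s = scaler.transform(X_test)
lg_scaled = LogisticRegression().fit(X_train_s, y_train)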
In [18]:
# XGBoost
from xgboost import XGBClassifier
xgb = XGBClassifier(n_estimators=3, random_state=0)
xgb.fit(X_train, y_train)
Out[18]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=3, n_jobs=None,
              num_parallel_tree=None, objective='multi:softprob', ...)
In [19]:
# LightGBM
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(n_estimators=3, random_state=0)
lgbm.fit(X_train, y_train)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000044 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 90
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 4
[LightGBM] [Info] Start training from score -1.123930
[LightGBM] [Info] Start training from score -1.176574
[LightGBM] [Info] Start training from score -1.003302
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
Out[19]:
LGBMClassifier(n_estimators=3, random_state=0)
7. Performance Evaluation and Visualization¶
In [20]:
# Predict on the test dataset with the trained model object
pred = model.predict(X_test)
In [21]:
# Model performance - measure accuracy
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print(acc)
1.0
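The same check can be repeated for every classifier fitted in step 6; a short sketch comparing their test-set accuracies (assumes the model, rf, knn, lg, xgb, and lgbm objects from above):

# Compare test-set accuracy across all fitted classifiers
for name, clf in [('tree', model), ('rf', rf), ('knn', knn),
                  ('logreg', lg), ('xgb', xgb), ('lgbm', lgbm)]:
    print(name, accuracy_score(y_test, clf.predict(X_test)))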
In [22]:
# Model performance - confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)
Out[22]:
array([[11,  0,  0],
       [ 0, 13,  0],
       [ 0,  0,  6]])
In [23]:
# Model performance evaluation - classification metrics
from sklearn.metrics import classification_report
rpt = classification_report(y_test, pred)
print(rpt)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00         6

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
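If the exam asks for probability predictions rather than class labels, predict_proba plus a probability-based metric applies; a sketch using roc_auc_score with multi_class='ovr' for this three-class problem:

# Probability predictions: one column per class, shape (30, 3)
from sklearn.metrics import roc_auc_score
proba = model.predict_proba(X_test)
print(roc_auc_score(y_test, proba, multi_class='ovr'))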
8. Save Prediction Results¶
In [24]:
# Predict on the test dataset with the trained model object
pred = model.predict(X_test)
In [25]:
# Save the prediction results to a CSV file
result = pd.DataFrame({'구분예측': pred})
result.to_csv('classification_result.csv', index=False)
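Before submitting, it is worth reading the file back to confirm the shape and column name match what the question requires:

# Verify the saved file by reading it back
check = pd.read_csv('classification_result.csv')
print(check.shape)  # expected: (30, 1)
print(check.head())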
Regression¶
Example: predict Boston housing prices
1. Import Required Packages¶
In [26]:
import numpy as np
import pandas as pd
import sklearn
2. Load the Data¶
In [27]:
df = pd.read_csv("https://raw.githubusercontent.com/suetudy/BigDataAnalysisEngineer_Certification/main/boston.csv")
3. Explore the Data¶
In [28]:
df.head()
Out[28]:
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
In [29]:
df.tail()
Out[29]:
|     | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|-----|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
In [30]:
df.shape
Out[30]:
(506, 14)
In [31]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
In [32]:
df.describe(include='all').T
Out[32]:
|         | count | mean       | std        | min       | 25%        | 50%       | 75%        | max      |
|---------|-------|------------|------------|-----------|------------|-----------|------------|----------|
| CRIM    | 506.0 | 3.613524   | 8.601545   | 0.00632   | 0.082045   | 0.25651   | 3.677083   | 88.9762  |
| ZN      | 506.0 | 11.363636  | 23.322453  | 0.00000   | 0.000000   | 0.00000   | 12.500000  | 100.0000 |
| INDUS   | 506.0 | 11.136779  | 6.860353   | 0.46000   | 5.190000   | 9.69000   | 18.100000  | 27.7400  |
| CHAS    | 506.0 | 0.069170   | 0.253994   | 0.00000   | 0.000000   | 0.00000   | 0.000000   | 1.0000   |
| NOX     | 506.0 | 0.554695   | 0.115878   | 0.38500   | 0.449000   | 0.53800   | 0.624000   | 0.8710   |
| RM      | 506.0 | 6.284634   | 0.702617   | 3.56100   | 5.885500   | 6.20850   | 6.623500   | 8.7800   |
| AGE     | 506.0 | 68.574901  | 28.148861  | 2.90000   | 45.025000  | 77.50000  | 94.075000  | 100.0000 |
| DIS     | 506.0 | 3.795043   | 2.105710   | 1.12960   | 2.100175   | 3.20745   | 5.188425   | 12.1265  |
| RAD     | 506.0 | 9.549407   | 8.707259   | 1.00000   | 4.000000   | 5.00000   | 24.000000  | 24.0000  |
| TAX     | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.0000 |
| PTRATIO | 506.0 | 18.455534  | 2.164946   | 12.60000  | 17.400000  | 19.05000  | 20.200000  | 22.0000  |
| B       | 506.0 | 356.674032 | 91.294864  | 0.32000   | 375.377500 | 391.44000 | 396.225000 | 396.9000 |
| LSTAT   | 506.0 | 12.653063  | 7.141062   | 1.73000   | 6.950000   | 11.36000  | 16.955000  | 37.9700  |
| MEDV    | 506.0 | 22.532806  | 9.197104   | 5.00000   | 17.025000  | 21.20000  | 25.000000  | 50.0000  |
In [33]:
# Check for missing values
df.isnull().sum()
Out[33]:
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64
In [34]:
# Analyze correlations between the variables
corr = df.corr(method="pearson")
print(corr)
             CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
CRIM     1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734
ZN      -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537
INDUS    0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779
CHAS    -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518
NOX      0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470
RM      -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265
AGE      0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000
DIS     -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881
RAD      0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022
TAX      0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456
PTRATIO  0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515
B       -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534
LSTAT    0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339
MEDV    -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955

              DIS       RAD       TAX   PTRATIO         B     LSTAT      MEDV
CRIM    -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305
ZN       0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445
INDUS   -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725
CHAS    -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260
NOX     -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321
RM       0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360
AGE     -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955
DIS      1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929
RAD     -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626
TAX     -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536
PTRATIO -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787
B        0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461
LSTAT   -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663
MEDV     0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000
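A heatmap makes this matrix much easier to scan; a sketch, assuming seaborn and matplotlib are available in the environment (they may not be in the exam, so treat this as optional):

# Visualize the correlation matrix as a heatmap
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()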
4. Data Preprocessing¶
In [35]:
# # If there are missing values, drop every row that contains one
# df = df.dropna(axis=0)
In [36]:
# # Data scaling - numeric columns only
# # Separate the numeric columns
# X_train_num = X_train.select_dtypes(exclude='object')
# X_test_num = X_test.select_dtypes(exclude='object')
# from sklearn.preprocessing import MinMaxScaler
# # Create a MinMax scaler
# scaler = MinMaxScaler()
# # Scale the data: fit on the training set only, then apply the same
# # transform to the test set (refitting on test would leak information)
# X_train_scaled = scaler.fit_transform(X_train_num)
# X_test_scaled = scaler.transform(X_test_num)
# # Wrap the scaled arrays back into DataFrames
# df_train_num = pd.DataFrame(X_train_scaled, columns=X_train_num.columns)
# df_test_num = pd.DataFrame(X_test_scaled, columns=X_test_num.columns)
In [37]:
# # One-hot encoding - categorical columns only
# # Separate the categorical columns
# X_train_obj = X_train.select_dtypes(include='object')
# X_test_obj = X_test.select_dtypes(include='object')
# # One-hot encode train and test separately
# df_train_obj = pd.get_dummies(X_train_obj)
# df_test_obj = pd.get_dummies(X_test_obj)
In [38]:
# # Combine the scaled numeric and one-hot-encoded categorical data
# df_train = pd.concat([df_train_num, df_train_obj], axis=1)
# df_test = pd.concat([df_test_num, df_test_obj], axis=1)
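One caveat with this template: calling get_dummies on train and test separately can produce different column sets when a category value occurs in only one split. A minimal sketch of one way to align them, kept commented out like the template above:

# # Align the one-hot columns of test to those of train:
# # train-only categories become zero-filled columns in test,
# # and test-only categories are dropped
# df_test_obj = df_test_obj.reindex(columns=df_train_obj.columns, fill_value=0)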
5. Prepare the Analysis Dataset¶
In [39]:
target = 'MEDV'
X = df.drop(target, axis=1)
y = df[target]
In [40]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
6. Perform the Analysis¶
In [41]:
## Decision tree (DecisionTreeRegressor)
# Package import for the decision tree regression model
from sklearn.tree import DecisionTreeRegressor
# Create a DecisionTreeRegressor object
model = DecisionTreeRegressor(random_state=0)
# Train the model
model.fit(X_train, y_train)
Out[41]:
DecisionTreeRegressor(random_state=0)
In [42]:
# LinearRegression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
Out[42]:
LinearRegression()
In [43]:
# RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(max_depth=3, random_state=0)
rfr.fit(X_train, y_train)
Out[43]:
RandomForestRegressor(max_depth=3, random_state=0)
In [44]:
# XGBRegressor
from xgboost import XGBRegressor
xgbr = XGBRegressor(random_state=0)
xgbr.fit(X_train, y_train)
Out[44]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=0, ...)
7. Performance Evaluation and Visualization¶
In [45]:
# Predict on the test dataset with the trained model object
pred = model.predict(X_test)
In [46]:
# Model performance evaluation - coefficient of determination (R² score)
from sklearn.metrics import r2_score
score = r2_score(y_test, pred)
print(score)
0.6019035496385025
In [47]:
# Model performance evaluation - MAE (Mean Absolute Error)
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, pred)
print(mae)
3.5107843137254893
In [48]:
# Model performance evaluation - MSE (Mean Squared Error)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, pred)
print(mse)
32.41637254901961
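RMSE, when requested, is just the square root of the MSE computed above; a sketch:

# Model performance evaluation - RMSE (Root Mean Squared Error)
import numpy as np
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(rmse)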
8. Save Prediction Results¶
In [49]:
# Save the prediction results to a CSV file
result = pd.DataFrame({'예측': pred})
result.to_csv('regression_result.csv', index=True)