A Data Modeling Workflow in R

Intro

While recently reorganizing my data analysis workflow I came across some code I wrote a while ago, so I'm sharing it here. It comes from a project I did as a student, and inexperience caused a few problems at the time; I'll point them out one by one along the way so you can avoid the same pitfalls. The post includes some commentary, and where things are unclear, feel free to leave a comment.

Besides sharing the code, this is also a retrospective on my own old project. It's written in R (still my favorite language); if there's demand for a Python version, I'll cover that in a separate project later.

The post covers data import, cleaning, visualization, feature engineering, and modeling, so feel free to jump to whichever part you need.

Project Background

The data come from the Online Shoppers' Intention dataset: 12,330 rows, with 10 numeric features and 8 categorical features. 'Revenue' is used as the label for modeling. The end goal is to build a model that predicts Revenue from these data.

Setup

First you'll need to download R and its more comfortable companion, RStudio. How? I'll just paste the instructions from my earlier article, haha.

Installing R and RStudio
If you've used R before, feel free to skip this part.
Installing R is simple: just download it from the official website.

Then download RStudio, which is essentially R with superpowers: the interface is far friendlier than plain R and it ships with plenty of helper features. The download link is on the RStudio site.

# Note: RStudio sits on top of R, so you need to install R before you can install and use RStudio.

Once everything is installed, run the following code to load the packages.

setwd("~/Desktop/STAT5003/Ass") #選擇項目存放的位置,同樣這也是你數據csv存放的位置
# install.packages("xxx") 如果之前沒有裝過以下的包,先用這句話來裝包,然後再去load
# the following packages are for the EDA part
library(GGally)
library(ggcorrplot)
library(psych)
library(ggstatsplot)
library(ggplot2)
library(grid)
# the following packages are for the Model part
library(MASS)
library(Boruta)  # Feature selection with the Boruta algorithm
library(caret)
library(MLmetrics)
library(class)
library(neuralnet)
library(e1071)
library(randomForest)
library(keras)

Quite a few packages to load. For installing keras, see my earlier article (R語言基於Keras的MLP神經網絡詳解,
https://blog.csdn.net/zhaotian151/article/details/84305440).

Data Description

First, download the dataset to your computer, then import it into R with the code below.

dataset <- read.csv("online_shoppers_intention.csv")
str(dataset)

The str() function shows the type of every column; the output looks like this:

The columns come as a mix of int, num, factor, and so on. To make the analysis and modeling easier later on, we first unify the data types.

dataset$OperatingSystems <- as.factor(dataset$OperatingSystems)
dataset$Browser <- as.factor(dataset$Browser)
dataset$Region <- as.factor(dataset$Region)
dataset$TrafficType <- as.factor(dataset$TrafficType)
dataset$Weekend <- as.factor(dataset$Weekend)
dataset$Revenue <- as.factor(dataset$Revenue)
dataset$Administrative <- as.numeric(dataset$Administrative)
dataset$Informational <- as.numeric(dataset$Informational)
dataset$ProductRelated <- as.numeric(dataset$ProductRelated)
summary(dataset)
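As a quick extra check alongside summary(), you can also list the resulting column types programmatically; a small sketch:

sapply(dataset, class)          # class of every column
table(sapply(dataset, class))   # counts per type (ideally just factor and numeric now)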

Now the types are essentially unified into factor and numeric, which makes everything downstream easier. Some R packages (especially the modeling ones) have strict requirements on input formats, so sorting this out early really matters: it keeps the data tidy and the code clean and pleasant.
Remember to run summary() once the types are in order; it often surfaces small problems in the data, such as the 'Administrative_Duration' issue below.

A minimum of -1 is clearly absurd (and it's one of the little pitfalls here): we know a duration can't be negative. But that's only our intuition. Since we don't know how this field was defined when the data were collected, we'll never know where the -1 comes from. This is also why I recommend following a data analysis project from the very start, so that you know the data inside out, rather than judging right and wrong purely on intuition as we do here (even though intuition is usually fine).
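Before deciding what to do with those values, it's worth counting how many rows are actually affected; a quick sketch:

# how many negative duration values are there?
sum(dataset$Administrative_Duration < 0, na.rm = TRUE)
sum(dataset$Informational_Duration < 0, na.rm = TRUE)
sum(dataset$ProductRelated_Duration < 0, na.rm = TRUE)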

The variable descriptions are listed below; read them or skip them as you like (but when it's your own project, you absolutely must read the data dictionary carefully).

Variables are described as follows:
Administrative : Administrative Value
Administrative_Duration : Duration in Administrative Page
Informational : Informational Value
Informational_Duration : Duration in Informational Page
ProductRelated : Product Related Value
ProductRelated_Duration : Duration in Product Related Page
BounceRates : Bounce Rates of a web page
ExitRates : Exit rate of a web page
PageValues : Page values of each web page
SpecialDay : Special days like Valentine's Day, etc.
Month : Month of the year
OperatingSystems : Operating system used
Browser : Browser used
Region : Region of the user
TrafficType : Traffic Type
VisitorType : Types of Visitor
Weekend : Weekend or not
Revenue : Revenue will be generated or not

Data Cleaning

The summary in the previous section already showed durations below 0, so we set every negative duration value to NA, then compute the NA rate to decide whether to impute those values or simply delete the rows. My view: if the missing rate is tiny, just delete them; but if your dataset is small to begin with, impute instead, because once the data get too sparse there isn't much left to analyze. What counts as "small" is a judgment call; if people are interested I may write a separate post about it.

dataset$Administrative_Duration[dataset$Administrative_Duration < 0] = NA
dataset$Informational_Duration[dataset$Informational_Duration < 0] = NA
dataset$ProductRelated_Duration[dataset$ProductRelated_Duration < 0] = NA
missing.rate <- 1 - nrow(na.omit(dataset))/nrow(dataset)
paste("missing rate =", missing.rate * 100, "%")

"missing rate = 0.381184103811838 %"還挺小的,所以直接刪掉有問題的數據。

dataset <- na.omit(dataset)
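For reference, had the missing rate been large (or the dataset small, as discussed above), imputation would be the alternative; a minimal sketch filling NA durations with the median of the observed values:

# imputation alternative (not used here, since the affected rows were simply dropped)
impute_median <- function(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
}
dur_cols <- c("Administrative_Duration", "Informational_Duration", "ProductRelated_Duration")
# dataset[dur_cols] <- lapply(dataset[dur_cols], impute_median)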

Then remember to run summary() again to check that everything was removed cleanly.
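On top of eyeballing summary(), a one-line assertion (a small sketch) gives a hard check:

stopifnot(sum(is.na(dataset)) == 0)  # fails loudly if any NA slipped through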

Exploratory Analysis and Preprocessing

Numeric Variables

Below are three views: boxplots, ggpairs, and a correlation matrix. Boxplots show the overall distribution of each variable. The pairwise plot from ggpairs shows the distributions together with the pairwise relationships. The correlation matrix focuses on the correlation coefficients and whether the correlations are significant. Each has its own emphasis; use whichever fits your needs.

par(mfrow = c(2, 5))  # arrange the plots in a 2 x 5 grid on a single figure
boxplot(dataset$Administrative, main = "Administrative")
boxplot(dataset$Administrative_Duration, main = "Administrative_Duration")
boxplot(dataset$Informational, main = "Informational")
boxplot(dataset$Informational_Duration, main = "Informational_Duration")
boxplot(dataset$ProductRelated, main = "ProductRelated")
boxplot(dataset$ProductRelated_Duration, main = "ProductRelated_Duration")
boxplot(dataset$BounceRates, main = "BounceRates")
boxplot(dataset$ExitRates, main = "ExitRates")
boxplot(dataset$PageValues, main = "PageValues")
boxplot(dataset$SpecialDay, main = "SpecialDay")

ggpairs(dataset[, c(1:10)])

corr = cor(dataset[, c(1:10)])
p.mat <- cor_pmat(dataset[, c(1:10)], use = "complete", method = "pearson")
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, p.mat = p.mat, 
    insig = "blank")

Categorical Variables

For categorical variables we mainly care about the distributions, so plain bar plots are enough. The code below uses ggplot2, a very handy visualization package. The grid.newpage() part simply places all the plots on one page, so that when you export the figure or render it in a markdown document everything appears on a single, tidy page.

p1 <- ggplot(dataset, aes(x = SpecialDay)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p2 <- ggplot(dataset, aes(x = Month)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p3 <- ggplot(dataset, aes(x = OperatingSystems)) + geom_bar(fill = "#CF6A1A", 
    colour = "black") + theme_bw()
p4 <- ggplot(dataset, aes(x = Browser)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p5 <- ggplot(dataset, aes(x = Region)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p6 <- ggplot(dataset, aes(x = TrafficType)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p7 <- ggplot(dataset, aes(x = VisitorType)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p8 <- ggplot(dataset, aes(x = Weekend)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p9 <- ggplot(dataset, aes(x = Revenue)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
grid.newpage()
pushViewport(viewport(layout = grid.layout(4, 3, heights = unit(c(1, 3, 3, 3), 
    "null"))))
grid.text("Bar Plot of All Categorical Feature", vp = viewport(layout.pos.row = 1, 
    layout.pos.col = 1:3))
vplayout = function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
print(p1, vp = vplayout(2, 1))
print(p2, vp = vplayout(2, 2))
print(p3, vp = vplayout(2, 3))
print(p4, vp = vplayout(3, 1))
print(p5, vp = vplayout(3, 2))
print(p6, vp = vplayout(3, 3))
print(p7, vp = vplayout(4, 1))
print(p8, vp = vplayout(4, 2))
print(p9, vp = vplayout(4, 3))

As you can see, the data are quite skewed, and Revenue, the label we want to predict, is itself very imbalanced (the FALSE and TRUE classes are far from equal). Keep this in mind when preparing the data and choosing a model. I won't go into detail here; there's a lot to say about imbalanced data, maybe another time.
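It's also worth putting a number on that imbalance; a quick sketch:

# how imbalanced is the label?
table(dataset$Revenue)
round(prop.table(table(dataset$Revenue)), 3)  # in the published dataset, roughly 85% FALSE vs 15% TRUE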

For an ordinary project the exploratory analysis could stop here: we've looked at the distribution of every variable and have a direct feel for the data. But let's not stop there; let's look at each categorical variable a bit more carefully.

Start with SpecialDay. In the raw data it takes values like 0, 0.2, 0.4, representing closeness to a holiday; for example 1 is the holiday itself and 0.2 is a few days before it (as far as I remember). That's rather confusing, and I don't know exactly how the values were assigned (again, this is why deep familiarity with your project helps: if you know the data well, many of these analysis steps can be skipped). So we'll let the data tell us how SpecialDay should enter the model.

special_day_check <- dataset[, c(10, 18)]
special_day_check$Revenue <- ifelse(special_day_check$Revenue == "FALSE", 0, 
    1)
special_day_check$SpecialDay[special_day_check$SpecialDay == 0] = NA
special_day_check <- na.omit(special_day_check)
special_day_glm <- glm(Revenue ~ SpecialDay, data = special_day_check, family = binomial(link = "logit"))
summary(special_day_glm)
## 
## Call:
## glm(formula = Revenue ~ SpecialDay, family = binomial(link = "logit"), 
##     data = special_day_check)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3961  -0.3756  -0.3560  -0.3374   2.4491  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3954     0.2986  -8.021 1.05e-15 ***
## SpecialDay   -0.5524     0.4764  -1.159    0.246    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 578.11  on 1247  degrees of freedom
## Residual deviance: 576.77  on 1246  degrees of freedom
## AIC: 580.77
## 
## Number of Fisher Scoring iterations: 5

First, we check whether SpecialDay should be a numeric variable. We fit a glm of the form revenue = a + b * SpecialDay and find that the p-value for SpecialDay is 0.246 (> 0.05), so as a numeric variable SpecialDay has no significant effect on Revenue; it can therefore be treated as a categorical variable instead.

Now analyze it as a categorical variable with the ggbarstats() function. ggstatsplot is an extension of ggplot2 that produces good-looking plots with the statistical results printed on them, including the details of the tests, which makes it very useful for anyone who regularly does statistical analysis.

ggbarstats(data = dataset, main = Revenue, condition = SpecialDay, sampling.plan = "jointMulti", 
    title = "Revenue by Special Days", xlab = "Special Days", perc.k = 0.5, 
    x.axis.orientation = "slant", ggstatsplot.layer = FALSE, messages = FALSE)

This function draws a bar chart for the categorical variable. The upper part of the annotation ($\chi^2_{\mathrm{Pearson}}$, $p$, $V_{\mathrm{Cramer}}$, etc.) reports the frequentist statistics, and the lower part ($\log_e(BF_{01})$, etc.) reports the Bayesian ones.

In this project we mainly care about the p-value. Here p < 0.001 and every bar is marked ***, which is highly significant, so we can settle on using SpecialDay as a categorical variable.
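The conversion itself isn't shown in the original code, which leaves it implicit; if you want SpecialDay stored as a factor from this point on, it's a one-liner (a sketch, assuming you keep the variable at this stage):

dataset$SpecialDay <- as.factor(dataset$SpecialDay)  # treat SpecialDay as categorical from here on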

Then repeat this check for every categorical variable. I won't walk through each one; here is a representative example.

Let's look at the ggbarstats() output for OperatingSystems.

ggbarstats(data = dataset, main = Revenue, condition = OperatingSystems, sampling.plan = "jointMulti", 
    title = "Revenue by Different Operating Systems", xlab = "Operating Systems", 
    perc.k = 0.5, x.axis.orientation = "slant", ggstatsplot.layer = FALSE, messages = FALSE)

The overall p < 0.001, but some subcategories have so few samples that their bars are marked ns. Data that sparse carries little statistical value, so we merge the small subcategories into one and look again.

dataset$OperatingSystems <- as.integer(dataset$OperatingSystems)
dataset$OperatingSystems[dataset$OperatingSystems == "5"] <- "other"
dataset$OperatingSystems[dataset$OperatingSystems == "6"] <- "other"
dataset$OperatingSystems[dataset$OperatingSystems == "7"] <- "other"
dataset$OperatingSystems <- as.factor(dataset$OperatingSystems)
ggbarstats(data = dataset, main = Revenue, condition = OperatingSystems, sampling.plan = "jointMulti", 
    title = "Revenue by Different Operating Systems", xlab = "Operating Systems", 
    perc.k = 0.5, x.axis.orientation = "slant", ggstatsplot.layer = FALSE, messages = FALSE)

That looks much better now: everything is significant.
That concludes the preprocessing and exploratory analysis.

Feature Selection

The point of feature engineering is ultimately to improve model performance: if you have very few features you might create second- or third-order features to enrich the data, and if you have too many you might reduce the dimensionality. I didn't do much feature engineering here, just a basic screening to drop the useless features. The logic: first use PCA to see how many features to keep, then use the Boruta algorithm and stepAIC to pick them.
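One note before the code: from here on the snippets refer to a data frame called os, which is never created explicitly above; it appears to simply be the cleaned dataset after the recoding steps, so the following assumption is needed for the code to run as shown:

# assumption: 'os' is the processed dataset from the sections above
os <- dataset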

# PCA: since PCA only works on numeric data, we use os[, c(1:9)]
pcdata <- os[, c(1:9)]
pclabel <- ifelse(os$Revenue == "TRUE", "red", "blue")
pc <- princomp(os[, c(1:9)], cor = TRUE, scores = TRUE)
summary(pc)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.8387377 1.2923744 1.0134790 1.0020214 0.9697619
## Proportion of Variance 0.3756618 0.1855813 0.1141266 0.1115608 0.1044931
## Cumulative Proportion  0.3756618 0.5612431 0.6753697 0.7869305 0.8914236
##                            Comp.6     Comp.7    Comp.8      Comp.9
## Standard deviation     0.65008195 0.59319914 0.3510795 0.281849096
## Proportion of Variance 0.04695628 0.03909836 0.0136952 0.008826546
## Cumulative Proportion  0.93837989 0.97747825 0.9911735 1.000000000
plot(pc, type = "lines")

The PCA shows that keeping 7 numeric variables already retains over 95% of the variance, so from here on we aim to keep at least 7 numeric variables.
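If you'd rather extract the 95% threshold programmatically instead of reading it off the printed summary, a small sketch:

# cumulative proportion of variance, computed from the princomp object
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
round(cum_var, 4)
which(cum_var >= 0.95)[1]  # smallest number of components that reaches 95% (7 here)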

Boruta Algorithm

set.seed(123)
boruta.train <- Boruta(Revenue ~ ., data = os, doTrace = 2, maxRuns = 15)
print(boruta.train)
# Boruta performed 14 iterations in 3.920271 mins.  13 attributes confirmed
# important: Administrative, Administrative_Duration, BounceRates, Browser,
# ExitRates and 8 more; 1 attributes confirmed unimportant: SpecialDay; 2
# tentative attributes left: OperatingSystems, Weekend. So SpecialDay can be
# dropped when we fit the model; OperatingSystems and Weekend need to be
# checked in other ways.

StepAIC

full.model <- glm(Revenue ~ . - SpecialDay, data = os, family = binomial(link = "logit"))

# Backward Stepwise AIC
stepback <- stepAIC(full.model, direction = "backward", steps = 3)
summary(stepback)

# OperatingSystems and Weekend both sit above <none>; combined with the earlier
# Boruta result, they can be dropped when we fit the model. Browser has the
# minimum AIC, so it can be dropped as well. PCA says we should keep 7 numeric
# variables when fitting the model, so two numeric variables have to go;
# Informational_Duration and Administrative have the minimum AIC among the
# numeric variables, so remove those two.

Combining the three feature-selection methods above: SpecialDay, OperatingSystems, Weekend, Browser, Informational_Duration, and Administrative should be removed when we build the models. Feel free to run the code above yourself; since it takes a while, the results are pasted directly into the code blocks as comments.
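If you prefer, the drop can also be written by name rather than by position (positional indices are easy to get wrong and can silently drift if the column order changes); a sketch assuming the six variables just listed are the ones to remove:

# name-based alternative to the index-based subsetting used in the next section
drop_cols <- c("SpecialDay", "OperatingSystems", "Weekend", "Browser",
    "Informational_Duration", "Administrative")
os_modeldata <- os[, setdiff(names(os), drop_cols)]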

Modeling

Now assemble the data used for modeling and get ready to build the models.

os_modeldata <- os[, -c(1, 4, 10, 11, 12, 16)]
# summary(os_modeldata)
write.csv(os_modeldata, "os_modeldata.csv")

First split the data into training and test sets.

set.seed(123)
os_modeldata <- read.csv("os_modeldata.csv")
os_modeldata <- os_modeldata[, -1]
os_modeldata$Revenue <- as.factor(os_modeldata$Revenue)
inTrain <- createDataPartition(os_modeldata$Revenue, p = 0.9)[[1]]
Train <- os_modeldata[inTrain, ]
Test <- os_modeldata[-inTrain, ]

Then split the training set into train and validation folds, here as 10-fold CV. Some modeling functions can do CV on their own, but since we'll be using several different modeling packages, we build the folds ourselves to avoid any differences in how each package would split them.

add_cv_cohorts <- function(dat, cv_K) {
    if (nrow(dat)%%cv_K == 0) {
        # if perfectly divisible
        dat$cv_cohort <- sample(rep(1:cv_K, each = (nrow(dat)%/%cv_K)))
    } else {
        # if not perfectly divisible
        dat$cv_cohort <- sample(c(rep(1:(nrow(dat)%%cv_K), each = (nrow(dat)%/%cv_K + 
            1)), rep((nrow(dat)%%cv_K + 1):cv_K, each = (nrow(dat)%/%cv_K))))
    }
    return(dat)
}
# add 10-fold CV labels to the training data
train_cv <- add_cv_cohorts(Train, 10)
# str(train_cv)

First, build a benchmark model: a logistic regression classifier.

train_cv_glm <- train_cv
glm.acc <- glm.f1 <- c()
train_cv_glm$Revenue <- ifelse(train_cv_glm$Revenue == "TRUE", 1, 0)
# str(train_cv_glm)
for (i in 1:10) {
    # Segement my data by fold using the which() function
    indexes <- which(train_cv_glm$cv_cohort == i)
    train <- train_cv_glm[-indexes, ]
    val <- train_cv_glm[indexes, ]
    # Model
    glm.model <- glm(Revenue ~ . - cv_cohort, data = train, family = binomial(link = "logit"))
    # predict
    glm.pred <- predict(glm.model, newdata = val, type = "response")
    glm.pred <- ifelse(glm.pred > 0.5, 1, 0)
    # evaluate
    glm.f1[i] <- F1_Score(val$Revenue, glm.pred, positive = "1")
    glm.acc[i] <- sum(glm.pred == val$Revenue)/nrow(val)
}
# F1 and ACC
glm.acc.train <- round(mean(glm.acc), 5) * 100
glm.f1.train <- round(mean(glm.f1), 5) * 100
# print(glm.cm <- table(glm.pred, val$Revenue))
paste("The accuracy by Logistic regression classifier by 10-fold CV in train data is", 
    glm.acc.train, "%")
paste("The F1-score by Logistic regression classifier by 10-fold CV in train data is", 
    glm.f1.train, "%")
# f1 = 0.50331

Next, build the machine learning models we want to compare, tuning hyperparameters with a grid search.
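As an aside, caret (already loaded above) can combine the cross-validation and the grid search in a single train() call; the hand-rolled loops below are used instead so that every model sees exactly the same folds, but a minimal caret sketch for the random forest case would look like this:

# alternative: grid search + 10-fold CV handled by caret (sketch, not used below)
ctrl <- trainControl(method = "cv", number = 10)
rf_caret <- train(Revenue ~ ., data = Train, method = "rf",
    trControl = ctrl, tuneGrid = expand.grid(mtry = c(1, 3, 5, 7, 10)))
# rf_caret$bestTune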

KNN

# since the knn() function cannot take factors as independent variables,
# re-code the data: factors to dummy variables
train_cv_knn <- as.data.frame(model.matrix(~., train_cv[, -11]))
train_cv_knn$Revenue <- train_cv$Revenue
train_cv_knn <- train_cv_knn[, -1]
# head(train_cv_knn)
knn.grid <- expand.grid(k = c(1:30))
knn.grid$acc <- knn.grid$f1 <- NA
knn.f1 <- knn.acc <- c()
for (k in 1:nrow(knn.grid)) {
    for (i in 1:10) {
        # Segement my data by fold using the which() function
        indexes <- which(train_cv_knn$cv_cohort == i)
        train <- train_cv_knn[-indexes, ]
        val <- train_cv_knn[indexes, ]
        # model and predict
        knn.pred <- knn(train[, -c(34, 35)], val[, -c(34, 35)], train$Revenue, 
            k = k)
        # evaluate
        knn.f1[i] <- F1_Score(val$Revenue, knn.pred, positive = "TRUE")
        knn.acc[i] <- sum(knn.pred == val$Revenue)/nrow(val)
    }
    knn.grid$f1[k] <- mean(knn.f1)
    knn.grid$acc[k] <- mean(knn.acc)
    print(paste("finished with =", k))
}
print(knn.cm <- table(knn.pred, val$Revenue))
knn.grid[which.max(knn.grid$f1), ]
# k = 7, f1=0.5484112, acc=0.885042

SVM

svm.grid <- expand.grid(cost = c(0.1, 1, 10), gamma = seq(0.2, 1, 0.2))
svm.grid$acc <- svm.grid$f1 <- NA
svm.f1 <- svm.acc <- c()
for (k in 1:nrow(svm.grid)) {
    for (i in 1:10) {
        # Segement my data by fold using the which() function
        indexes <- which(train_cv$cv_cohort == i)
        train <- train_cv[-indexes, ]
        val <- train_cv[indexes, ]
        # model
        svm.model <- svm(Revenue ~ ., kernel = "radial", type = "C-classification", 
            gamma = svm.grid$gamma[k], cost = svm.grid$cost[k], data = train[, 
                -12])
        svm.pred <- predict(svm.model, val[, -12])
        # evaluate
        svm.f1[i] <- F1_Score(val$Revenue, svm.pred, positive = "TRUE")
        svm.acc[i] <- sum(svm.pred == val$Revenue)/nrow(val)
    }
    svm.grid$f1[k] <- mean(svm.f1)
    svm.grid$acc[k] <- mean(svm.acc)
    print(paste("finished with:", k))
}
print(svm.cm <- table(svm.pred, val$Revenue))
svm.grid[which.max(svm.grid$f1), ]
# cost=1, gamma=0.2,f1= 0.5900601,acc= 0.8948096

Random Forest

rf.grid <- expand.grid(nt = seq(100, 500, by = 100), mrty = c(1, 3, 5, 7, 10))
rf.grid$acc <- rf.grid$f1 <- NA
rf.f1 <- rf.acc <- c()
for (k in 1:nrow(rf.grid)) {
    for (i in 1:10) {
        # Segement my data by fold using the which() function
        indexes <- which(train_cv$cv_cohort == i)
        train <- train_cv[-indexes, ]
        val <- train_cv[indexes, ]
        # model
        # note: randomForest() takes 'ntree' (an 'n.trees' argument would be silently ignored)
        rf.model <- randomForest(Revenue ~ ., data = train[, -12], ntree = rf.grid$nt[k], 
            mtry = rf.grid$mrty[k])
        rf.pred <- predict(rf.model, val[, -12])
        # evaluate
        rf.f1[i] <- F1_Score(val$Revenue, rf.pred, positive = "TRUE")
        rf.acc[i] <- sum(rf.pred == val$Revenue)/nrow(val)
    }
    rf.grid$f1[k] <- mean(rf.f1)
    rf.grid$acc[k] <- mean(rf.acc)
    print(paste("finished with:", k))
}
print(rf.cm <- table(rf.pred, val$Revenue))
rf.grid[which.max(rf.grid$f1), ]
# nt=200,mtry=3 ,f1 = 0.6330392, acc=0.8960723

Neural Network

nndata <- Train
nndata$Revenue <- ifelse(nndata$Revenue == "TRUE", 1, 0)
train_x <- model.matrix(~., nndata[, -11])
train_x <- train_x[, -1]
train_y <- to_categorical(as.integer(as.matrix(array(nndata[, 11]))), 2)
model <- keras_model_sequential()
# defining model's layers
model %>% layer_dense(units = 30, input_shape = 33, activation = "relu") %>% 
    layer_dense(units = 40, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 60, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 30, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 2, activation = "sigmoid")
# defining model's optimizer
model %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = c("accuracy"))
# Metrics: the performance evaluation module provides a set of functions for
# assessing model performance. We use it to decide when the NN should stop
# training. The ultimate performance measure is F1.
# Check which column in train_y is FALSE
table(train_y[, 1])  # the first column is FALSE
table(train_y[, 1])[[2]]/table(train_y[, 1])[[1]]
# Define a dictionary with your labels and their associated weights
weight = list(5.5, 1)  # the proportion of FALSE to TRUE is about 5.5:1
# fitting the model on the training dataset
model %>% fit(train_x, train_y, epochs = 50, validation_split = 0.2, batch_size = 512, 
    class_weight = weight)
# after epoch = 20, val_loss no longer decreases and val_acc no longer
# increases, so the NN should stop at epoch = 20

Model Comparison

GLM

glmdata <- Train
glmdata$Revenue <- ifelse(glmdata$Revenue == "TRUE", 1, 0)
testglm <- Test
testglm$Revenue <- ifelse(testglm$Revenue == "TRUE", 1, 0)
glm.model.f <- glm(Revenue ~ ., data = glmdata, family = binomial(link = "logit"))
glm.pred.f <- predict(glm.model.f, newdata = Test, type = "response")
glm.pred.f <- ifelse(glm.pred.f > 0.5, 1, 0)
glm.f1.f <- F1_Score(testglm$Revenue, glm.pred.f, positive = "1")
paste("The F1-score by Logistic regression classifier in test data is", glm.f1.f)

KNN

knndata <- as.data.frame(model.matrix(~., Train[, -11]))
knndata <- knndata[, -1]
knntest <- as.data.frame(model.matrix(~., Test[, -11]))
knntest <- knntest[, -1]
knn.model.f.pred <- knn(knndata, knntest, Train$Revenue, k = 7)
knn.f1.f <- F1_Score(Test$Revenue, knn.model.f.pred, positive = "TRUE")
paste("The F1-score by KNN classifier in test data is", knn.f1.f)

SVM

svm.model.f <- svm(Revenue ~ ., kernel = "radial", type = "C-classification", 
    gamma = 0.2, cost = 1, data = Train)
svm.pred.f <- predict(svm.model.f, Test)
svm.f1.f <- F1_Score(Test$Revenue, svm.pred.f, positive = "TRUE")
paste("The F1-score by SVM classifier in test data is", svm.f1.f)

Random Forests

rf.model.f <- randomForest(Revenue ~ ., data = Train, ntree = 200, mtry = 3)
rf.pred.f <- predict(rf.model.f, Test)
rf.f1.f <- F1_Score(Test$Revenue, rf.pred.f, positive = "TRUE")
paste("The F1-score by Random Forests classifier in test data is", rf.f1.f)

NN

nndata <- Train
nndata$Revenue <- ifelse(nndata$Revenue == "TRUE", 1, 0)
train_x <- model.matrix(~., nndata[, -11])
train_x <- train_x[, -1]
train_y <- to_categorical(as.integer(as.matrix(array(nndata[, 11]))), 2)
model <- keras_model_sequential()
# defining model's layers
model %>% layer_dense(units = 30, input_shape = 33, activation = "relu") %>% 
    layer_dense(units = 40, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 60, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 30, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 2, activation = "sigmoid")
# defining model's optimizer
model %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = c("accuracy"))
weight = list(5.5, 1)
model %>% fit(train_x, train_y, epochs = 20, batch_size = 512, class_weight = weight)
# test data
testnn <- Test
testnn$Revenue <- ifelse(testnn$Revenue == "TRUE", 1, 0)
test_x <- model.matrix(~., testnn[, -11])
test_x <- test_x[, -1]
nn.pred <- model %>% predict(test_x)
nn.pred <- as.data.frame(nn.pred)
nn.pred$label <- NA
nn.pred$label <- ifelse(nn.pred$V2 > nn.pred$V1, "TRUE", "FALSE")
nn.pred$label <- as.factor(nn.pred$label)
nn.f1 <- F1_Score(Test$Revenue, nn.pred$label, positive = "TRUE")
paste("The F1-score by Neural network in test data is", nn.f1)

Comparing the results, RF and NN perform best. Finally, let's look at their confusion matrices.

# RF
print(rf.cm.f <- table(rf.pred.f, Test$Revenue))
##          
## rf.pred.f FALSE TRUE
##     FALSE   987   74
##     TRUE     50  116
# NN
print(nn.cm.f <- table(nn.pred$label, Test$Revenue))
##        
##         FALSE TRUE
##   FALSE   980   69
##   TRUE     57  121
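To squeeze a little more out of these confusion matrices, precision and recall for the positive class can be computed directly from them; a short sketch using rf.cm.f (the same works for nn.cm.f):

tp <- rf.cm.f["TRUE", "TRUE"]    # predicted TRUE, actually TRUE
fp <- rf.cm.f["TRUE", "FALSE"]   # predicted TRUE, actually FALSE
fn <- rf.cm.f["FALSE", "TRUE"]   # predicted FALSE, actually TRUE
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 4)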

That concludes this article on the R data modeling workflow. For more on data modeling in R, search WalkonNet's earlier articles; I hope you'll keep supporting WalkonNet!
