A Data Modeling Workflow in R

Intro

While recently reorganizing my data analysis workflow I came across some code I wrote a while ago, so I'm sharing it here. It comes from a project I did as a student, and inexperience caused a few problems at the time; I'll point them out one by one along the way so you can avoid the same pitfalls. The post includes some commentary, and where things are unclear, feel free to leave a comment.

Besides sharing the code, this is also a retrospective on my own old project. It's written in R (still my favorite language); if there's demand for a Python version, I'll cover that in a separate project later.

The post covers data import, cleaning, visualization, feature engineering, and modeling, so feel free to jump to whichever part you need.

Project Background

The data come from the Online Shoppers' Intention dataset: 12,330 rows, with 10 numeric features and 8 categorical features. 'Revenue' is used as the label for modeling. The end goal is to build a model that predicts Revenue from these data.

Setup

First you'll need to download R and its more comfortable companion, RStudio. How? I'll just paste the instructions from my earlier article, haha.

Installing R and RStudio
If you've used R before, feel free to skip this part.
Installing R is simple: just download it from the official website.

Then download RStudio, which is essentially R with superpowers: the interface is far friendlier than plain R and it ships with plenty of helper features. The download link is on the RStudio site.

# Note: RStudio sits on top of R, so you need to install R before you can install and use RStudio.

Once everything is installed, run the following code to load the packages.

setwd("~/Desktop/STAT5003/Ass") #選擇項目存放的位置,同樣這也是你數據csv存放的位置
# install.packages("xxx") 如果之前沒有裝過以下的包,先用這句話來裝包,然後再去load
# the following packages are for the EDA part
library(GGally)
library(ggcorrplot)
library(psych)
library(ggstatsplot)
library(ggplot2)
library(grid)
# the following packages are for the Model part
library(MASS)
library(Boruta)  # Feature selection with the Boruta algorithm
library(caret)
library(MLmetrics)
library(class)
library(neuralnet)
library(e1071)
library(randomForest)
library(keras)

Quite a few packages to load. For installing keras, see my earlier article (R語言基於Keras的MLP神經網絡詳解,
https://blog.csdn.net/zhaotian151/article/details/84305440).

Data Description

First, download the dataset to your computer, then import it into R with the code below.

dataset <- read.csv("online_shoppers_intention.csv")
str(dataset)

The str() function shows the type of every column; the output looks like this:

The columns come as a mix of int, num, factor, and so on. To make the analysis and modeling easier later on, we first unify the data types.

dataset$OperatingSystems <- as.factor(dataset$OperatingSystems)
dataset$Browser <- as.factor(dataset$Browser)
dataset$Region <- as.factor(dataset$Region)
dataset$TrafficType <- as.factor(dataset$TrafficType)
dataset$Weekend <- as.factor(dataset$Weekend)
dataset$Revenue <- as.factor(dataset$Revenue)
dataset$Administrative <- as.numeric(dataset$Administrative)
dataset$Informational <- as.numeric(dataset$Informational)
dataset$ProductRelated <- as.numeric(dataset$ProductRelated)
summary(dataset)
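As a quick extra check alongside summary(), you can also list the resulting column types programmatically; a small sketch:

sapply(dataset, class)          # class of every column
table(sapply(dataset, class))   # counts per type (ideally just factor and numeric now)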

Now the types are essentially unified into factor and numeric, which makes everything downstream easier. Some R packages (especially the modeling ones) have strict requirements on input formats, so sorting this out early really matters: it keeps the data tidy and the code clean and pleasant.
Remember to run summary() once the types are in order; it often surfaces small problems in the data, such as the 'Administrative_Duration' issue below.

A minimum of -1 is clearly absurd (and it's one of the little pitfalls here): we know a duration can't be negative. But that's only our intuition. Since we don't know how this field was defined when the data were collected, we'll never know where the -1 comes from. This is also why I recommend following a data analysis project from the very start, so that you know the data inside out, rather than judging right and wrong purely on intuition as we do here (even though intuition is usually fine).
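Before deciding what to do with those values, it's worth counting how many rows are actually affected; a quick sketch:

# how many negative duration values are there?
sum(dataset$Administrative_Duration < 0, na.rm = TRUE)
sum(dataset$Informational_Duration < 0, na.rm = TRUE)
sum(dataset$ProductRelated_Duration < 0, na.rm = TRUE)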

The variable descriptions are listed below; read them or skip them as you like (but when it's your own project, you absolutely must read the data dictionary carefully).

Variables are described as follows:
Administrative : Administrative Value
Administrative_Duration : Duration in Administrative Page
Informational : Informational Value
Informational_Duration : Duration in Informational Page
ProductRelated : Product Related Value
ProductRelated_Duration : Duration in Product Related Page
BounceRates : Bounce Rates of a web page
ExitRates : Exit rate of a web page
PageValues : Page values of each web page
SpecialDay : Special days like Valentine's Day, etc.
Month : Month of the year
OperatingSystems : Operating system used
Browser : Browser used
Region : Region of the user
TrafficType : Traffic Type
VisitorType : Types of Visitor
Weekend : Weekend or not
Revenue : Revenue will be generated or not

Data Cleaning

The summary in the previous section already showed durations below 0, so we set every negative duration value to NA, then compute the NA rate to decide whether to impute those values or simply delete the rows. My view: if the missing rate is tiny, just delete them; but if your dataset is small to begin with, impute instead, because once the data get too sparse there isn't much left to analyze. What counts as "small" is a judgment call; if people are interested I may write a separate post about it.

dataset$Administrative_Duration[dataset$Administrative_Duration < 0] = NA
dataset$Informational_Duration[dataset$Informational_Duration < 0] = NA
dataset$ProductRelated_Duration[dataset$ProductRelated_Duration < 0] = NA
missing.rate <- 1 - nrow(na.omit(dataset))/nrow(dataset)
paste("missing rate =", missing.rate * 100, "%")

"missing rate = 0.381184103811838 %"還挺小的,所以直接刪掉有問題的數據。

dataset <- na.omit(dataset)
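For reference, had the missing rate been large (or the dataset small, as discussed above), imputation would be the alternative; a minimal sketch filling NA durations with the median of the observed values:

# imputation alternative (not used here, since the affected rows were simply dropped)
impute_median <- function(x) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
}
dur_cols <- c("Administrative_Duration", "Informational_Duration", "ProductRelated_Duration")
# dataset[dur_cols] <- lapply(dataset[dur_cols], impute_median)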

Then remember to run summary() again to check that everything was removed cleanly.
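On top of eyeballing summary(), a one-line assertion (a small sketch) gives a hard check:

stopifnot(sum(is.na(dataset)) == 0)  # fails loudly if any NA slipped through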

Exploratory Analysis and Preprocessing

Numeric Variables

Below are three views: boxplots, ggpairs, and a correlation matrix. Boxplots show the overall distribution of each variable. The pairwise plot from ggpairs shows the distributions together with the pairwise relationships. The correlation matrix focuses on the correlation coefficients and whether the correlations are significant. Each has its own emphasis; use whichever fits your needs.

par(mfrow = c(2, 5))  # arrange the plots in a 2 x 5 grid on a single figure
boxplot(dataset$Administrative, main = "Administrative")
boxplot(dataset$Administrative_Duration, main = "Administrative_Duration")
boxplot(dataset$Informational, main = "Informational")
boxplot(dataset$Informational_Duration, main = "Informational_Duration")
boxplot(dataset$ProductRelated, main = "ProductRelated")
boxplot(dataset$ProductRelated_Duration, main = "ProductRelated_Duration")
boxplot(dataset$BounceRates, main = "BounceRates")
boxplot(dataset$ExitRates, main = "ExitRates")
boxplot(dataset$PageValues, main = "PageValues")
boxplot(dataset$SpecialDay, main = "SpecialDay")

ggpairs(dataset[, c(1:10)])

corr = cor(dataset[, c(1:10)])
p.mat <- cor_pmat(dataset[, c(1:10)], use = "complete", method = "pearson")
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, p.mat = p.mat, 
    insig = "blank")

Categorical Variables

For categorical variables we mainly care about the distributions, so plain bar plots are enough. The code below uses ggplot2, a very handy visualization package. The grid.newpage() part simply places all the plots on one page, so that when you export the figure or render it in a markdown document everything appears on a single, tidy page.

p1 <- ggplot(dataset, aes(x = SpecialDay)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p2 <- ggplot(dataset, aes(x = Month)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p3 <- ggplot(dataset, aes(x = OperatingSystems)) + geom_bar(fill = "#CF6A1A", 
    colour = "black") + theme_bw()
p4 <- ggplot(dataset, aes(x = Browser)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p5 <- ggplot(dataset, aes(x = Region)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p6 <- ggplot(dataset, aes(x = TrafficType)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p7 <- ggplot(dataset, aes(x = VisitorType)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p8 <- ggplot(dataset, aes(x = Weekend)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
p9 <- ggplot(dataset, aes(x = Revenue)) + geom_bar(fill = "#CF6A1A", colour = "black") + 
    theme_bw()
grid.newpage()
pushViewport(viewport(layout = grid.layout(4, 3, heights = unit(c(1, 3, 3, 3), 
    "null"))))
grid.text("Bar Plot of All Categorical Feature", vp = viewport(layout.pos.row = 1, 
    layout.pos.col = 1:3))
vplayout = function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
print(p1, vp = vplayout(2, 1))
print(p2, vp = vplayout(2, 2))
print(p3, vp = vplayout(2, 3))
print(p4, vp = vplayout(3, 1))
print(p5, vp = vplayout(3, 2))
print(p6, vp = vplayout(3, 3))
print(p7, vp = vplayout(4, 1))
print(p8, vp = vplayout(4, 2))
print(p9, vp = vplayout(4, 3))

As you can see, the data are quite skewed, and Revenue, the label we want to predict, is itself very imbalanced (the FALSE and TRUE classes are far from equal). Keep this in mind when preparing the data and choosing a model. I won't go into detail here; there's a lot to say about imbalanced data, maybe another time.
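It's also worth putting a number on that imbalance; a quick sketch:

# how imbalanced is the label?
table(dataset$Revenue)
round(prop.table(table(dataset$Revenue)), 3)  # in the published dataset, roughly 85% FALSE vs 15% TRUE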

For an ordinary project the exploratory analysis could stop here: we've looked at the distribution of every variable and have a direct feel for the data. But let's not stop there; let's look at each categorical variable a bit more carefully.

Start with SpecialDay. In the raw data it takes values like 0, 0.2, 0.4, representing closeness to a holiday; for example 1 is the holiday itself and 0.2 is a few days before it (as far as I remember). That's rather confusing, and I don't know exactly how the values were assigned (again, this is why deep familiarity with your project helps: if you know the data well, many of these analysis steps can be skipped). So we'll let the data tell us how SpecialDay should enter the model.

special_day_check <- dataset[, c(10, 18)]
special_day_check$Revenue <- ifelse(special_day_check$Revenue == "FALSE", 0, 
    1)
special_day_check$SpecialDay[special_day_check$SpecialDay == 0] = NA
special_day_check <- na.omit(special_day_check)
special_day_glm <- glm(Revenue ~ SpecialDay, data = special_day_check, family = binomial(link = "logit"))
summary(special_day_glm)
## 
## Call:
## glm(formula = Revenue ~ SpecialDay, family = binomial(link = "logit"), 
##     data = special_day_check)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3961  -0.3756  -0.3560  -0.3374   2.4491  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3954     0.2986  -8.021 1.05e-15 ***
## SpecialDay   -0.5524     0.4764  -1.159    0.246    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 578.11  on 1247  degrees of freedom
## Residual deviance: 576.77  on 1246  degrees of freedom
## AIC: 580.77
## 
## Number of Fisher Scoring iterations: 5

First, we check whether SpecialDay should be a numeric variable. We fit a glm of the form revenue = a + b * SpecialDay and find that the p-value for SpecialDay is 0.246 (> 0.05), so as a numeric variable SpecialDay has no significant effect on Revenue; it can therefore be treated as a categorical variable instead.

Now analyze it as a categorical variable with the ggbarstats() function. ggstatsplot is an extension of ggplot2 that produces good-looking plots with the statistical results printed on them, including the details of the tests, which makes it very useful for anyone who regularly does statistical analysis.

ggbarstats(data = dataset, main = Revenue, condition = SpecialDay, sampling.plan = "jointMulti", 
    title = "Revenue by Special Days", xlab = "Special Days", perc.k = 0.5, 
    x.axis.orientation = "slant", ggstatsplot.layer = FALSE, messages = FALSE)

This function draws a bar chart for the categorical variable. The upper part of the annotation ($\chi^2_{\mathrm{Pearson}}$, $p$, $V_{\mathrm{Cramer}}$, etc.) reports the frequentist statistics, and the lower part ($\log_e(BF_{01})$, etc.) reports the Bayesian ones.

In this project we mainly care about the p-value. Here p < 0.001 and every bar is marked ***, which is highly significant, so we can settle on using SpecialDay as a categorical variable.
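The conversion itself isn't shown in the original code, which leaves it implicit; if you want SpecialDay stored as a factor from this point on, it's a one-liner (a sketch, assuming you keep the variable at this stage):

dataset$SpecialDay <- as.factor(dataset$SpecialDay)  # treat SpecialDay as categorical from here on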

Then repeat this check for every categorical variable. I won't walk through each one; here is a representative example.

Let's look at the ggbarstats() output for OperatingSystems.

ggbarstats(data = dataset, main = Revenue, condition = OperatingSystems, sampling.plan = "jointMulti", 
    title = "Revenue by Different Operating Systems", xlab = "Operating Systems", 
    perc.k = 0.5, x.axis.orientation = "slant", ggstatsplot.layer = FALSE, messages = FALSE)

The overall p < 0.001, but some subcategories have so few samples that their bars are marked ns. Data that sparse carries little statistical value, so we merge the small subcategories into one and look again.

dataset$OperatingSystems <- as.integer(dataset$OperatingSystems)
dataset$OperatingSystems[dataset$OperatingSystems == "5"] <- "other"
dataset$OperatingSystems[dataset$OperatingSystems == "6"] <- "other"
dataset$OperatingSystems[dataset$OperatingSystems == "7"] <- "other"
dataset$OperatingSystems <- as.factor(dataset$OperatingSystems)
ggbarstats(data = dataset, main = Revenue, condition = OperatingSystems, sampling.plan = "jointMulti", 
    title = "Revenue by Different Operating Systems", xlab = "Operating Systems", 
    perc.k = 0.5, x.axis.orientation = "slant", ggstatsplot.layer = FALSE, messages = FALSE)

That looks much better now: everything is significant.
That concludes the preprocessing and exploratory analysis.

Feature Selection

The point of feature engineering is ultimately to improve model performance: if you have very few features you might create second- or third-order features to enrich the data, and if you have too many you might reduce the dimensionality. I didn't do much feature engineering here, just a basic screening to drop the useless features. The logic: first use PCA to see how many features to keep, then use the Boruta algorithm and stepAIC to pick them.
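One note before the code: from here on the snippets refer to a data frame called os, which is never created explicitly above; it appears to simply be the cleaned dataset after the recoding steps, so the following assumption is needed for the code to run as shown:

# assumption: 'os' is the processed dataset from the sections above
os <- dataset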

# PCA: since PCA only works on numeric data, we use os[, c(1:9)]
pcdata <- os[, c(1:9)]
pclabel <- ifelse(os$Revenue == "TRUE", "red", "blue")
pc <- princomp(os[, c(1:9)], cor = TRUE, scores = TRUE)
summary(pc)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.8387377 1.2923744 1.0134790 1.0020214 0.9697619
## Proportion of Variance 0.3756618 0.1855813 0.1141266 0.1115608 0.1044931
## Cumulative Proportion  0.3756618 0.5612431 0.6753697 0.7869305 0.8914236
##                            Comp.6     Comp.7    Comp.8      Comp.9
## Standard deviation     0.65008195 0.59319914 0.3510795 0.281849096
## Proportion of Variance 0.04695628 0.03909836 0.0136952 0.008826546
## Cumulative Proportion  0.93837989 0.97747825 0.9911735 1.000000000
plot(pc, type = "lines")

The PCA shows that keeping 7 numeric variables already retains over 95% of the variance, so from here on we aim to keep at least 7 numeric variables.
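If you'd rather extract the 95% threshold programmatically instead of reading it off the printed summary, a small sketch:

# cumulative proportion of variance, computed from the princomp object
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
round(cum_var, 4)
which(cum_var >= 0.95)[1]  # smallest number of components that reaches 95% (7 here)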

Boruta Algorithm

set.seed(123)
boruta.train <- Boruta(Revenue ~ ., data = os, doTrace = 2, maxRuns = 15)
print(boruta.train)
# Boruta performed 14 iterations in 3.920271 mins.  13 attributes confirmed
# important: Administrative, Administrative_Duration, BounceRates, Browser,
# ExitRates and 8 more; 1 attributes confirmed unimportant: SpecialDay; 2
# tentative attributes left: OperatingSystems, Weekend. So SpecialDay can be
# dropped when we fit the model; OperatingSystems and Weekend need to be
# checked in other ways.

StepAIC

full.model <- glm(Revenue ~ . - SpecialDay, data = os, family = binomial(link = "logit"))

# Backward Stepwise AIC
stepback <- stepAIC(full.model, direction = "backward", steps = 3)
summary(stepback)

# OperatingSystems and Weekend both sit above <none>; combined with the earlier
# Boruta result, they can be dropped when we fit the model. Browser has the
# minimum AIC, so it can be dropped as well. PCA says we should keep 7 numeric
# variables when fitting the model, so two numeric variables have to go;
# Informational_Duration and Administrative have the minimum AIC among the
# numeric variables, so remove those two.

Combining the three feature-selection methods above: SpecialDay, OperatingSystems, Weekend, Browser, Informational_Duration, and Administrative should be removed when we build the models. Feel free to run the code above yourself; since it takes a while, the results are pasted directly into the code blocks as comments.
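If you prefer, the drop can also be written by name rather than by position (positional indices are easy to get wrong and can silently drift if the column order changes); a sketch assuming the six variables just listed are the ones to remove:

# name-based alternative to the index-based subsetting used in the next section
drop_cols <- c("SpecialDay", "OperatingSystems", "Weekend", "Browser",
    "Informational_Duration", "Administrative")
os_modeldata <- os[, setdiff(names(os), drop_cols)]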

Modeling

Now assemble the data used for modeling and get ready to build the models.

os_modeldata <- os[, -c(1, 4, 10, 11, 12, 16)]
# summary(os_modeldata)
write.csv(os_modeldata, "os_modeldata.csv")

First split the data into training and test sets.

set.seed(123)
os_modeldata <- read.csv("os_modeldata.csv")
os_modeldata <- os_modeldata[, -1]
os_modeldata$Revenue <- as.factor(os_modeldata$Revenue)
inTrain <- createDataPartition(os_modeldata$Revenue, p = 0.9)[[1]]
Train <- os_modeldata[inTrain, ]
Test <- os_modeldata[-inTrain, ]

Then split the training set into train and validation folds, here as 10-fold CV. Some modeling functions can do CV on their own, but since we'll be using several different modeling packages, we build the folds ourselves to avoid any differences in how each package would split them.

add_cv_cohorts <- function(dat, cv_K) {
    if (nrow(dat)%%cv_K == 0) {
        # if perfectly divisible
        dat$cv_cohort <- sample(rep(1:cv_K, each = (nrow(dat)%/%cv_K)))
    } else {
        # if not perfectly divisible
        dat$cv_cohort <- sample(c(rep(1:(nrow(dat)%%cv_K), each = (nrow(dat)%/%cv_K + 
            1)), rep((nrow(dat)%%cv_K + 1):cv_K, each = (nrow(dat)%/%cv_K))))
    }
    return(dat)
}
# add 10-fold CV labels to the training data
train_cv <- add_cv_cohorts(Train, 10)
# str(train_cv)

First, build a benchmark model: a logistic regression classifier.

train_cv_glm <- train_cv
glm.acc <- glm.f1 <- c()
train_cv_glm$Revenue <- ifelse(train_cv_glm$Revenue == "TRUE", 1, 0)
# str(train_cv_glm)
for (i in 1:10) {
    # Segement my data by fold using the which() function
    indexes <- which(train_cv_glm$cv_cohort == i)
    train <- train_cv_glm[-indexes, ]
    val <- train_cv_glm[indexes, ]
    # Model
    glm.model <- glm(Revenue ~ . - cv_cohort, data = train, family = binomial(link = "logit"))
    # predict
    glm.pred <- predict(glm.model, newdata = val, type = "response")
    glm.pred <- ifelse(glm.pred > 0.5, 1, 0)
    # evaluate
    glm.f1[i] <- F1_Score(val$Revenue, glm.pred, positive = "1")
    glm.acc[i] <- sum(glm.pred == val$Revenue)/nrow(val)
}
# F1 and ACC
glm.acc.train <- round(mean(glm.acc), 5) * 100
glm.f1.train <- round(mean(glm.f1), 5) * 100
# print(glm.cm <- table(glm.pred, val$Revenue))
paste("The accuracy by Logistic regression classifier by 10-fold CV in train data is", 
    glm.acc.train, "%")
paste("The F1-score by Logistic regression classifier by 10-fold CV in train data is", 
    glm.f1.train, "%")
# f1 = 0.50331

Next, build the machine learning models we want to compare, tuning hyperparameters with a grid search.
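As an aside, caret (already loaded above) can combine the cross-validation and the grid search in a single train() call; the hand-rolled loops below are used instead so that every model sees exactly the same folds, but a minimal caret sketch for the random forest case would look like this:

# alternative: grid search + 10-fold CV handled by caret (sketch, not used below)
ctrl <- trainControl(method = "cv", number = 10)
rf_caret <- train(Revenue ~ ., data = Train, method = "rf",
    trControl = ctrl, tuneGrid = expand.grid(mtry = c(1, 3, 5, 7, 10)))
# rf_caret$bestTune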

KNN

# since the knn() function cannot take factors as independent variables,
# re-code the data: factors to dummy variables
train_cv_knn <- as.data.frame(model.matrix(~., train_cv[, -11]))
train_cv_knn$Revenue <- train_cv$Revenue
train_cv_knn <- train_cv_knn[, -1]
# head(train_cv_knn)
knn.grid <- expand.grid(k = c(1:30))
knn.grid$acc <- knn.grid$f1 <- NA
knn.f1 <- knn.acc <- c()
for (k in 1:nrow(knn.grid)) {
    for (i in 1:10) {
        # Segement my data by fold using the which() function
        indexes <- which(train_cv_knn$cv_cohort == i)
        train <- train_cv_knn[-indexes, ]
        val <- train_cv_knn[indexes, ]
        # model and predict
        knn.pred <- knn(train[, -c(34, 35)], val[, -c(34, 35)], train$Revenue, 
            k = k)
        # evaluate
        knn.f1[i] <- F1_Score(val$Revenue, knn.pred, positive = "TRUE")
        knn.acc[i] <- sum(knn.pred == val$Revenue)/nrow(val)
    }
    knn.grid$f1[k] <- mean(knn.f1)
    knn.grid$acc[k] <- mean(knn.acc)
    print(paste("finished with =", k))
}
print(knn.cm <- table(knn.pred, val$Revenue))
knn.grid[which.max(knn.grid$f1), ]
# k = 7, f1=0.5484112, acc=0.885042

SVM

svm.grid <- expand.grid(cost = c(0.1, 1, 10), gamma = seq(0.2, 1, 0.2))
svm.grid$acc <- svm.grid$f1 <- NA
svm.f1 <- svm.acc <- c()
for (k in 1:nrow(svm.grid)) {
    for (i in 1:10) {
        # Segement my data by fold using the which() function
        indexes <- which(train_cv$cv_cohort == i)
        train <- train_cv[-indexes, ]
        val <- train_cv[indexes, ]
        # model
        svm.model <- svm(Revenue ~ ., kernel = "radial", type = "C-classification", 
            gamma = svm.grid$gamma[k], cost = svm.grid$cost[k], data = train[, 
                -12])
        svm.pred <- predict(svm.model, val[, -12])
        # evaluate
        svm.f1[i] <- F1_Score(val$Revenue, svm.pred, positive = "TRUE")
        svm.acc[i] <- sum(svm.pred == val$Revenue)/nrow(val)
    }
    svm.grid$f1[k] <- mean(svm.f1)
    svm.grid$acc[k] <- mean(svm.acc)
    print(paste("finished with:", k))
}
print(svm.cm <- table(svm.pred, val$Revenue))
svm.grid[which.max(svm.grid$f1), ]
# cost=1, gamma=0.2,f1= 0.5900601,acc= 0.8948096

Random Forest

rf.grid <- expand.grid(nt = seq(100, 500, by = 100), mrty = c(1, 3, 5, 7, 10))
rf.grid$acc <- rf.grid$f1 <- NA
rf.f1 <- rf.acc <- c()
for (k in 1:nrow(rf.grid)) {
    for (i in 1:10) {
        # Segement my data by fold using the which() function
        indexes <- which(train_cv$cv_cohort == i)
        train <- train_cv[-indexes, ]
        val <- train_cv[indexes, ]
        # model
        # note: randomForest() takes 'ntree' (an 'n.trees' argument would be silently ignored)
        rf.model <- randomForest(Revenue ~ ., data = train[, -12], ntree = rf.grid$nt[k], 
            mtry = rf.grid$mrty[k])
        rf.pred <- predict(rf.model, val[, -12])
        # evaluate
        rf.f1[i] <- F1_Score(val$Revenue, rf.pred, positive = "TRUE")
        rf.acc[i] <- sum(rf.pred == val$Revenue)/nrow(val)
    }
    rf.grid$f1[k] <- mean(rf.f1)
    rf.grid$acc[k] <- mean(rf.acc)
    print(paste("finished with:", k))
}
print(rf.cm <- table(rf.pred, val$Revenue))
rf.grid[which.max(rf.grid$f1), ]
# nt=200,mtry=3 ,f1 = 0.6330392, acc=0.8960723

Neural Network

nndata <- Train
nndata$Revenue <- ifelse(nndata$Revenue == "TRUE", 1, 0)
train_x <- model.matrix(~., nndata[, -11])
train_x <- train_x[, -1]
train_y <- to_categorical(as.integer(as.matrix(array(nndata[, 11]))), 2)
model <- keras_model_sequential()
# defining model's layers
model %>% layer_dense(units = 30, input_shape = 33, activation = "relu") %>% 
    layer_dense(units = 40, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 60, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 30, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 2, activation = "sigmoid")
# defining model's optimizer
model %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = c("accuracy"))
# Metrics: the performance evaluation module provides a set of functions for
# assessing model performance. We use it to decide when the NN should stop
# training. The ultimate performance measure is F1.
# Check which column in train_y is FALSE
table(train_y[, 1])  # the first column is FALSE
table(train_y[, 1])[[2]]/table(train_y[, 1])[[1]]
# Define a dictionary with your labels and their associated weights
weight = list(5.5, 1)  # the proportion of FALSE to TRUE is about 5.5:1
# fitting the model on the training dataset
model %>% fit(train_x, train_y, epochs = 50, validation_split = 0.2, batch_size = 512, 
    class_weight = weight)
# after epoch = 20, val_loss no longer decreases and val_acc no longer
# increases, so the NN should stop at epoch = 20

Model Comparison

GLM

glmdata <- Train
glmdata$Revenue <- ifelse(glmdata$Revenue == "TRUE", 1, 0)
testglm <- Test
testglm$Revenue <- ifelse(testglm$Revenue == "TRUE", 1, 0)
glm.model.f <- glm(Revenue ~ ., data = glmdata, family = binomial(link = "logit"))
glm.pred.f <- predict(glm.model.f, newdata = Test, type = "response")
glm.pred.f <- ifelse(glm.pred.f > 0.5, 1, 0)
glm.f1.f <- F1_Score(testglm$Revenue, glm.pred.f, positive = "1")
paste("The F1-score by Logistic regression classifier in test data is", glm.f1.f)

KNN

knndata <- as.data.frame(model.matrix(~., Train[, -11]))
knndata <- knndata[, -1]
knntest <- as.data.frame(model.matrix(~., Test[, -11]))
knntest <- knntest[, -1]
knn.model.f.pred <- knn(knndata, knntest, Train$Revenue, k = 7)
knn.f1.f <- F1_Score(Test$Revenue, knn.model.f.pred, positive = "TRUE")
paste("The F1-score by KNN classifier in test data is", knn.f1.f)

SVM

svm.model.f <- svm(Revenue ~ ., kernel = "radial", type = "C-classification", 
    gamma = 0.2, cost = 1, data = Train)
svm.pred.f <- predict(svm.model.f, Test)
svm.f1.f <- F1_Score(Test$Revenue, svm.pred.f, positive = "TRUE")
paste("The F1-score by SVM classifier in test data is", svm.f1.f)

Random Forests

rf.model.f <- randomForest(Revenue ~ ., data = Train, ntree = 200, mtry = 3)
rf.pred.f <- predict(rf.model.f, Test)
rf.f1.f <- F1_Score(Test$Revenue, rf.pred.f, positive = "TRUE")
paste("The F1-score by Random Forests classifier in test data is", rf.f1.f)

NN

nndata <- Train
nndata$Revenue <- ifelse(nndata$Revenue == "TRUE", 1, 0)
train_x <- model.matrix(~., nndata[, -11])
train_x <- train_x[, -1]
train_y <- to_categorical(as.integer(as.matrix(array(nndata[, 11]))), 2)
model <- keras_model_sequential()
# defining model's layers
model %>% layer_dense(units = 30, input_shape = 33, activation = "relu") %>% 
    layer_dense(units = 40, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 60, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 30, activation = "relu") %>% layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 2, activation = "sigmoid")
# defining model's optimizer
model %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = c("accuracy"))
weight = list(5.5, 1)
model %>% fit(train_x, train_y, epochs = 20, batch_size = 512, class_weight = weight)
# test data
testnn <- Test
testnn$Revenue <- ifelse(testnn$Revenue == "TRUE", 1, 0)
test_x <- model.matrix(~., testnn[, -11])
test_x <- test_x[, -1]
nn.pred <- model %>% predict(test_x)
nn.pred <- as.data.frame(nn.pred)
nn.pred$label <- NA
nn.pred$label <- ifelse(nn.pred$V2 > nn.pred$V1, "TRUE", "FALSE")
nn.pred$label <- as.factor(nn.pred$label)
nn.f1 <- F1_Score(Test$Revenue, nn.pred$label, positive = "TRUE")
paste("The F1-score by Neural network in test data is", nn.f1)

Comparing the results, RF and NN perform best. Finally, let's look at their confusion matrices.

# RF
print(rf.cm.f <- table(rf.pred.f, Test$Revenue))
##          
## rf.pred.f FALSE TRUE
##     FALSE   987   74
##     TRUE     50  116
# NN
print(nn.cm.f <- table(nn.pred$label, Test$Revenue))
##        
##         FALSE TRUE
##   FALSE   980   69
##   TRUE     57  121
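To squeeze a little more out of these confusion matrices, precision and recall for the positive class can be computed directly from them; a short sketch using rf.cm.f (the same works for nn.cm.f):

tp <- rf.cm.f["TRUE", "TRUE"]    # predicted TRUE, actually TRUE
fp <- rf.cm.f["TRUE", "FALSE"]   # predicted TRUE, actually FALSE
fn <- rf.cm.f["FALSE", "TRUE"]   # predicted FALSE, actually TRUE
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 4)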

That concludes this article on the R data modeling workflow. For more on data modeling in R, search WalkonNet's earlier articles; I hope you'll keep supporting WalkonNet!
