A Data Modeling Workflow in R
Intro
While organizing my data analysis workflows recently, I came across some code I wrote a while back and decided to share it. It comes from a project I did as a student; lack of experience led to a few problems at the time, and I will discuss them bit by bit so you can avoid the same pitfalls. The write-up includes some explanations; if anything is unclear, feel free to leave a comment.
Besides sharing, this is also a retrospective on my own earlier project. It is again in R (my favorite language, after all). If there is demand for Python, I will post other projects later.
This post covers code for data import, cleaning, visualization, feature engineering, and modeling; pick whichever parts you need.
Project Background
The data come from the Online Shopper's Intention dataset: 12,330 rows, with 10 numerical features and 8 categorical features. 'Revenue' is used as the label. The goal is to build a model that predicts Revenue from the available features.
Setup
First you need to install R, plus its much more comfortable companion RStudio. How? I'll just paste the instructions from my earlier post, haha.
Installing R and RStudio
If you have used R before, skip this part.
Installing R is straightforward: download it from the official website.
Then download RStudio, which is basically R with superpowers: the interface is far friendlier than plain R and it comes with plenty of helper features. Download it from its official site.
# Note: RStudio runs on top of R, so install R first before installing RStudio.
Once everything is installed, run the following code to load the packages.
setwd("~/Desktop/STAT5003/Ass")  # set the working directory; this is also where the csv data are stored
# install.packages("xxx")  # if any package below is not installed yet, install it first, then load it
# the following packages are for the EDA part
library(GGally)
library(ggcorrplot)
library(psych)
library(ggstatsplot)
library(ggplot2)
library(grid)
# the following packages are for the Model part
library(MASS)
library(Boruta)  # Feature selection with the Boruta algorithm
library(caret)
library(MLmetrics)
library(class)
library(neuralnet)
library(e1071)
library(randomForest)
library(keras)
That is quite a few packages. For installing keras, see my earlier post "R語言基於Keras的MLP神經網絡詳解" (https://blog.csdn.net/zhaotian151/article/details/84305440).
Data Description
First, download the data to your machine, then load it into R with the code below.
dataset <- read.csv("online_shoppers_intention.csv")
str(dataset)
The str() function shows the type of every column in the data frame (output omitted here).
At this point the columns are a mix of int, num, factor, and so on. To make the later analysis and modeling easier, we first unify the data types.
dataset$OperatingSystems <- as.factor(dataset$OperatingSystems)
dataset$Browser <- as.factor(dataset$Browser)
dataset$Region <- as.factor(dataset$Region)
dataset$TrafficType <- as.factor(dataset$TrafficType)
dataset$Weekend <- as.factor(dataset$Weekend)
dataset$Revenue <- as.factor(dataset$Revenue)
dataset$Administrative <- as.numeric(dataset$Administrative)
dataset$Informational <- as.numeric(dataset$Informational)
dataset$ProductRelated <- as.numeric(dataset$ProductRelated)
summary(dataset)
Now the types are basically unified into factor and numeric, which makes the later steps easier. Some R packages (especially modeling packages) are picky about input formats, so sorting this out early really matters: it helps you keep the data tidy and the code clean.
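If you have many columns to convert, the same thing can be done in one pass. Here is a small sketch (not part of the original code; the column-name vectors simply mirror the explicit conversions above):
# convert batches of columns in one go (equivalent to the explicit calls above)
factor_cols  <- c("OperatingSystems", "Browser", "Region", "TrafficType",
                  "Weekend", "Revenue")
numeric_cols <- c("Administrative", "Informational", "ProductRelated")
dataset[factor_cols]  <- lapply(dataset[factor_cols], as.factor)
dataset[numeric_cols] <- lapply(dataset[numeric_cols], as.numeric)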
After unifying the types, remember to run summary(); it can reveal small problems in the data, such as the 'Administrative_Duration' column below.
A minimum of -1 is clearly odd (and yes, it is a little trap): we know a duration cannot be negative. But that is our subjective judgment. Since we do not know how this field was defined when the data were collected, we will never know why the -1 is there. This is also why I recommend following a data analysis project from the very beginning, so you know the data inside out, rather than judging right and wrong purely on intuition as we do here (even though intuition is usually fine).
The variable descriptions are listed below; skim them or not as you like (but when it is your own project, read them very, very carefully).
Variables are described as follows:
Administrative : Administrative Value
Administrative_Duration : Duration in Administrative Page
Informational : Informational Value
Informational_Duration : Duration in Informational Page
ProductRelated : Product Related Value
ProductRelated_Duration : Duration in Product Related Page
BounceRates : Bounce Rates of a web page
ExitRates : Exit rate of a web page
PageValues : Page values of each web page
SpecialDay : Special days like Valentine's Day, etc.
Month : Month of the year
OperatingSystems : Operating system used
Browser : Browser used
Region : Region of the user
TrafficType : Traffic Type
VisitorType : Types of Visitor
Weekend : Weekend or not
Revenue : Revenue will be generated or not
Data Cleaning
The summary above already showed negative durations, so we set every negative value in the duration-related columns to NA and then compute the NA rate to decide between imputing and deleting. Personally, if the missing rate is tiny, just delete; but if your dataset is small to begin with, impute instead, because a dataset that is too small is hardly worth analyzing. How small is "too small" is a judgment call; I could write a separate post on that if people are interested.
dataset$Administrative_Duration[dataset$Administrative_Duration < 0] = NA
dataset$Informational_Duration[dataset$Informational_Duration < 0] = NA
dataset$ProductRelated_Duration[dataset$ProductRelated_Duration < 0] = NA
missing.rate <- 1 - nrow(na.omit(dataset))/nrow(dataset)
paste("missing rate =", missing.rate * 100, "%")
"missing rate = 0.381184103811838 %"還挺小的,所以直接刪掉有問題的數據。
dataset <- na.omit(dataset)
Then remember to run summary() again to confirm everything was removed.
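For reference, if the missing rate had been too large to justify dropping rows, a simple median imputation would be one option. This is just a sketch of that alternative, not something the original analysis does:
# alternative: median-impute the duration columns instead of dropping rows
duration_cols <- c("Administrative_Duration", "Informational_Duration",
                   "ProductRelated_Duration")
for (col in duration_cols) {
    med <- median(dataset[[col]], na.rm = TRUE)    # median of the observed values
    dataset[[col]][is.na(dataset[[col]])] <- med   # fill the NAs with it
}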
Exploratory Analysis and Preprocessing
Numerical Variables
The three blocks below produce box plots, a ggpairs matrix, and a correlation matrix. Box plots show the overall distribution of each variable. The ggpairs plot shows pairwise relationships together with the distributions. The correlation matrix focuses on the correlation coefficients and whether they are significant. Each has its own emphasis; use whichever you need.
par(mfrow = c(2, 5))  # arrange the plots in 2 rows and 5 columns on one figure
boxplot(dataset$Administrative, main = "Administrative")
boxplot(dataset$Administrative_Duration, main = "Administrative_Duration")
boxplot(dataset$Informational, main = "Informational")
boxplot(dataset$Informational_Duration, main = "Informational_Duration")
boxplot(dataset$ProductRelated, main = "ProductRelated")
boxplot(dataset$ProductRelated_Duration, main = "ProductRelated_Duration")
boxplot(dataset$BounceRates, main = "BounceRates")
boxplot(dataset$ExitRates, main = "ExitRates")
boxplot(dataset$PageValues, main = "PageValues")
boxplot(dataset$SpecialDay, main = "SpecialDay")
ggpairs(dataset[, c(1:10)])
corr = cor(dataset[, c(1:10)])
p.mat <- cor_pmat(dataset[, c(1:10)], use = "complete", method = "pearson")
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, p.mat = p.mat,
    insig = "blank")
Categorical Variables
For categorical variables we mainly look at their distributions, so simple bar plots are enough. The code below uses ggplot2, an excellent visualization package. The grid.newpage() part is just to put all the plots on a single page, so that everything shows up together when you export the figure or render it in a markdown report, which looks neater.
p1 <- ggplot(dataset, aes(x = SpecialDay)) +
    geom_bar(fill = "#CF6A1A", colour = "black") + theme_bw()
p2 <- ggplot(dataset, aes(x = Month)) +
    geom_bar(fill = "#CF6A1A", colour = "black") + theme_bw()
p3 <- ggplot(dataset, aes(x = OperatingSystems)) +
    geom_bar(fill = "#CF6A1A", colour = "black") + theme_bw()
p4 <- ggplot(dataset, aes(x = Browser)) +
    geom_bar(fill = "#CF6A1A", colour = "black") + theme_bw()
p5 <- ggplot(dataset, aes(x = Region)) +
    geom_bar(fill = "#CF6A1A", colour = "black") + theme_bw()
p6 <- ggplot(dataset, aes(x = TrafficType)) +
    geom_bar(fill = "#CF6A1A", colour = "black") + theme_bw()
p7 <- ggplot(dataset, aes(x = VisitorType)) +
    geom_bar(fill = "#CF6A1A", colour = "black") + theme_bw()
p8 <- ggplot(dataset, aes(x = Weekend)) +
    geom_bar(fill = "#CF6A1A", colour = "black") + theme_bw()
p9 <- ggplot(dataset, aes(x = Revenue)) +
    geom_bar(fill = "#CF6A1A", colour = "black") + theme_bw()
grid.newpage()
pushViewport(viewport(layout = grid.layout(4, 3, heights = unit(c(1, 3, 3, 3), "null"))))
grid.text("Bar Plot of All Categorical Feature",
    vp = viewport(layout.pos.row = 1, layout.pos.col = 1:3))
vplayout = function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
print(p1, vp = vplayout(2, 1))
print(p2, vp = vplayout(2, 2))
print(p3, vp = vplayout(2, 3))
print(p4, vp = vplayout(3, 1))
print(p5, vp = vplayout(3, 2))
print(p6, vp = vplayout(3, 3))
print(p7, vp = vplayout(4, 1))
print(p8, vp = vplayout(4, 2))
print(p9, vp = vplayout(4, 3))
As we can see, the data are quite skewed, and the Revenue label we want to predict is very imbalanced (FALSE far outnumbers TRUE). Keep this in mind when processing the data and choosing models. I will not go into detail here; there is a lot to say about imbalanced data, maybe in a later post.
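Just to illustrate one common option: the classes can be rebalanced by resampling with caret's downSample()/upSample(). This is a sketch only, and in a real project you would resample the training split rather than the full dataset:
# illustration: rebalance Revenue by down- or up-sampling (apply to the training split in practice)
predictors <- dataset[, setdiff(names(dataset), "Revenue")]
balanced_down <- downSample(x = predictors, y = dataset$Revenue, yname = "Revenue")
balanced_up   <- upSample(x = predictors, y = dataset$Revenue, yname = "Revenue")
table(balanced_down$Revenue)   # both classes now have the minority-class count
table(balanced_up$Revenue)     # both classes now have the majority-class count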
Strictly speaking, for an ordinary project the exploratory analysis could stop here: we have looked at the distribution of every variable and built some intuition about the data. But let's not settle for that; we will analyze each categorical variable a bit more carefully.
Start with SpecialDay. The raw data encode it as values like 0, 0.2, 0.4, representing closeness to a holiday; for example 1 means the holiday itself and 0.2 means a few days before it (roughly, as far as I remember). This is rather confusing, and I do not know exactly how it was derived (again, this is why you should know your project inside out; with that knowledge many analysis steps can be skipped). So we let the data tell us how SpecialDay should enter the later models.
special_day_check <- dataset[, c(10, 18)]
special_day_check$Revenue <- ifelse(special_day_check$Revenue == "FALSE", 0, 1)
special_day_check$SpecialDay[special_day_check$SpecialDay == 0] = NA
special_day_check <- na.omit(special_day_check)
special_day_glm <- glm(Revenue ~ SpecialDay, data = special_day_check,
    family = binomial(link = "logit"))
summary(special_day_glm)
##
## Call:
## glm(formula = Revenue ~ SpecialDay, family = binomial(link = "logit"),
##     data = special_day_check)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.3961  -0.3756  -0.3560  -0.3374   2.4491
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -2.3954     0.2986  -8.021 1.05e-15 ***
## SpecialDay   -0.5524     0.4764  -1.159    0.246
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 578.11  on 1247  degrees of freedom
## Residual deviance: 576.77  on 1246  degrees of freedom
## AIC: 580.77
##
## Number of Fisher Scoring iterations: 5
First we check whether SpecialDay should be kept as a numerical variable. We fit a GLM (revenue = a + b * special_day) and find that the p-value for SpecialDay is 0.246 (> 0.05), so as a numerical variable SpecialDay has no significant effect on Revenue. We therefore treat it as a categorical variable instead.
Now analyze it as a categorical variable with the ggbarstats() function. ggstatsplot is an extension of ggplot2 that produces polished plots with the statistical analysis printed on them, including the details of the tests; it is very handy for anyone who regularly needs statistical analysis.
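A quick complementary check (my own addition, not in the original analysis) is to compare the numeric encoding with a factor encoding of the same variable by AIC, reusing the special_day_check data from above:
# compare numeric vs. factor encoding of SpecialDay by AIC
glm_factor <- glm(Revenue ~ factor(SpecialDay), data = special_day_check,
                  family = binomial(link = "logit"))
AIC(special_day_glm, glm_factor)  # lower AIC = better fit/complexity trade-off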
ggbarstats(data = dataset, main = Revenue, condition = SpecialDay,
    sampling.plan = "jointMulti", title = "Revenue by Special Days",
    xlab = "Special Days", perc.k = 0.5, x.axis.orientation = "slant",
    ggstatsplot.layer = FALSE, messages = FALSE)
This function draws a bar chart for categorical variables. The statistics in the upper part of the plot ($\chi^2_{\mathrm{Pearson}}$, $p$, $V_{\mathrm{Cramer}}$, etc.) come from the traditional frequentist analysis, while those in the lower part ($\log_e(BF_{01})$, etc.) are Bayesian.
In this project we mainly care about the p-value. Here p < 0.001 and every bar is marked ***, i.e. highly significant, so we can confirm that SpecialDay will be used as a categorical variable.
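If you only need the frequentist test that ggbarstats() reports at the top of the plot, a plain chi-squared test of independence gives essentially the same information (a minimal equivalent, added here for reference):
# chi-squared test of independence between SpecialDay (treated as categories) and Revenue
chisq.test(table(dataset$SpecialDay, dataset$Revenue))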
We then repeat this for every categorical variable. I will not walk through all of them; here is one representative example.
Let's look at ggbarstats() for OperatingSystems.
ggbarstats(data = dataset, main = Revenue, condition = OperatingSystems,
    sampling.plan = "jointMulti", title = "Revenue by Different Operating Systems",
    xlab = "Operating Systems", perc.k = 0.5, x.axis.orientation = "slant",
    ggstatsplot.layer = FALSE, messages = FALSE)
The overall p < 0.001, but some sub-categories have so few samples that "ns" appears above their bars. With so little data those categories carry no statistical value, so we merge the small sub-categories into one and look again.
dataset$OperatingSystems <- as.integer(dataset$OperatingSystems)
dataset$OperatingSystems[dataset$OperatingSystems == "5"] <- "other"
dataset$OperatingSystems[dataset$OperatingSystems == "6"] <- "other"
dataset$OperatingSystems[dataset$OperatingSystems == "7"] <- "other"
dataset$OperatingSystems <- as.factor(dataset$OperatingSystems)
ggbarstats(data = dataset, main = Revenue, condition = OperatingSystems,
    sampling.plan = "jointMulti", title = "Revenue by Different Operating Systems",
    xlab = "Operating Systems", perc.k = 0.5, x.axis.orientation = "slant",
    ggstatsplot.layer = FALSE, messages = FALSE)
Now it looks much better: everything is significant.
That concludes the preprocessing and exploratory analysis.
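As an aside, the manual recoding above can be done more generically with forcats (this is my own alternative sketch; the threshold of 100 observations is an assumption, not something from the original project):
# alternative: lump all OperatingSystems levels with fewer than 100 observations into "other"
library(forcats)
os_lumped <- fct_lump_min(as.factor(dataset$OperatingSystems), min = 100,
                          other_level = "other")
table(os_lumped)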
Feature Engineering
The ultimate goal of feature engineering is to improve model performance: with very few features you may need to construct second- or third-order features to enrich the data, and with too many features you may need dimensionality reduction. I did not do much feature engineering here, only a basic screening to drop useless features. The logic is to first use PCA to see how many features to keep, then select with the Boruta algorithm and stepAIC. (In the code below the processed data frame is called os; in the original project it was presumably just a copy of dataset after the cleaning above, e.g. os <- dataset.)
# PCA: since PCA can only be applied to numeric data, we use os[, c(1:9)]
pcdata <- os[, c(1:9)]
pclable <- ifelse(os$Revenue == "TRUE", "red", "blue")
pc <- princomp(os[, c(1:9)], cor = TRUE, scores = TRUE)
summary(pc)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.8387377 1.2923744 1.0134790 1.0020214 0.9697619
## Proportion of Variance 0.3756618 0.1855813 0.1141266 0.1115608 0.1044931
## Cumulative Proportion  0.3756618 0.5612431 0.6753697 0.7869305 0.8914236
##                            Comp.6     Comp.7    Comp.8      Comp.9
## Standard deviation     0.65008195 0.59319914 0.3510795 0.281849096
## Proportion of Variance 0.04695628 0.03909836 0.0136952 0.008826546
## Cumulative Proportion  0.93837989 0.97747825 0.9911735 1.000000000
plot(pc, type = "lines")
The PCA output shows that the first 7 components already explain more than 95% of the variance, so we will keep at least 7 numeric variables.
The Boruta Algorithm
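The cumulative proportion of variance can also be computed directly instead of being read off summary(pc):
# cumulative proportion of variance explained by the principal components
cumsum(pc$sdev^2) / sum(pc$sdev^2)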
set.seed(123)
boruta.train <- Boruta(Revenue ~ ., data = os, doTrace = 2, maxRuns = 15)
print(boruta.train)
# Boruta performed 14 iterations in 3.920271 mins. 13 attributes confirmed
# important: Administrative, Administrative_Duration, BounceRates, Browser,
# ExitRates and 8 more; 1 attribute confirmed unimportant: SpecialDay; 2
# tentative attributes left: OperatingSystems, Weekend. So SpecialDay can be
# deleted when we fit the model; OperatingSystems and Weekend need to be
# checked in other ways.
StepAIC
full.model <- glm(Revenue ~ . - SpecialDay, data = os, family = binomial(link = "logit"))
# Backward Stepwise AIC
stepback <- stepAIC(full.model, direction = "backward", steps = 3)
summary(stepback)
# OperatingSystems and Weekend are both above <none>; combined with the previous
# Boruta result, they can be deleted when we fit the model. Browser has the
# minimum AIC, so it can also be deleted. PCA shows we should keep 7 numeric
# variables when fitting the model, so two numeric variables should be removed.
# Informational_Duration and Administrative have the minimum AIC among the
# numeric variables, so remove these two.
Combining the three feature selection methods above, SpecialDay, OperatingSystems, Weekend, Browser, Informational_Duration and Administrative should be removed before modeling. Feel free to run the code yourself; since it takes a while, the results are written directly into the code comments.
Modeling
Now assemble the data we will model on.
os_modeldata <- os[, -c(1, 4, 10, 11, 12, 16)]
# summary(os_modeldata)
write.csv(os_modeldata, "os_modeldata.csv")
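One caveat: the numeric indices above depend on the column order of os. A name-based drop of the six variables selected in the previous section is less fragile; this is only a sketch, and it may not match the index-based selection exactly if os's columns were reordered at some point:
# sketch: drop the selected features by name instead of by position
drop_cols <- c("SpecialDay", "OperatingSystems", "Weekend", "Browser",
               "Informational_Duration", "Administrative")
os_modeldata <- os[, setdiff(names(os), drop_cols)]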
First split the data into training and test sets.
set.seed(123)
os_modeldata <- read.csv("os_modeldata.csv")
os_modeldata <- os_modeldata[, -1]
os_modeldata$Revenue <- as.factor(os_modeldata$Revenue)
inTrain <- createDataPartition(os_modeldata$Revenue, p = 0.9)[[1]]
Train <- os_modeldata[inTrain, ]
Test <- os_modeldata[-inTrain, ]
Then split the training set into train and validation folds by adding 10-fold CV labels. Some modeling functions can do CV internally, but since we will use several different modeling packages, we build the folds ourselves to avoid differences in how each package splits the data.
add_cv_cohorts <- function(dat, cv_K) {
    if (nrow(dat) %% cv_K == 0) {
        # if perfectly divisible
        dat$cv_cohort <- sample(rep(1:cv_K, each = (nrow(dat) %/% cv_K)))
    } else {
        # if not perfectly divisible
        dat$cv_cohort <- sample(c(rep(1:(nrow(dat) %% cv_K), each = (nrow(dat) %/% cv_K + 1)),
            rep((nrow(dat) %% cv_K + 1):cv_K, each = (nrow(dat) %/% cv_K))))
    }
    return(dat)
}
# add 10-fold CV labels to the training data
train_cv <- add_cv_cohorts(Train, 10)
# str(train_cv)
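For comparison, caret can also produce stratified fold labels directly; here is a sketch of the equivalent assignment (createFolds() is a standard caret function, the rest is my own wrapping):
# alternative: stratified 10-fold labels with caret
set.seed(123)
train_cv_alt <- Train
train_cv_alt$cv_cohort <- createFolds(Train$Revenue, k = 10, list = FALSE)
table(train_cv_alt$cv_cohort, train_cv_alt$Revenue)  # check fold sizes and class balance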
First, build a benchmark model: a logistic regression classifier.
train_cv_glm <- train_cv
glm.acc <- glm.f1 <- c()
train_cv_glm$Revenue <- ifelse(train_cv_glm$Revenue == "TRUE", 1, 0)
# str(train_cv_glm)
for (i in 1:10) {
    # Segment my data by fold using the which() function
    indexes <- which(train_cv_glm$cv_cohort == i)
    train <- train_cv_glm[-indexes, ]
    val <- train_cv_glm[indexes, ]
    # Model
    glm.model <- glm(Revenue ~ . - cv_cohort, data = train, family = binomial(link = "logit"))
    # predict
    glm.pred <- predict(glm.model, newdata = val, type = "response")
    glm.pred <- ifelse(glm.pred > 0.5, 1, 0)
    # evaluate
    glm.f1[i] <- F1_Score(val$Revenue, glm.pred, positive = "1")
    glm.acc[i] <- sum(glm.pred == val$Revenue)/nrow(val)
}
# F1 and ACC
glm.acc.train <- round(mean(glm.acc), 5) * 100
glm.f1.train <- round(mean(glm.f1), 5) * 100
# print(glm.cm <- table(glm.pred, val$Revenue))
paste("The accuracy by Logistic regression classifier by 10-fold CV in train data is",
    glm.acc.train, "%")
paste("The F1-score by Logistic regression classifier by 10-fold CV in train data is",
    glm.f1.train, "%")
# f1 = 0.50331
Next, build the machine learning models we will compare against, tuning hyperparameters with a grid search.
KNN
# knn() can't use factors as independent variables, so re-code the data
# (factors to dummy variables)
train_cv_knn <- as.data.frame(model.matrix(~., train_cv[, -11]))
train_cv_knn$Revenue <- train_cv$Revenue
train_cv_knn <- train_cv_knn[, -1]
# head(train_cv_knn)
knn.grid <- expand.grid(k = c(1:30))
knn.grid$acc <- knn.grid$f1 <- NA
knn.f1 <- knn.acc <- c()
for (k in 1:nrow(knn.grid)) {
    for (i in 1:10) {
        # Segment my data by fold using the which() function
        indexes <- which(train_cv_knn$cv_cohort == i)
        train <- train_cv_knn[-indexes, ]
        val <- train_cv_knn[indexes, ]
        # model and predict
        knn.pred <- knn(train[, -c(34, 35)], val[, -c(34, 35)], train$Revenue, k = k)
        # evaluate
        knn.f1[i] <- F1_Score(val$Revenue, knn.pred, positive = "TRUE")
        knn.acc[i] <- sum(knn.pred == val$Revenue)/nrow(val)
    }
    knn.grid$f1[k] <- mean(knn.f1)
    knn.grid$acc[k] <- mean(knn.acc)
    print(paste("finished with k =", k))
}
print(knn.cm <- table(knn.pred, val$Revenue))
knn.grid[which.max(knn.grid$f1), ]
# k = 7, f1 = 0.5484112, acc = 0.885042
SVM
svm.grid <- expand.grid(cost = c(0.1, 1, 10), gamma = seq(0.2, 1, 0.2))
svm.grid$acc <- svm.grid$f1 <- NA
svm.f1 <- svm.acc <- c()
for (k in 1:nrow(svm.grid)) {
    for (i in 1:10) {
        # Segment my data by fold using the which() function
        indexes <- which(train_cv$cv_cohort == i)
        train <- train_cv[-indexes, ]
        val <- train_cv[indexes, ]
        # model
        svm.model <- svm(Revenue ~ ., kernel = "radial", type = "C-classification",
            gamma = svm.grid$gamma[k], cost = svm.grid$cost[k], data = train[, -12])
        svm.pred <- predict(svm.model, val[, -12])
        # evaluate
        svm.f1[i] <- F1_Score(val$Revenue, svm.pred, positive = "TRUE")
        svm.acc[i] <- sum(svm.pred == val$Revenue)/nrow(val)
    }
    svm.grid$f1[k] <- mean(svm.f1)
    svm.grid$acc[k] <- mean(svm.acc)
    print(paste("finished with:", k))
}
print(svm.cm <- table(svm.pred, val$Revenue))
svm.grid[which.max(svm.grid$f1), ]
# cost = 1, gamma = 0.2, f1 = 0.5900601, acc = 0.8948096
Random Forest
rf.grid <- expand.grid(nt = seq(100, 500, by = 100), mrty = c(1, 3, 5, 7, 10))
rf.grid$acc <- rf.grid$f1 <- NA
rf.f1 <- rf.acc <- c()
for (k in 1:nrow(rf.grid)) {
    for (i in 1:10) {
        # Segment my data by fold using the which() function
        indexes <- which(train_cv$cv_cohort == i)
        train <- train_cv[-indexes, ]
        val <- train_cv[indexes, ]
        # model
        rf.model <- randomForest(Revenue ~ ., data = train[, -12],
            ntree = rf.grid$nt[k], mtry = rf.grid$mrty[k])
        rf.pred <- predict(rf.model, val[, -12])
        # evaluate
        rf.f1[i] <- F1_Score(val$Revenue, rf.pred, positive = "TRUE")
        rf.acc[i] <- sum(rf.pred == val$Revenue)/nrow(val)
    }
    rf.grid$f1[k] <- mean(rf.f1)
    rf.grid$acc[k] <- mean(rf.acc)
    print(paste("finished with:", k))
}
print(rf.cm <- table(rf.pred, val$Revenue))
rf.grid[which.max(rf.grid$f1), ]
# nt = 200, mtry = 3, f1 = 0.6330392, acc = 0.8960723
Neural Network
nndata <- Train
nndata$Revenue <- ifelse(nndata$Revenue == "TRUE", 1, 0)
train_x <- model.matrix(~., nndata[, -11])
train_x <- train_x[, -1]
train_y <- to_categorical(as.integer(as.matrix(array(nndata[, 11]))), 2)
model <- keras_model_sequential()
# defining the model's layers
model %>% layer_dense(units = 30, input_shape = 33, activation = "relu") %>%
    layer_dense(units = 40, activation = "relu") %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 60, activation = "relu") %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 30, activation = "relu") %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 2, activation = "sigmoid")
# defining the model's optimizer
model %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = c("accuracy"))
# Metrics: the performance evaluation module provides a series of functions for
# model performance evaluation. We use it to decide when the NN should stop
# training. The ultimate measure of performance is F1.
# Check which column in train_y is FALSE
table(train_y[, 1])  # the first column is FALSE
table(train_y[, 1])[[2]]/table(train_y[, 1])[[1]]
# Define a dictionary with the labels and their associated weights
weight = list(5.5, 1)  # the proportion of FALSE and TRUE is about 5.5:1
# fitting the model on the training dataset
model %>% fit(train_x, train_y, epochs = 50, validation_split = 0.2, batch_size = 512,
    class_weight = weight)
# after epoch = 20, val_loss no longer decreases and val_acc no longer increases,
# so the NN should stop at epoch = 20
Model Comparison
GLM
glmdata <- Train
glmdata$Revenue <- ifelse(glmdata$Revenue == "TRUE", 1, 0)
testglm <- Test
testglm$Revenue <- ifelse(testglm$Revenue == "TRUE", 1, 0)
glm.model.f <- glm(Revenue ~ ., data = glmdata, family = binomial(link = "logit"))
glm.pred.f <- predict(glm.model.f, newdata = Test, type = "response")
glm.pred.f <- ifelse(glm.pred.f > 0.5, 1, 0)
glm.f1.f <- F1_Score(testglm$Revenue, glm.pred.f, positive = "1")
paste("The F1-score by Logistic regression classifier in test data is", glm.f1.f)
KNN
knndata <- as.data.frame(model.matrix(~., Train[, -11]))
knndata <- knndata[, -1]
knntest <- as.data.frame(model.matrix(~., Test[, -11]))
knntest <- knntest[, -1]
knn.model.f.pred <- knn(knndata, knntest, Train$Revenue, k = 7)
knn.f1.f <- F1_Score(Test$Revenue, knn.model.f.pred, positive = "TRUE")
paste("The F1-score by KNN classifier in test data is", knn.f1.f)
SVM
svm.model.f <- svm(Revenue ~ ., kernel = "radial", type = "C-classification",
    gamma = 0.2, cost = 1, data = Train)
svm.pred.f <- predict(svm.model.f, Test)
svm.f1.f <- F1_Score(Test$Revenue, svm.pred.f, positive = "TRUE")
paste("The F1-score by SVM classifier in test data is", svm.f1.f)
Random Forests
rf.model.f <- randomForest(Revenue ~ ., data = Train, ntree = 200, mtry = 3)
rf.pred.f <- predict(rf.model.f, Test)
rf.f1.f <- F1_Score(Test$Revenue, rf.pred.f, positive = "TRUE")
paste("The F1-score by Random Forests classifier in test data is", rf.f1.f)
NN
nndata <- Train
nndata$Revenue <- ifelse(nndata$Revenue == "TRUE", 1, 0)
train_x <- model.matrix(~., nndata[, -11])
train_x <- train_x[, -1]
train_y <- to_categorical(as.integer(as.matrix(array(nndata[, 11]))), 2)
model <- keras_model_sequential()
# defining the model's layers
model %>% layer_dense(units = 30, input_shape = 33, activation = "relu") %>%
    layer_dense(units = 40, activation = "relu") %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 60, activation = "relu") %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 30, activation = "relu") %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 2, activation = "sigmoid")
# defining the model's optimizer
model %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = c("accuracy"))
weight = list(5.5, 1)
model %>% fit(train_x, train_y, epochs = 20, batch_size = 512, class_weight = weight)
# test data
testnn <- Test
testnn$Revenue <- ifelse(testnn$Revenue == "TRUE", 1, 0)
test_x <- model.matrix(~., testnn[, -11])
test_x <- test_x[, -1]
nn.pred <- model %>% predict(test_x)
nn.pred <- as.data.frame(nn.pred)
nn.pred$label <- NA
nn.pred$label <- ifelse(nn.pred$V2 > nn.pred$V1, "TRUE", "FALSE")
nn.pred$label <- as.factor(nn.pred$label)
nn.f1 <- F1_Score(Test$Revenue, nn.pred$label, positive = "TRUE")
paste("The F1-score by Neural network in test data is", nn.f1)
Comparing the results, RF and NN perform best. Finally, look at their confusion matrices.
# RF
print(rf.cm.f <- table(rf.pred.f, Test$Revenue))
##
## rf.pred.f FALSE TRUE
##     FALSE   987   74
##     TRUE     50  116
# NN
print(nn.cm.f <- table(nn.pred$label, Test$Revenue))
##
##         FALSE TRUE
##   FALSE   980   69
##   TRUE     57  121
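If you also want precision, recall, and related metrics next to the raw confusion matrix, caret's confusionMatrix() reports them in one call (an optional extra, not part of the original write-up):
# precision/recall view of the random forest predictions on the test set
confusionMatrix(data = rf.pred.f, reference = Test$Revenue,
                positive = "TRUE", mode = "prec_recall")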
That wraps up this article on the R data modeling workflow. For more on data modeling in R, search WalkonNet's earlier articles or browse the related posts below, and please keep supporting WalkonNet.