A Hands-On Walkthrough of a Complete R Data Analysis and Modeling Workflow

2023-08-23 10:05 | Source: compiled from the web

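Contents: Intro, Project background, Preparation, Data description, Data cleaning, Pre-analysis and preprocessing, Numeric data, Categorical data, Features, Modeling, Model comparison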

Intro

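I was recently tidying up my data analysis workflow and found some code I wrote a while back, so I'm sharing it here. This was a project from my student days. My lack of experience at the time caused a few problems, which I'll go through bit by bit later so you can avoid the same pitfalls. This post includes some explanation, and a few places may not be clear enough; feel free to leave a comment to discuss.

Besides sharing, this post is also a retrospective on that old project. I'm still using R (it is my favorite language, after all). If there is demand for Python, I'll cover it in other projects later.

This post contains code for data import, cleaning, visualization, feature engineering, and modeling; pick whichever parts you need as a reference.

All the code, comments, and data are here: https://download.csdn.net/download/zhaotian151/85035590 This is a paid download link. If you find the article useful, feel free to support it; of course, you can also copy the code piece by piece from this article.

I don't recommend that students simply copy this as homework. Every project is valuable experience that you will genuinely use once you start working, so make the most of your studies.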

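Project background

The data comes from Online Shopper's Intention and contains 12,330 records with 10 numerical features and 8 categorical features. 'Revenue' is used as the label for modeling. The goal is to build a model that predicts Revenue from this data. Some readers have said the data was removed from Kaggle; the CSV has been uploaded to the download resource and can be purchased together with the code if needed.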

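Preparation

First, download R and its more comfortable front end, RStudio, as follows:

Installing R and RStudio: skip this part if you have used R before. Installing R is simple: download it directly from the official website.

Then download RStudio from its download page. It is essentially R with power-ups: the interface is much friendlier than plain R and it has many helper features.

# Note: RStudio runs on top of R, so R must be installed before RStudio can be installed and used.

Once everything is installed, run the following code to load the packages.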

    setwd("~/Desktop/STAT5003/Ass") # set the project folder; this is also where your data CSV lives
    # install.packages("xxx")  # if a package below is not installed yet, install it this way first, then load it

    # the following packages are for the EDA part
    library(GGally)
    library(ggcorrplot)
    library(psych)
    library(ggstatsplot)
    library(ggplot2)
    library(grid)

    # the following packages are for the Model part
    library(MASS)
    library(Boruta) # Feature selection with the Boruta algorithm
    library(caret)
    library(MLmetrics)
    library(class)
    library(neuralnet)
    library(e1071)
    library(randomForest)
    library(keras)
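Since the list is long, it can help to check which packages are missing before calling library(). The following is a minimal sketch, not part of the original project: `missing_packages` is a hypothetical helper name, and the package vector is simply copied from the library() calls above.

```r
# Packages used in this walkthrough (copied from the library() calls)
pkgs <- c("GGally", "ggcorrplot", "psych", "ggstatsplot", "ggplot2", "grid",
          "MASS", "Boruta", "caret", "MLmetrics", "class", "neuralnet",
          "e1071", "randomForest", "keras")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that nothing is downloaded by accident. Note that keras also needs a Python backend, which this check does not cover.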

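That's quite a few packages. For installing keras, see my earlier article (A Detailed Guide to Keras-Based MLP Neural Networks in R, https://blog.csdn.net/zhaotian151/article/details/84305440).

Data description

First, download the data to your computer, then import it into R with the following code.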
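As a rough illustration of the import step and a first look at the data, here is a minimal sketch. It does not use the real file: it writes a three-row stand-in CSV to a temporary path so the code runs anywhere. The column names other than Revenue are assumptions based on the public dataset, and the values are made up.

```r
# Stand-in for the real file: a tiny CSV with a handful of columns.
# With the real data, point csv_path at your downloaded file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Administrative,BounceRates,Month,VisitorType,Weekend,Revenue",
  "0,0.20,Feb,Returning_Visitor,FALSE,FALSE",
  "2,0.00,Mar,New_Visitor,TRUE,TRUE",
  "1,0.05,Mar,Returning_Visitor,FALSE,FALSE"
), csv_path)

dataset <- read.csv(csv_path, stringsAsFactors = FALSE)

str(dataset)               # column types
dim(dataset)               # rows and columns
colSums(is.na(dataset))    # missing values per column
table(dataset$Revenue)     # class balance of the label
```

Checking the class balance of Revenue early is worthwhile, because the imbalance drives the class-weighting and F1 choices later on.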

    dataset <- ...

    ...

    # if not perfectly divisible
    dat$cv_cohort <- ...

    for (i in 1:10) {
      # Segment my data by fold using the which() function
      indexes <- ...
      ...
    }

    ...
      # Segment my data by fold using the which() function
      indexes <- ...
    ...
      # Segment my data by fold using the which() function
      indexes <- ...
    ...

    ... %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = c("accuracy"))
    # Metrics: The performance evaluation module provides a series of functions
    # for model performance evaluation. We use it to determine when the NN
    # should stop training. The ultimate measure of performance is F1.

    # Check which column in train_y is FALSE
    table(train_y[, 1]) # the first column is FALSE
    table(train_y[, 1])[[2]]/table(train_y[, 1])[[1]]

    # Define a dictionary with your labels and their associated weights
    weight = list(5.5, 1) # the proportion of FALSE and TRUE is about 5.5:1

    # fitting the model on the training dataset
    model %>% fit(train_x, train_y, epochs = 50, validation_split = 0.2, batch_size = 512, class_weight = weight)
    # after epoch = 20, val_loss does not decrease and val_acc does not increase, so NN
    # should stop at epoch = 20

Model comparison
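The `dat$cv_cohort` and `which()` fragments in the code above point to 10-fold cross-validation. The following is a minimal sketch of how such a fold assignment might look, not the original code: the toy data frame, the seed, and the use of `rep(..., length.out = )` to handle row counts that are not divisible by 10 are all assumptions.

```r
set.seed(1)

# Toy data standing in for the real training set (assumption for illustration)
dat <- data.frame(x = rnorm(103), y = rbinom(103, 1, 0.3))

k <- 10
# rep(..., length.out = ) copes with row counts that are not divisible by k;
# sample() shuffles the labels so fold membership is random
dat$cv_cohort <- sample(rep(1:k, length.out = nrow(dat)))

fold_sizes <- integer(k)
for (i in 1:k) {
  # Segment the data by fold using which()
  indexes <- which(dat$cv_cohort == i)
  test_fold  <- dat[indexes, ]
  train_fold <- dat[-indexes, ]
  fold_sizes[i] <- nrow(test_fold)
  # fit a model on train_fold and score it on test_fold here
}
print(fold_sizes)
```

Each row lands in exactly one test fold, so the fold sizes sum to the number of rows and differ by at most one.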

GLM
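For orientation, here is a minimal sketch of a logistic-regression baseline with base R's glm(), scored by F1 because the text names F1 as the ultimate measure of performance. It is not the original code: the data is synthetic, the two predictor names are borrowed from the public dataset as an assumption, the 0.5 threshold is an assumption, and `f1_score` is a hypothetical helper written by hand rather than a library call.

```r
set.seed(42)

# Synthetic stand-in data: two numeric predictors and a logical label
n <- 500
toy <- data.frame(PageValues = rexp(n, 0.1), ExitRates = runif(n, 0, 0.2))
p <- plogis(-2 + 0.15 * toy$PageValues - 5 * toy$ExitRates)
toy$Revenue <- rbinom(n, 1, p) == 1

# 80/20 train/test split
train_idx <- sample(n, 400)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Logistic regression
fit <- glm(Revenue ~ PageValues + ExitRates, data = train, family = binomial)

prob <- predict(fit, newdata = test, type = "response")
pred <- prob > 0.5   # assumed decision threshold

# Hypothetical helper: F1 for the positive (TRUE) class
f1_score <- function(actual, predicted) {
  tp <- sum(predicted & actual)
  fp <- sum(predicted & !actual)
  fn <- sum(!predicted & actual)
  if (tp == 0) return(0)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

print(f1_score(test$Revenue, pred))
```

On imbalanced data, accuracy can look good while the positive class is mostly missed, which is why F1 on the TRUE class is the more informative number here.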

    glmdata <- ...

    ...

    ... %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = c("accuracy"))
    weight = list(5.5, 1)
    model %>% fit(train_x, train_y, epochs = 20, batch_size = 512, class_weight = weight)

    # test data
    testnn <- ...
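One note on `class_weight`: the keras documentation for fit() describes it as a named list mapping class indices to weights, and the usual way to counter imbalance is to give the rarer class the larger weight. The sketch below derives such a list from the label counts. It is an alternative to the `list(5.5, 1)` used in the code, not a description of what the original intended; the toy labels and the mapping of index "0" to FALSE (following the comment that the first column of train_y is FALSE) are assumptions.

```r
# Toy labels with roughly the 5.5:1 imbalance mentioned in the code comments
labels <- c(rep(FALSE, 550), rep(TRUE, 100))

n_false <- sum(!labels)
n_true  <- sum(labels)
ratio   <- n_false / n_true   # majority : minority

# Named list keyed by class index: "0" = FALSE, "1" = TRUE (assumed mapping).
# The rarer TRUE class gets the larger weight.
class_weight <- list("0" = 1, "1" = ratio)
print(class_weight)

# With keras this would be passed as:
# model %>% fit(train_x, train_y, epochs = 20, batch_size = 512,
#               class_weight = class_weight)
```

Computing the ratio from the training labels, rather than hard-coding 5.5, keeps the weights correct if the split or the data changes.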
