Pandas Project in Practice 1

2023-09-21 13:31 | Source: compiled from the web

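Contents

Analysis of a Million Hollywood Movie Reviews
Pandas topics covered
Task requirements
1. Import the required libraries
2. Load the data
    Reading the users
    Reading the movies
    Reading the ratings
3. Merging the data
4. Movies with the highest average rating
5. Movie ratings by gender
6. Movies the genders disagree on most
7. Most-rated (most popular) movies
8. Most controversial movies by age group
9. Number of raters and rating preferences per age group
10. Refining the analysis for more reliable results
    10.1 Average rating by gender with a minimum number of ratings
    10.2 Highest-rated movies with a minimum number of ratings
Summary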

Analysis of a Million Hollywood Movie Reviews

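After an introduction to Pandas, the best way to consolidate what you have learned is to work through a few simple projects. This one analyzes a dataset of about one million Hollywood movie reviews. Let's get started.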

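Pandas topics covered

- Reading data
- Data integration
- Pivot tables
- Aggregation and group operations
- Binned statistics
- Data visualization

Task requirements

- Load and integrate the data
- Movies with the highest average rating
- Average movie rating by gender
- Movies the genders disagree on most
- Most-rated (most popular) movies
- Most controversial movies by age group
- Refinement and summary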

All data used in this article is available here:

Link: https://pan.baidu.com/s/1KBphl8o-YEFXVp8N1IlsgA Extraction code: 8daa

Environment: Jupyter Notebook

1. Import the required libraries

    import numpy as np
    import pandas as pd
    # plotting
    import matplotlib.pyplot as plt
    %matplotlib inline

2. Load the data

Reading the users

According to the README, the USERS data has the following format:

USERS FILE DESCRIPTION

User information is in the file “users.dat” and is in the following format:

UserID::Gender::Age::Occupation::Zip-code

The column names used below do not have to match the README exactly; any names that are clear to you will do.

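    # Press Shift + Tab to see the function's help
    # Create the list of column names
    labels = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']
    # Pass, in order: the path, the separator, no header row, and the column names
    users = pd.read_csv('./users.dat', sep='::', header=None, names=labels)
    # Check the dimensions after reading
    users.shape

    (6040, 5)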

If red output appears, treat it as a warning log; there is no need to panic.
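The red output is most likely pandas' ParserWarning: a multi-character separator such as '::' is not supported by the default C parser, so pandas falls back to the Python engine and says so. The following is a minimal sketch of how the warning can be avoided by passing engine='python' explicitly. It assumes a small in-memory sample (rows copied from the head() output in this article) instead of the real file, and the dtype for Zip-code is an addition made here to keep leading zeros, not part of the original code.

```python
import io
import pandas as pd

# Sample rows in the users.dat layout, copied from the head() output.
# An in-memory buffer stands in for './users.dat' (an assumption for this sketch).
sample = (
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "3::M::25::15::55117\n"
    "4::M::45::7::02460\n"
)

labels = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

users_sample = pd.read_csv(
    io.StringIO(sample),
    sep='::',                   # multi-character separator
    header=None,                # the file has no header row
    names=labels,               # assign the column names
    engine='python',            # chosen explicitly, so no fallback warning
    dtype={'Zip-code': str},    # assumption: keep zip codes as text
)

print(users_sample.shape)
print(users_sample)
```

With the real data, replacing the buffer with the file path should behave the same way, only with more rows.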

    users.head()

       UserID Gender  Age  Occupation Zip-code
    0       1      F    1          10    48067
    1       2      M   56          16    70072
    2       3      M   25          15    55117
    3       4      M   45           7    02460
    4       5      M   25          20    55455

Reading the movies

MOVIES FILE DESCRIPTION

Movie information is in the file “movies.dat” and is in the following format:

MovieID::Title::Genres

    labels2 = ['MovieID', 'Title', 'Genres']
    movie = pd.read_csv('./movies.dat', sep='::', header=None, names=labels2)
    # display() shows both results at once
    display(movie.head(), movie.shape)

       MovieID                               Title                        Genres
    0        1                    Toy Story (1995)   Animation|Children’s|Comedy
    1        2                      Jumanji (1995)  Adventure|Children’s|Fantasy
    2        3             Grumpier Old Men (1995)                Comedy|Romance
    3        4            Waiting to Exhale (1995)                  Comedy|Drama
    4        5  Father of the Bride Part II (1995)                        Comedy

    (3883, 3)

Reading the ratings

RATINGS FILE DESCRIPTION

All ratings are contained in the file “ratings.dat” and are in the following format:

UserID::MovieID::Rating::Timestamp

    labels3 = ['UserID', 'MovieID', 'Rating', 'Time']
    ratings = pd.read_csv('./ratings.dat', sep='::', header=None, names=labels3)
    # display() shows both results at once
    display(ratings.head(), ratings.shape)

Reading a million rows may take a moment...

       UserID  MovieID  Rating       Time
    0       1     1193       5  978300760
    1       1      661       3  978302109
    2       1      914       3  978301968
    3       1     3408       4  978300275
    4       1     2355       5  978824291

    (1000209, 4)

3. Merging the data

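Because the data is spread across three tables, it has to be integrated. First, display the three tables together to see what columns each one has.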

display(users.head(),movie.head(),ratings.head())

       UserID Gender  Age  Occupation Zip-code
    0       1      F    1          10    48067
    1       2      M   56          16    70072
    2       3      M   25          15    55117
    3       4      M   45           7    02460
    4       5      M   25          20    55455

       MovieID Title Genres
    0
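The three tables share key columns: ratings and users both have UserID, and ratings and movie both have MovieID. A minimal sketch of one common way to integrate them, assuming pd.merge with its default inner join, is shown here. The frames are tiny stand-ins for the real tables, and the rating values are invented purely for illustration.

```python
import pandas as pd

# Tiny stand-ins for the three tables (column names follow this article).
users = pd.DataFrame({
    'UserID': [1, 2],
    'Gender': ['F', 'M'],
    'Age': [1, 56],
})
movie = pd.DataFrame({
    'MovieID': [1, 2],
    'Title': ['Toy Story (1995)', 'Jumanji (1995)'],
})
# Rating values below are made up for this sketch.
ratings = pd.DataFrame({
    'UserID': [1, 1, 2],
    'MovieID': [1, 2, 1],
    'Rating': [5, 3, 4],
})

# Join ratings to users on UserID, then join the result to movie on MovieID.
# how='inner' is the default: only rows with matching keys are kept.
merged = pd.merge(pd.merge(ratings, users, on='UserID'), movie, on='MovieID')

print(merged.shape)
print(merged)
```

Each rating row ends up carrying the rater's attributes and the movie's title, which is the shape that grouping by gender, age, or title needs.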
