Pandas: Reading Excel/CSV Files, Filtering Duplicates, and Handling Missing Values

2024-05-31 10:18 | Source: compiled from the web

The data in the Excel file is shown in the printed output of the read step below.

First, import the pandas module:

    import pandas as pd

Read the Excel file:

    # Read the Excel file (reading the legacy .xls format requires the xlrd package)
    file_01 = pd.read_excel("data.xls")
    print(file_01, '\n')
    print('Shape:', file_01.shape, '\n', 'Type of file_01:', type(file_01))

The result is shown below. Column names are kept as they appear in the file: 时间 is the time, 变量1 to 变量5 are variables 1 to 5, and 输出应变量 is the output (response) variable.

        Unnamed: 0                   时间  变量1  变量2  变量3   变量4  变量5  输出应变量
    0            1  2021-07-21 00:00:00    1    0    1   1.0    1   94.354
    1            2  2021-07-22 00:00:00    1    0    1   1.0    2  118.340
    2            3  2021-07-23 00:00:00    1    0    1   1.0    3   93.791
    3            4  2021-07-24 00:00:00    1    0    1   2.0    1   86.593
    4            5  2021-07-25 17:55:00    1    0    1   2.0    2  100.280
    5            6  2021-07-26 17:55:00    1    0    1   2.0    3   94.121
    6            7  2021-07-27 17:55:00    1    0    1   3.0    1      NaN
    7            8  2021-07-28 17:55:00    1    0    1   3.0    2  113.920
    8            9  2021-07-29 17:55:00    1    0    1   3.0    3   95.566
    9           10  2021-07-30 17:55:00    1    0    1   NaN    1  100.870
    10          11  2021-07-31 17:55:00    1    0    1   4.0    2  116.130
    11          12  2021-08-01 17:55:00    1    0    1   4.0    3  103.830
    12          13  2021-08-02 17:55:00    1    0    1   4.0    2  124.870
    13          14  2021-08-03 17:55:00    1    0    1   4.0    3   91.475
    14          15  2021-08-04 17:55:00    1    0    2   4.0    1   52.083
    15          16  2021-08-05 17:55:00    1    0    2   4.0    2   45.912
    16          17  2021-08-06 17:55:00    1    0    2   4.0    3   48.534
    17          18  2021-08-07 17:55:00    1    0    2   4.0    1   21.255
    18          19  2021-08-08 17:55:00    1    0    2   4.0    2   32.191
    19          20  2021-08-09 17:55:00    1    0    2   4.0    3   24.832

    Shape: (20, 8)
     Type of file_01: <class 'pandas.core.frame.DataFrame'>

As the output shows, file_01 is a DataFrame, which is essentially a table. Missing values in the table are represented by NaN.
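The title also mentions CSV files, which are read in much the same way with pd.read_csv(). The following is a minimal sketch rather than part of the original walkthrough: the CSV text and the column names time, var1, and output are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file such as "data.csv"
# (these rows and column names are illustrative assumptions)
csv_text = """time,var1,output
2021-07-21,1,94.354
2021-07-22,1,
2021-07-23,2,93.791
"""

# read_csv accepts a file path or any file-like object;
# parse_dates converts the named column to datetime values
csv_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])

print(csv_df)
print(csv_df.shape, type(csv_df))
```

The empty field in the second data row is read as NaN, just like an empty cell in the Excel file.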

Preprocessing the data means checking it for duplicate and missing values, then dropping or filling in the missing entries.

To check whether any rows are exact copies of each other, use the duplicated() method:

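    # Check for duplicate rows: the result is a boolean Series with one entry per row,
    # True if the row repeats an earlier row and False otherwise
    duplicated_data = file_01.duplicated()
    print(duplicated_data)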
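Because duplicated() returns one boolean per row, it is often easier to summarize the result than to read it row by row. The sketch below uses a small made-up DataFrame (the column names var1 and output are illustrative, not the article's data) to show the default behavior and the keep=False option.

```python
import pandas as pd

# Illustrative data: rows 0, 1 and 3 are identical
df = pd.DataFrame({
    "var1": [1, 1, 2, 1],
    "output": [94.354, 94.354, 86.593, 94.354],
})

# Default keep="first": the first occurrence is False, later repeats are True
dup_mask = df.duplicated()
print(dup_mask.tolist())

# Summaries: is there any duplicate at all, and how many repeated rows?
has_duplicates = bool(dup_mask.any())
n_duplicates = int(dup_mask.sum())
print(has_duplicates, n_duplicates)

# keep=False marks every member of a duplicated group, including the first
all_members = df.duplicated(keep=False)
print(all_members.tolist())
```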

To remove duplicate rows, use the drop_duplicates() method:

    # Remove duplicate rows (returns a new DataFrame; file_01 itself is unchanged)
    print(file_01.drop_duplicates())
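drop_duplicates() can also compare only some of the columns and choose which copy to keep. The following sketch assumes a small illustrative DataFrame with made-up column names. It uses the subset and keep parameters, then reset_index(drop=True) to renumber the remaining rows.

```python
import pandas as pd

# Illustrative data: row 1 is an exact copy of row 0
df = pd.DataFrame({
    "var1": [1, 1, 2, 2],
    "var5": [1, 1, 3, 1],
    "output": [94.354, 94.354, 86.593, 52.083],
})

# Compare whole rows; the original index labels of the kept rows are preserved
deduped = df.drop_duplicates()
print(deduped)

# Compare only the "var1" column and keep the last row of each group,
# then renumber the index from 0
by_var1 = df.drop_duplicates(subset=["var1"], keep="last").reset_index(drop=True)
print(by_var1)
```

Neither call modifies df. Assign the result to a variable to keep it.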

To get the distinct values of a single column, use the unique() method, which returns the column's values with duplicates removed:

    print(file_01['输出应变量'].unique())  # the column's values with duplicates removed
    print(type(file_01['输出应变量']))
    >>> [ 94.354 118.34   93.791  86.593 100.28   94.121     nan 113.92   95.566
     100.87  116.13  103.83  124.87   91.475  52.083  45.912  48.534  21.255
      32.191  24.832]
    <class 'pandas.core.series.Series'>

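Note that a single column is a Series, which should not be confused with a plain Python list. NumPy's unique() function also returns the distinct values, but it returns them sorted rather than in order of first appearance. When a list mixes numbers, strings, and NaN, NumPy first converts every element to a string, so the result is sorted as text: digit strings come first, then letters, and 'nan' is placed alphabetically (it comes last here only because it sorts after 'b').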

    import numpy as np
    # Define a list that mixes strings, numbers and NaN
    lst = ['aa', np.nan, 9, 2, 1, 3, 5, 2, 'a', 'b', 'aa']
    print(np.unique(lst))
    print(type(lst))
    >>> ['1' '2' '3' '5' '9' 'a' 'aa' 'b' 'nan']
    <class 'list'>
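The difference in ordering is easier to see on purely numeric data. The sketch below uses a made-up Series and contrasts Series.unique(), which keeps the order of first appearance and includes NaN, with np.unique(), which sorts. NaN is dropped before calling np.unique() here so that the sorted result does not depend on how a particular NumPy version orders NaN.

```python
import numpy as np
import pandas as pd

# Illustrative values with repeats and one missing entry
s = pd.Series([3.0, 1.0, np.nan, 3.0, 2.0, 1.0])

# pandas: distinct values in order of first appearance, NaN included
first_seen = s.unique()
print(first_seen)

# NumPy: distinct values in sorted order (NaN removed beforehand)
sorted_vals = np.unique(s.dropna().to_numpy())
print(sorted_vals)

# nunique() counts distinct values and ignores NaN by default
n_distinct = s.nunique()
print(n_distinct)
```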

To check for missing values, use the isnull() method:

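    # isnull() marks every missing entry as True.
    # Chaining any() then reduces along an axis: axis=0 gives one result per column,
    # axis=1 gives one result per row.
    # any() returns True if the column (or row) contains at least one missing value,
    # and False otherwise.
    print(file_01.isnull().any(axis=0))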

Output:

    # Result
    Unnamed: 0    False
    时间            False
    变量1           False
    变量2           False
    变量3           False
    变量4            True
    变量5           False
    输出应变量         True
    dtype: bool
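Beyond a yes/no answer per column, it is often useful to know how many values are missing and which rows they are in. The sketch below shows one way to do that on a small made-up DataFrame (the column names are illustrative), using isnull().sum() and a boolean row filter.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in "var4" and one in "output"
df = pd.DataFrame({
    "var4": [1.0, np.nan, 3.0],
    "var5": [1, 2, 3],
    "output": [94.354, 118.34, np.nan],
})

# True counts as 1, so summing the boolean mask gives the number of
# missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# any(axis=1) gives one flag per row; use it to select the incomplete rows
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```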

To drop missing values, use the dropna() method:

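    # axis: drop rows (axis=0) or columns (axis=1)
    # how: 'any' drops a row/column that contains at least one missing value;
    #      'all' drops it only when every value in it is missing
    # subset: restrict the check to specific columns
    file_02 = file_01.dropna(how='any', axis=0)
    print(file_02)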

Result after dropping (rows 6 and 9, which contained NaN, are gone):

        Unnamed: 0                   时间  变量1  变量2  变量3   变量4  变量5  输出应变量
    0            1  2021-07-21 00:00:00    1    0    1   1.0    1   94.354
    1            2  2021-07-22 00:00:00    1    0    1   1.0    2  118.340
    2            3  2021-07-23 00:00:00    1    0    1   1.0    3   93.791
    3            4  2021-07-24 00:00:00    1    0    1   2.0    1   86.593
    4            5  2021-07-25 17:55:00    1    0    1   2.0    2  100.280
    5            6  2021-07-26 17:55:00    1    0    1   2.0    3   94.121
    7            8  2021-07-28 17:55:00    1    0    1   3.0    2  113.920
    8            9  2021-07-29 17:55:00    1    0    1   3.0    3   95.566
    10          11  2021-07-31 17:55:00    1    0    1   4.0    2  116.130
    11          12  2021-08-01 17:55:00    1    0    1   4.0    3  103.830
    12          13  2021-08-02 17:55:00    1    0    1   4.0    2  124.870
    13          14  2021-08-03 17:55:00    1    0    1   4.0    3   91.475
    14          15  2021-08-04 17:55:00    1    0    2   4.0    1   52.083
    15          16  2021-08-05 17:55:00    1    0    2   4.0    2   45.912
    16          17  2021-08-06 17:55:00    1    0    2   4.0    3   48.534
    17          18  2021-08-07 17:55:00    1    0    2   4.0    1   21.255
    18          19  2021-08-08 17:55:00    1    0    2   4.0    2   32.191
    19          20  2021-08-09 17:55:00    1    0    2   4.0    3   24.832
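The how and subset parameters change which rows are dropped. The following sketch compares three settings on a small made-up DataFrame (the column names var4 and output are illustrative only).

```python
import numpy as np
import pandas as pd

# Illustrative data: row 1 has one missing value, row 2 is entirely missing
df = pd.DataFrame({
    "var4": [1.0, np.nan, np.nan],
    "output": [94.354, 118.34, np.nan],
})

# how="any": drop a row if any of its values is missing
drop_any = df.dropna(how="any")

# how="all": drop a row only if all of its values are missing
drop_all = df.dropna(how="all")

# subset: only look at the listed columns when deciding what to drop
drop_by_output = df.dropna(subset=["output"])

print(drop_any.index.tolist())
print(drop_all.index.tolist())
print(drop_by_output.index.tolist())
```

As in the result above, the surviving rows keep their original index labels. Chain reset_index(drop=True) if a continuous index is needed.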

To fill in missing values, pass the replacement to the value parameter of fillna(). The replacement can be a number or a string:

    # Fill missing values with 0
    file_03 = file_01.fillna(value=0)
    print(file_03)
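A single constant is not always the right replacement for every column. As a sketch of one alternative, value can also be a dictionary that maps column names to their own fill values. The DataFrame below is made up for illustration, and filling with the column mean is just one possible choice, not a recommendation from the original text.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value per column
df = pd.DataFrame({
    "var4": [1.0, np.nan, 3.0],
    "output": [90.0, np.nan, 110.0],
})

# Per-column fill values: a constant for "var4", the column mean for "output"
# (mean() skips NaN by default, so the mean here is 100.0)
fill_values = {"var4": 0, "output": df["output"].mean()}
filled = df.fillna(value=fill_values)

print(filled)
```

fillna() returns a new DataFrame, so df still contains its NaN values afterwards.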
