E-commerce Data Mining Based on Association Rule Algorithms
Hi everyone, I'm Peter~

This article applies association rule mining, a machine learning method, to sales data for IC electronics products. It covers:

- Data preprocessing: deduplication, missing-value handling, time-field processing, and binning user ages
- Word clouds: how users' preferences differ across brands (brand) and categories (category_code)
- Association rule mining: associations mined separately by sex and by brand

Keywords: e-commerce, association rules, machine learning, word clouds

Basic data information

Importing the data

In 1:
    import pandas as pd
    import numpy as np

    # Show all columns
    # pd.set_option('display.max_columns', None)
    # Show all rows
    # pd.set_option('display.max_rows', None)
    # Set the display width of values to 100 (the default is 50)
    # pd.set_option('display.max_colwidth', 100)

    import time
    import os
    from datetime import datetime

    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    # Make Chinese characters and the minus sign render correctly
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    import missingno as ms

    from pyecharts.globals import CurrentConfig, OnlineHostType
    from pyecharts import options as opts  # configuration options
    from pyecharts.charts import Bar, Scatter, Pie, Line, Map, WordCloud, Grid, Page  # chart classes
    from pyecharts.commons.utils import JsCode
    from pyecharts.globals import ThemeType, SymbolType

    import plotly.express as px
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots  # subplots

    import jieba
    from snownlp import SnowNLP

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import MinMaxScaler

    import warnings
    warnings.filterwarnings("ignore")

In 2:
    # The data contains Chinese text, so specify the encoding when reading
    df = pd.read_csv(
        "ic_sale.csv",
        encoding="gb18030",  # must be specified on Windows; not needed on Mac
        converters={"order_id": str, "product_id": str, "category_id": str, "user_id": str}
    )
    df.head()

Basic information

In 3:
    # 1. Shape of the data
    df.shape

Out3:
    (564169, 11)

In 4:
    # 2. Column data types
    df.dtypes

Out4:
    event_time        object
    order_id          object
    product_id        object
    category_id       object
    category_code     object
    brand             object
    price            float64
    user_id           object
    age                int64
    sex               object
    local             object
    dtype: object

In 5:
    # 3. Descriptive statistics
    df.describe()

Out5:
                   price            age
    count  564169.000000  564169.000000
    mean      208.269324      33.184388
    std       304.559875      10.122088
    min         0.000000      16.000000
    25%        23.130000      24.000000
    50%        87.940000      33.000000
    75%       277.750000      42.000000
    max     18328.680000      50.000000

In 6:
    # 4. Total number of distinct customers
    df["user_id"].nunique()

Out6:
    6908

Data preprocessing

Deduplication

In 7:
    df.shape  # before deduplication

Out7:
    (564169, 11)

In 8:
    df.drop_duplicates(ignore_index=True, inplace=True)

In 9:
    df.shape  # after deduplication

Out9:
    (561214, 11)

Feature information

In 10:
    # For each column: distinct values, share of missing values (%),
    # share of the most frequent value (%), and dtype
    stats = []
    for col in df.columns:
        stats.append((
            col,
            df[col].nunique(),
            round(df[col].isnull().sum() * 100 / df.shape[0], 3),
            round(df[col].value_counts(normalize=True, dropna=False).values[0] * 100, 3),
            df[col].dtype,
        ))

    stats_df = pd.DataFrame(stats, columns=['feature', 'n_unique', 'missing_pct', 'top_value_pct', 'dtype'])
    stats_df.sort_values('missing_pct', ascending=False, ignore_index=True)

Missing-value handling

In 11:
    df = df[df["price"] > 0]  # keep only rows with a positive price

In 12:
    df.isnull().sum()

Out12:
    event_time            0
    order_id              0
    product_id            0
    category_id           0
    category_code    128662
    brand             27132
    price                 0
    user_id               0
    age                   0
    sex                   0
    local                 0
    dtype: int64

In 13:
    ms.bar(df, color="red")  # visualize missing values
    plt.show()

Finally, fill the missing values directly with the string "missing":

In 14:
    df.fillna("missing", inplace=True)  # fill with "missing"

Time-field processing

In 15:
    df["event_time"].value_counts()

Out15:
    1970-01-01 00:33:40 UTC    1302
    2020-04-09 16:30:01 UTC      51
    2020-04-08 16:30:01 UTC      49
    2020-04-06 16:30:01 UTC      46
    2020-04-05 16:30:01 UTC      44
                               ...
    2020-07-28 13:10:35 UTC       1
    2020-07-28 13:10:21 UTC       1
    2020-07-28 13:09:37 UTC       1
    2020-07-28 13:08:23 UTC       1
    2020-08-13 17:16:24 UTC       1
    Name: event_time, Length: 389813, dtype: int64

The result above shows that 1970-01-01 00:33:40 is the most frequent value; it is effectively the placeholder for a missing timestamp.

In 16:
    # Drop the trailing " UTC"
    df["event_time"] = df["event_time"].apply(lambda x: x[:19])

    # Convert the type: string -> datetime in the given format
    df['event_time'] = pd.to_datetime(df['event_time'], format="%Y-%m-%d %H:%M:%S")

    # Extract several time-related fields
    # df['month'] = df['event_time'].dt.month
    # df['day'] = df['event_time'].dt.day
    # df['dayofweek'] = df['event_time'].dt.dayofweek
    # df['hour'] = df['event_time'].dt.hour

User age binning

In 17:
    # Age distribution by sex
    fig = px.box(df, y=["age"], color="sex")
    fig.show()

In 18:
    # Number of records at each age
    fig = plt.figure(figsize=(12, 6))
    sns.countplot(x=df["age"])
    plt.title("Counts of Different Age")
    plt.show()

Binning the age field:

In 19:
    df["age"] = pd.cut(df["age"], bins=4, precision=0)
    df["age"]  # the age field after binning

Out19:
    0         (16.0, 24.0]
    1         (33.0, 42.0]
    2         (24.0, 33.0]
    3         (16.0, 24.0]
    4         (16.0, 24.0]
                  ...
    561209    (16.0, 24.0]
    561210    (16.0, 24.0]
    561211    (16.0, 24.0]
    561212    (16.0, 24.0]
    561213    (16.0, 24.0]
    Name: age, Length: 561175, dtype: category
    Categories (4, interval[float64, right]): [(16.0, 24.0] < (24.0, 33.0] < (33.0, 42.0] < (42.0, 50.0]]

Comparing spending levels across regions

In 22:
    fig = px.scatter(
        df[df["brand"] != "missing"],  # exclude the "missing" rows
        # x="local",
        y="price",
        facet_col="age",
        color="local",
        size="price"
    )
    fig.show()

Brand preferences by age group and sex

In 23:
    age_brand = df.groupby(["age", "sex", "brand"]).size().reset_index().rename(columns={0: "number"})
    age_brand.head()

Out23:
                age  sex       brand  number
    0  (16.0, 24.0]   女      a-case      32
    1  (16.0, 24.0]   女       acana       0
    2  (16.0, 24.0]   女  accesstyle       3
    3  (16.0, 24.0]   女      action       0
    4  (16.0, 24.0]   女  activision       3

In 24:
    # Sort by age ascending, then by number descending
    age_brand = age_brand.sort_values(["age", "number"], ascending=[True, False], ignore_index=True)
    age_brand.head()

Out24:
                age  sex    brand  number
    0  (16.0, 24.0]   男  samsung   11884
    1  (16.0, 24.0]   女  samsung   11882
    2  (16.0, 24.0]   男    apple    4561
    3  (16.0, 24.0]   女    apple    4283
    4  (16.0, 24.0]   男  missing    3354

In 25:
    # Filter by condition
    age_brand = \
        age_brand.query("number > 0 & brand != 'missing'")

In 26:
    fig = px.treemap(
        age_brand,  # input data
        path=[px.Constant("all"), "age", "sex", "brand"],  # hierarchy path
        values="number"  # values shown
    )
    fig.update_traces(root_color="lightskyblue")
    fig.update_layout(margin=dict(t=30, l=30, r=25, b=30))
    fig.show()

Brand count word cloud

In 27:
    age_brand.head()

Out27:
                age  sex    brand  number
    0  (16.0, 24.0]   男  samsung   11884
    1  (16.0, 24.0]   女  samsung   11882
    2  (16.0, 24.0]   男    apple    4561
    3  (16.0, 24.0]   女    apple    4283
    6  (16.0, 24.0]   男      ava    3317

In 28:
    brand_list = age_brand["brand"].value_counts().reset_index()
    brand_list.columns = ["word", "number"]
    brand_list.head(10)

Out28:
              word  number
    0      samsung       8
    1       darina       8
    2        huion       8
    3     aquapick       8
    4      amigami       8
    5        sjcam       8
    6     rockstar       8
    7       franke       8
    8  bridgestone       8
    9        tailg       8

Every brand listed shows 8 because value_counts() here counts how many of the 8 age-by-sex groups (4 age bins x 2 sexes) still contain the brand after filtering, not how many items were sold.

In 29:
    information_zip = [tuple(z) for z in zip(brand_list["word"].tolist(), brand_list["number"].tolist())]

    # Plot
    c = (
        WordCloud()
        .add("", information_zip, word_size_range=[20, 80], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Brand word cloud"))
    )
    c.render_notebook()

Categories (category_code) across brands

Processing category_code

To see how many distinct category_code values there are and how often each occurs, use the value_counts() method:

In 30:
    df["category_code"].value_counts()

Out30:
    missing                             128662
    electronics.smartphone              101502
    computers.notebook                   25917
    appliances.kitchen.refrigerators     20296
    electronics.audio.headphone          20049
                                         ...
    kids.swing                               8
    country_yard.watering                    5
    sport.snowboard                          3
    apparel.costume                          2
    apparel.shoes                            2
    Name: category_code, Length: 124, dtype: int64

Conclusion: apart from the missing entries, the most frequent category is electronics.smartphone (smartphones), followed by computers.notebook (laptops).

In 31:
    # Positions 1-29: the 29 most frequent categories after skipping "missing" at position 0
    fig = px.bar(df["category_code"].value_counts()[1:30])
    fig.show()

Select only the fields needed:

In 32:
    df = df[df["category_code"] != "missing"]  # drop the "missing" rows
    df = df[["category_code", "brand", "age", "sex", "local"]]

Split the category_code field on ".":

In 33:
    df["category_code"] = df["category_code"].apply(lambda x: x.split(".") if "."
        in x else [x])
    df.head()

Out33:
                            category_code    brand           age  sex  local
    0                 electronics, tablet  samsung  (16.0, 24.0]   女   海南
    1       electronics, audio, headphone   huawei  (33.0, 42.0]   女   北京
    3           furniture, kitchen, table  maestro  (16.0, 24.0]   男   重庆
    4             electronics, smartphone    apple  (16.0, 24.0]   男   北京
    5  appliances, kitchen, refrigerators       lg  (16.0, 24.0]   男   北京

category_code word cloud

In 34:
    data = df["category_code"].tolist()
    data[:3]

Out34:
    [['electronics', 'tablet'],
     ['electronics', 'audio', 'headphone'],
     ['furniture', 'kitchen', 'table']]

In 35:
    import itertools

    # chain.from_iterable flattens the nested lists into one flat list
    sum_data = list(itertools.chain.from_iterable(data))
    sum_data[:10]

Out35:
    ['electronics', 'tablet', 'electronics', 'audio', 'headphone',
     'furniture', 'kitchen', 'table', 'electronics', 'smartphone']

In 36:
    category_code_number = pd.Series(sum_data).value_counts().reset_index()
    category_code_number.columns = ["category_code", "number"]
    category_code_number.head()

Out36:
      category_code  number
    0   electronics  156709
    1    appliances  150331
    2       kitchen  107852
    3    smartphone  101502
    4     computers   76877

In 37:
    information_zip = [tuple(z) for z in zip(category_code_number["category_code"].tolist(),
                                             category_code_number["number"].tolist())]

    # Plot
    c = (
        WordCloud()
        .add("", information_zip, word_size_range=[20, 80], shape=SymbolType.DIAMOND)
        .set_global_opts(title_opts=opts.TitleOpts(title="Product category word cloud"))
    )
    c.render_notebook()

Association rule modeling

By sex

Finding frequent itemsets: male

In 38:
    male = df[df["sex"] == "男"]  # "男" = male
    male.head()

Out38:
                             category_code    brand           age  sex  local
    3            furniture, kitchen, table  maestro  (16.0, 24.0]   男   重庆
    4              electronics, smartphone    apple  (16.0, 24.0]   男   北京
    5   appliances, kitchen, refrigerators       lg  (16.0, 24.0]   男   北京
    6         appliances, personal, scales  polaris  (24.0, 33.0]   男   广东
    17         appliances, kitchen, kettle    tefal  (33.0, 42.0]   男   广东

In 39:
    import efficient_apriori as ea

    male_list = male["category_code"].tolist()

    # itemsets: frequent itemsets; rules: association rules
    itemsets, rules = ea.apriori(male_list, min_support=0.005, min_confidence=1)

Frequent 1-itemsets

In 40:
    len(itemsets[1])

Out40:
    60

In 41:
    itemsets[1]  #
    # the frequent 1-itemsets

In 42:
    # Sort the dictionary by value in descending order
    dict(sorted(itemsets[1].items(), key=lambda x: x[1], reverse=True))

Frequent 2-itemsets

In 43:
    len(itemsets[2])  # total count

Out43:
    84

In 44:
    # Frequent 2-itemsets
    dict(sorted(itemsets[2].items(), key=lambda x: x[1], reverse=True))

Frequent 3-itemsets

In 45:
    len(itemsets[3])  # total count

Out45:
    32

In 46:
    # Frequent 3-itemsets
    dict(sorted(itemsets[3].items(), key=lambda x: x[1], reverse=True))

Out46:
    {('appliances', 'kitchen', 'refrigerators'): 10209,
     ('audio', 'electronics', 'headphone'): 10154,
     ('electronics', 'tv', 'video'): 8876,
     ('appliances', 'environment', 'vacuum'): 8069,
     ('appliances', 'kitchen', 'washer'): 7235,
     ('appliances', 'kettle', 'kitchen'): 6389,
     ('computers', 'mouse', 'peripherals'): 6359,
     ('furniture', 'kitchen', 'table'): 5626,
     ('appliances', 'hood', 'kitchen'): 4487,
     ('appliances', 'blender', 'kitchen'): 4439,
     ('appliances', 'kitchen', 'microwave'): 3830,
     ('air_conditioner', 'appliances', 'environment'): 3806,
     ('appliances', 'personal', 'scales'): 3423,
     ('computers', 'network', 'router'): 3318,
     ('components', 'computers', 'hdd'): 2598,
     ('appliances', 'kitchen', 'meat_grinder'): 2361,
     ('components', 'computers', 'cpu'): 2055,
     ('appliances', 'kitchen', 'oven'): 1958,
     ('appliances', 'environment', 'fan'): 1952,
     ('computers', 'keyboard', 'peripherals'): 1940,
     ('computers', 'peripherals', 'printer'): 1802,
     ('appliances', 'environment', 'water_heater'): 1753,
     ('computers', 'monitor', 'peripherals'): 1733,
     ('components', 'computers', 'cooler'): 1717,
     ('cabinet', 'furniture', 'living_room'): 1550,
     ('chair', 'furniture', 'kitchen'): 1513,
     ('appliances', 'hair_cutter', 'personal'): 1388,
     ('air_heater', 'appliances', 'environment'): 1341,
     ('appliances', 'dishwasher', 'kitchen'): 1329,
     ('furniture', 'living_room', 'shelving'): 1314,
     ('appliances', 'kitchen', 'mixer'): 1288,
     ('construction', 'screw', 'tools'): 1194}

Finding frequent itemsets: female

In 47:
    female = df[df["sex"] == "女"]  # "女" = female
    female.head()

Out47:
                        category_code    brand           age  sex  local
    0             electronics, tablet  samsung  (16.0, 24.0]   女   海南
    1   electronics, audio, headphone   huawei  (33.0, 42.0]   女   北京
    7          electronics, video, tv  samsung  (16.0, 24.0]   女   北京
    8      computers, components, cpu    intel  (42.0, 50.0]   女   浙江
    10            computers, notebook     asus  (42.0, 50.0]   女   广东

In 48:
    import efficient_apriori as ea

    # Use the female subset here. The recorded run passed male["category_code"],
    # so Out49 below repeats the male count and may change when this cell is re-run.
    female_list = female["category_code"].tolist()

    # itemsets: frequent itemsets; rules: association rules
    itemsets, rules = ea.apriori(female_list, min_support=0.005, min_confidence=1)

Frequent 1-itemsets

In 49:
    len(itemsets[1])  # total count

Out49:
    60

In 50:
    # Frequent 1-itemsets
    dict(sorted(itemsets[1].items(), key=lambda x: x[1], reverse=True))

Frequent 2-itemsets

In 51:
    # Frequent 2-itemsets
    dict(sorted(itemsets[2].items(), key=lambda x: x[1], reverse=True))

Frequent 3-itemsets

In 52:
    # Frequent 3-itemsets
    dict(sorted(itemsets[3].items(), key=lambda x: x[1], reverse=True))

By brand

In 53:
    # Summing lists concatenates them: one combined category list per brand
    brand_category = df.groupby(["brand"])["category_code"].sum().reset_index()
    brand_category

    # Deduplicate each brand's list with set
    brand_category["category_code"] = brand_category["category_code"].apply(lambda x: list(set(x)))
    brand_category

    import efficient_apriori as ea

    brand_list = brand_category["category_code"].tolist()

    # itemsets: frequent itemsets; rules: association rules
    itemsets, rules = ea.apriori(brand_list, min_support=0.05, min_confidence=1)

    # Frequent 3-itemsets
    dict(sorted(itemsets[3].items(), key=lambda x: x[1], reverse=True))

    # Frequent 2-itemsets
    dict(sorted(itemsets[2].items(), key=lambda x: x[1], reverse=True))

    # Frequent 1-itemsets
    dict(sorted(itemsets[1].items(), key=lambda x: x[1], reverse=True))

Conclusions

- Age: the average customer is about 33 years old, a core consumer group with some spending power.
- Brand preference: users mainly favor Samsung, Apple, ava (mainly children's products such as children's helmets and motorcycles), and tefal (mainly home appliances such as steamers and non-stick pans).
- Categories: users pay the most attention to smartphone, kitchen, and electronics; in other words, smartphones, kitchenware, and electronic products.
- Association rules by sex: for both male and female users, the associated items are most likely electronics with smartphone, appliances with kitchen, or computers with notebook.
- Association rules by brand: within a single brand, appliances with kitchen, and audio -> electronics -> headphone, are the main associated items.
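As a rough illustration of the two thresholds passed to ea.apriori above (min_support and min_confidence), the sketch below computes support and confidence by brute-force counting on a handful of made-up transactions. It is not the pruned search that the Apriori algorithm performs, and the helper names (itemset_counts, support, confidence) and the toy data are assumptions for illustration only, not part of the original analysis.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions shaped like the split category_code lists above.
transactions = [
    ["electronics", "smartphone"],
    ["electronics", "smartphone"],
    ["electronics", "audio", "headphone"],
    ["appliances", "kitchen", "kettle"],
    ["appliances", "kitchen", "washer"],
]


def itemset_counts(transactions, max_size=3):
    """Count how many transactions contain each itemset of up to max_size items."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # sorted tuples give every itemset one canonical key
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return counts


def support(counts, itemset, n):
    """Fraction of the n transactions that contain every item in itemset."""
    return counts[tuple(sorted(itemset))] / n


def confidence(counts, lhs, rhs):
    """Share of transactions containing lhs that also contain rhs."""
    both = tuple(sorted(set(lhs) | set(rhs)))
    return counts[both] / counts[tuple(sorted(lhs))]


counts = itemset_counts(transactions)
n = len(transactions)

print(support(counts, ["electronics", "smartphone"], n))    # 0.4
print(confidence(counts, ["smartphone"], ["electronics"]))  # 1.0
print(confidence(counts, ["electronics"], ["smartphone"]))  # 0.6666666666666666
```

With min_confidence=1, as in the calls above, only rules such as smartphone -> electronics, which hold in every transaction containing the left-hand side, would pass the threshold; electronics -> smartphone would not.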