3 Python Data Analysis: US State Population Case Study, Advanced Pandas Operations, US Election Donations Case Study, matplotlib

2024-07-05 09:27 | Source: compiled from the web

Python Data Analysis

1 Case Study: US State Population Analysis

1.1 Data Overview

Data source: https://github.com/jakevdp/data-USstates/

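1.1.1 State population table: state-population.csv

| Field | Meaning | Notes |
| --- | --- | --- |
| state/region | State or region | State abbreviation; matches the abbreviation field in state-abbrevs.csv |
| ages | Age group | Either under18 (under 18 years old) or total (all ages) |
| year | Year | |
| population | Population | |

1.1.2 State area table: state-areas.csv

| Field | Meaning | Notes |
| --- | --- | --- |
| state | State name | Full state name; matches the state field in state-abbrevs.csv |
| area (sq. mi) | State area | In square miles |

1.1.3 State abbreviation table: state-abbrevs.csv

| Field | Meaning | Notes |
| --- | --- | --- |
| state | Full state name | |
| abbreviation | State abbreviation | |

1.2 Requirements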

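1. Merge the population data with the state abbreviation data and drop the duplicate abbreviation column from the merged result.
2. Find the state/region values for which state is NaN and fill in appropriate names.
3. Find and drop the rows with missing values in the area (sq. mi) column.
4. Select the 2010 population data for all ages (ages equal to total).
5. Compute the population density of each state, sort, and find the state with the highest density.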

1.3 Analysis: State Population Data and State Abbreviation Data

1.3.1 Loading the data

    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # State population table
    popu_df = pd.read_csv('./state-population.csv')
    popu_df.head()
    '''
      state/region     ages  year  population
    0           AL  under18  2012   1117489.0
    1           AL    total  2012   4817528.0
    2           AL  under18  2010   1130966.0
    3           AL    total  2010   4785570.0
    4           AL  under18  2011   1125763.0
    '''

    # State area table
    area_df = pd.read_csv('./state-areas.csv')
    area_df.head()
    '''
            state  area (sq. mi)
    0     Alabama          52423
    1      Alaska         656425
    2     Arizona         114006
    3    Arkansas          53182
    4  California         163707
    '''

    # State abbreviation table
    abbr_df = pd.read_csv('./state-abbrevs.csv')
    abbr_df.head()
    '''
            state abbreviation
    0     Alabama           AL
    1      Alaska           AK
    2     Arizona           AZ
    3    Arkansas           AR
    4  California           CA
    '''

1.3.2 Merging: state population data with state abbreviation data

Merge the state population data with the state abbreviation data, then drop the duplicate column.
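To see what such a merge does before running it on the full files, here is a minimal sketch on made-up miniature tables. The values and the names `popu`, `abbr`, and `merged` are illustrative assumptions, not part of the dataset. Passing `indicator=True` adds a `_merge` column that flags rows with no match in the right-hand table.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values are made up).
popu = pd.DataFrame({
    'state/region': ['AL', 'AL', 'PR'],
    'ages': ['total', 'under18', 'total'],
    'year': [2010, 2010, 2010],
    'population': [100.0, 25.0, 40.0],
})
abbr = pd.DataFrame({'state': ['Alabama'], 'abbreviation': ['AL']})

# Outer merge on differently named key columns; keep unmatched rows.
merged = pd.merge(popu, abbr, left_on='state/region',
                  right_on='abbreviation', how='outer', indicator=True)

# The abbreviation column duplicates state/region, so drop it.
merged = merged.drop(columns='abbreviation')
print(merged)
```

Rows flagged `left_only` are the ones whose `state` is NaN after the merge, because their abbreviation has no entry in the lookup table.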

    # Merge
    popu_abbr_df = pd.merge(left=popu_df, right=abbr_df,
                            left_on='state/region', right_on='abbreviation',
                            how='outer')

    # Drop the duplicate abbreviation column
    popu_abbr_df.drop(labels='abbreviation', axis=1, inplace=True)

1.3.3 Handling missing values

Check whether any values are missing

    popu_abbr_df.info()
    '''
    Int64Index: 2544 entries, 0 to 2543
    Data columns (total 5 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   state         2448 non-null   object
     1   state/region  2544 non-null   object
     2   ages          2544 non-null   object
     3   year          2544 non-null   int64
     4   population    2524 non-null   float64
    dtypes: float64(1), int64(1), object(3)
    memory usage: 119.2+ KB
    '''

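The table has 2544 rows in total, but the state and population columns hold fewer than 2544 non-null values (2448 and 2524 respectively), so both columns contain missing data.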

    popu_abbr_df.isnull().any(axis=0)
    '''
    state            True
    state/region    False
    ages            False
    year            False
    population       True
    dtype: bool
    '''
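The same check can be tried on a small hand-made table. This is a minimal sketch with made-up values; the names `df`, `has_nan`, `nan_counts`, and `rows_with_nan` are illustrative. It shows the per-column flag used above, plus a per-column count and a row filter.

```python
import numpy as np
import pandas as pd

# Made-up table with a few missing entries.
df = pd.DataFrame({
    'state': ['Alabama', np.nan, np.nan],
    'state/region': ['AL', 'PR', 'USA'],
    'population': [100.0, np.nan, 300.0],
})

# axis=0 reduces over rows: one True/False per column.
has_nan = df.isnull().any(axis=0)

# Summing the boolean mask gives the number of missing values per column.
nan_counts = df.isnull().sum()

# axis=1 reduces over columns: one True/False per row, used as a row filter.
rows_with_nan = df[df.isnull().any(axis=1)]
print(has_nan, nan_counts, rows_with_nan, sep='\n\n')
```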

Inspect the missing values

    nan_df = popu_abbr_df.loc[popu_abbr_df.isnull().any(axis=1)]
    nan_df.head()
    '''
         state state/region     ages  year  population
    2448   NaN           PR  under18  1990         NaN
    2449   NaN           PR    total  1990         NaN
    2450   NaN           PR    total  1991         NaN
    2451   NaN           PR  under18  1991         NaN
    2452   NaN           PR    total  1993         NaN
    '''

    nan_df['state/region'].unique()
    # array(['PR', 'USA'], dtype=object)

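This shows that the abbreviations PR and USA have no matching full name in the abbreviation table, so their names have to be filled in by hand.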

    # Fill in the full name for Puerto Rico
    pr_indexs = popu_abbr_df.loc[popu_abbr_df['state/region'] == 'PR'].index
    popu_abbr_df.loc[pr_indexs, 'state'] = 'Puerto Rico'

    # Fill in the full name for the national totals
    usa_indexs = popu_abbr_df.loc[popu_abbr_df['state/region'] == 'USA'].index
    popu_abbr_df.loc[usa_indexs, 'state'] = 'United States'
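The fill step can also be written without the intermediate index lookup, by passing the boolean condition straight to `.loc`. A minimal sketch on a made-up table follows; the name `df` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up table where two abbreviations have no full name yet.
df = pd.DataFrame({
    'state': ['Alabama', np.nan, np.nan],
    'state/region': ['AL', 'PR', 'USA'],
})

# A boolean mask selects the rows; the column label selects the target column.
df.loc[df['state/region'] == 'PR', 'state'] = 'Puerto Rico'
df.loc[df['state/region'] == 'USA', 'state'] = 'United States'
print(df)
```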

Drop the rows with missing values in the population column
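As an aside, pandas offers a shortcut for this kind of step: `DataFrame.dropna` with the `subset` parameter drops rows that are missing a value in the listed columns only. The sketch below uses a made-up table; `df` and `cleaned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up table: one row has no population value.
df = pd.DataFrame({
    'state/region': ['AL', 'PR', 'PR'],
    'population': [100.0, np.nan, 40.0],
})

# Only the population column is checked; other columns may still hold NaN.
cleaned = df.dropna(subset=['population'])
print(cleaned)
```

Note that the original index labels are kept, so the result has gaps in its index unless `reset_index(drop=True)` is applied afterwards.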

    nan_indexs = popu_abbr_df.loc[popu_abbr_df['population'].isnull()].index
    popu_abbr_df.drop(labels=nan_indexs, axis=0, inplace=True)
    popu_abbr_df.info()
    '''
    Int64Index: 2524 entries, 0 to 2543
    Data columns (total 5 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   state         2524 non-null   object
     1   state/region  2524 non-null   object
     2   ages          2524 non-null   object
     3   year          2524 non-null   int64
     4   population    2524 non-null   float64
    dtypes: float64(1), int64(1), object(3)
    memory usage: 118.3+ KB
    '''

1.3.4 Analysis of the ages column

Distinct values in the ages column

    popu_abbr_df['ages'].unique()
    # array(['under18', 'total'], dtype=object)

Count how many times each value appears in the ages column

    popu_abbr_df['ages'].value_counts()
    '''
    total      1262
    under18    1262
    Name: ages, dtype: int64
    '''

1.4 Analysis: State Population Data and State Area Data

1.4.1 Merging: state population data with state area data

    popu_area_df = pd.merge(left=popu_abbr_df, right=area_df, on='state', how='outer')
    popu_area_df.info()
    '''
    Int64Index: 2524 entries, 0 to 2523
    Data columns (total 6 columns):
     #   Column         Non-Null Count  Dtype
    ---  ------         --------------  -----
     0   state          2524 non-null   object
     1   state/region   2524 non-null   object
     2   ages           2524 non-null   object
     3   year           2524 non-null   int64
     4   population     2524 non-null   float64
     5   area (sq. mi)  2476 non-null   float64
    dtypes: float64(2), int64(1), object(3)
    memory usage: 138.0+ KB
    '''

1.4.2 Handling missing values

    # Drop the rows that have no area value
    nan_indexs = popu_area_df.loc[popu_area_df['area (sq. mi)'].isnull()].index
    popu_area_df.drop(labels=nan_indexs, axis=0, inplace=True)
    popu_area_df.info()
    '''
    Int64Index: 2476 entries, 0 to 2475
    Data columns (total 6 columns):
     #   Column         Non-Null Count  Dtype
    ---  ------         --------------  -----
     0   state          2476 non-null   object
     1   state/region   2476 non-null   object
     2   ages           2476 non-null   object
     3   year           2476 non-null   int64
     4   population     2476 non-null   float64
     5   area (sq. mi)  2476 non-null   float64
    dtypes: float64(2), int64(1), object(3)
    memory usage: 135.4+ KB
    '''

1.4.3 Selecting the 2010 population data for all ages

Conditional query
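A conditional query can be expressed either as a `DataFrame.query` string or as a boolean mask. The sketch below, on a made-up table with illustrative names (`df`, `by_query`, `by_mask`), shows that the two forms select the same rows.

```python
import pandas as pd

# Made-up rows covering two years and both age groups.
df = pd.DataFrame({
    'state': ['A', 'A', 'A', 'A'],
    'ages': ['total', 'under18', 'total', 'under18'],
    'year': [2010, 2010, 2011, 2011],
    'population': [1000.0, 250.0, 1010.0, 255.0],
})

# Query string: column names are used directly, string literals are quoted.
by_query = df.query('year == 2010 and ages == "total"')

# Equivalent boolean mask; each condition needs its own parentheses.
by_mask = df[(df['year'] == 2010) & (df['ages'] == 'total')]
print(by_query)
```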

    popu_area_df.query('year == 2010 & ages == "total"')

1.4.4 Ranking by population density

Compute the population density

    popu_area_df['population density'] = popu_area_df['population'] / popu_area_df['area (sq. mi)']
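Because the table holds one row per state, year, and age group, a density ranking is easiest to read when restricted to a single year and to the total age group. The following is a minimal sketch on made-up numbers; the states `A` and `B` and the names `total_2010`, `ranked`, and `densest` are assumptions for illustration.

```python
import pandas as pd

# Made-up data: two states, one year, both age groups.
df = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'ages': ['total', 'under18', 'total', 'under18'],
    'year': [2010, 2010, 2010, 2010],
    'population': [1000.0, 250.0, 600.0, 100.0],
    'area (sq. mi)': [100.0, 100.0, 20.0, 20.0],
})

# Density = people per square mile, computed row by row.
df['population density'] = df['population'] / df['area (sq. mi)']

# Keep only the all-ages rows for 2010, then rank from densest to sparsest.
total_2010 = df[(df['year'] == 2010) & (df['ages'] == 'total')]
ranked = total_2010.sort_values(by='population density', ascending=False)

# The first row after sorting is the densest state.
densest = ranked.iloc[0]['state']
print(densest)
```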

Sort by population density (descending)

    popu_area_df.sort_values(by='population density', axis=0, ascending=False)

2 Advanced Pandas Operations

2.1 Replacement: replace

    df.replace(to_replace=old_value, value=new_value)

replace works on both Series and DataFrame objects.
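A minimal sketch with fixed (non-random) made-up data illustrates this on both object types. The names `s`, `df`, `df_all`, and `df_col` are illustrative. The last call passes a dict of the form `{column: old_value}`, which restricts the replacement to one column.

```python
import pandas as pd

# Series: every 0 becomes -1.
s = pd.Series([0, 1, 2, 0])
s_replaced = s.replace(to_replace=0, value=-1)

# DataFrame: every 9 becomes -1, in all columns.
df = pd.DataFrame({'a': [0, 9], 'b': [9, 0]})
df_all = df.replace(to_replace=9, value=-1)

# DataFrame, one column only: 9 is replaced in column 'a', not in 'b'.
df_col = df.replace(to_replace={'a': 9}, value=-1)
print(s_replaced, df_all, df_col, sep='\n\n')
```

replace returns a new object in each case; the original `s` and `df` are left unchanged.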

2.1.1 Single-value replacement

    import numpy as np
    import pandas as pd
    from pandas import DataFrame

    df = DataFrame(data=np.random.randint(0, 10, size=(3, 6)))
    # Example output (the values are random)
    '''
       0  1  2  3  4  5
    0  2  4  8  9  8  0
    1  6  1  3  8  0  2
    2  3  6  7  9  8  4
    '''

Plain replacement: every element that matches is replaced.

    re_df1 = df.replace(to_replace=0, value='zero')
    '''
       0  1  2  3     4     5
    0  2  4  8  9     8  zero
    1  6  1  3  8  zero     2
    2  3  6  7  9     8     4
    '''

2.1.2 Multi-value replacement

    # Each key is a value to replace, each dict value is its replacement
    re_df2 = df.replace(to_replace={6: 'six', 2: 'two'})
    '''
         0    1  2  3  4    5
    0  two    4  8  9  8    0
    1  six    1  3  8  0  two
    2    3  six  7  9  8    4
    '''

2.1.3 Column-specific replacement

Replace the number 9 in the column labeled 3 with the string nine.

    # The dict key is the column label, the dict value is the value to replace
    re_df3 = df.replace(to_replace={3: 9}, value='nine')
    '''
       0  1  2     3  4  5
    0  2  4  8  nine  8  0
    1  6  1  3     8  0  2
    2  3  6  7  nine  8  4
    '''

2.2 Mapping: map

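Mapping: using a lookup table, each element is associated with a specific label or string, which gives the element an alternative representation. map passes the elements of the calling Series, one at a time, to the mapping dictionary or function and returns the mapped values. The map used here is a Series method and is called on a Series object (pandas 2.1 and later also provide an element-wise DataFrame.map).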
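A minimal sketch of both forms, on made-up names, is shown below; `names`, `by_dict`, and `by_func` are illustrative. It also shows what happens to an element that has no entry in the mapping dictionary: the result for that element is NaN.

```python
import pandas as pd

names = pd.Series(['Jay', 'Tom', 'Ann'])

# Dictionary mapping: 'Ann' has no entry, so its result is NaN.
by_dict = names.map({'Jay': 'J', 'Tom': 'T'})

# Function mapping: the function is called once per element.
by_func = names.map(str.upper)
print(by_dict, by_func, sep='\n\n')
```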

    name_salary_dict = {
        'name': ['Jay', 'Tom', 'Jay'],
        'salary': [1000, 2000, 1000]
    }
    name_salary_df = DataFrame(data=name_salary_dict)
    '''
      name  salary
    0  Jay    1000
    1  Tom    2000
    2  Jay    1000
    '''

Mapping dictionary

    mapping_dict = {
        'Jay': '杰',
        'Tom': '汤姆'
    }
    name_salary_df['name_chs'] = name_salary_df['name'].map(mapping_dict)
    '''
      name  salary name_chs
    0  Jay    1000        杰
    1  Tom    2000       汤姆
    2  Jay    1000        杰
    '''

2.3 Computation: map and apply

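Using the data in the table above: the portion of a salary above 1000 is subject to a 50% deduction. Compute the resulting amount for each person.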

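    def calc_salary(salary):
        # For salary >= 1000 this returns half of the amount above 1000;
        # smaller salaries are returned unchanged.
        if salary >= 1000:
            salary = (salary - 1000) * 0.5
        return salary

2.3.1 map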

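map applies the function to every element of the calling Series. As used here it is called on a Series, and the function it receives must accept exactly one argument (the element).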

    name_salary_df['salary_real'] = name_salary_df['salary'].map(calc_salary)

2.3.2 apply

Series

apply supports passing additional arguments to the function.
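A minimal sketch of how the extra arguments travel: positional extras go in the `args` tuple, and keyword extras are given directly to `apply`, which forwards them to the function. The helper `excess_over` and its `rate` parameter are hypothetical and exist only for this illustration.

```python
import pandas as pd

def excess_over(value, base, rate=1.0):
    """Hypothetical helper: rate times the amount above base, never negative."""
    return max(value - base, 0) * rate

s = pd.Series([1000, 2000])

# base is supplied positionally through args.
positional = s.apply(excess_over, args=(500,))

# rate is supplied as a keyword argument and forwarded to the function.
keyword = s.apply(excess_over, args=(500,), rate=0.5)
print(positional.tolist(), keyword.tolist())
```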

The portion of a salary above a threshold base is subject to a 50% deduction. Compute the resulting amount for each person.

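    def calc_salary(salary, base):
        # For salary >= base this returns half of the amount above base;
        # smaller salaries are returned unchanged.
        if salary >= base:
            salary = (salary - base) * 0.5
        return salary

    # base is passed through args as an extra positional argument
    name_salary_df['salary_real'] = name_salary_df['salary'].apply(calc_salary, args=(500,))
    '''
      name  salary name_chs  salary_real
    0  Jay    1000        杰        250.0
    1  Tom    2000       汤姆        750.0
    2  Jay    1000        杰        250.0
    '''

DataFrame

On a DataFrame, apply calls the function once per column (axis=0) or once per row (axis=1).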
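The difference between the two axes is easiest to see on a small made-up table. In this sketch the names `df`, `col_sums`, and `row_sums` are illustrative; with `axis=0` the function receives each column as a Series, and with `axis=1` it receives each row as a Series.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# axis=0: the lambda is called once per column; the result is indexed by column label.
col_sums = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the lambda is called once per row; the result is indexed by row label.
row_sums = df.apply(lambda row: row.sum(), axis=1)
print(col_sums, row_sums, sep='\n\n')
```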

Give everyone a raise of 100.

    # x is one column at a time; x.name is the column label
    name_salary_df.apply(
        lambda x: x + 100 if x.name in ['salary', 'salary_real'] else x,
        axis=0)
    '''
      name  salary name_chs  salary_real
    0  Jay    1100        杰        350.0
    1  Tom    2100       汤姆        850.0
    2  Jay    1100        杰        350.0
    '''

2.4 Random sampling

2.4.1 numpy.random.permutation

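permutation shuffles the order of elements. Given an integer n, it returns a shuffled array of the integers 0 to n-1; given an array or list, it returns a shuffled copy as an array.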

    np.random.permutation(3)
    # array([0, 2, 1])
    np.random.permutation(['a', 'b', 'c', 'd'])
    # array(['c', 'a', 'd', 'b'], dtype='
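One common way to use permutation for sampling is to shuffle row positions and pass them to `DataFrame.take`. The sketch below uses a made-up table; `df`, `order`, `shuffled`, and `sample` are illustrative names, and the resulting order differs from run to run.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4], 'v': ['a', 'b', 'c', 'd']})

# A random ordering of the row positions 0 .. len(df) - 1.
order = np.random.permutation(len(df))

# take selects rows by position, so this returns all rows in shuffled order.
shuffled = df.take(order)

# The first k rows of the shuffled table form a random sample without replacement.
sample = shuffled.head(2)
print(sample)
```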
