2024-06-10 03:02 | Source: compiled from the web

10_Pandas: splitting strings into multiple columns with a delimiter or a regular expression

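This article shows how to split a pandas column that holds strings into multiple columns.

Use the following string methods.

str.split(): split on a delimiter
str.extract(): split by regular expression

These string methods belong to pandas.Series and are reached through the .str accessor.

They can be applied to a pandas.Series or to a single column of a pandas.DataFrame.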

str.split(): split on a delimiter

To split on a delimiter, use the string method str.split().

pandas.Series

Take the following pandas.Series as an example.

import pandas as pd

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.com', 'ddd'],
                  index=['A', 'B', 'C', 'D'])

print(s_org)
print(type(s_org))
# A    aaa@xxx.com
# B    bbb@yyy.com
# C    ccc@zzz.com
# D            ddd
# dtype: object
# <class 'pandas.core.series.Series'>

Pass the delimiter as the first argument. The result is a pandas.Series in which each element is a list of the split strings.

s = s_org.str.split('@')

print(s)
print(type(s))
# A    [aaa, xxx.com]
# B    [bbb, yyy.com]
# C    [ccc, zzz.com]
# D             [ddd]
# dtype: object
# <class 'pandas.core.series.Series'>
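Because each element of the result is a list, the individual parts can be picked out by position through the same .str accessor. The following is a minimal sketch rather than part of the original article; the variable names `parts`, `local`, and `domain` are chosen here for illustration.

```python
import pandas as pd

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.com', 'ddd'],
                  index=['A', 'B', 'C', 'D'])
parts = s_org.str.split('@')

# .str[i] indexes into each list; a position that does not exist yields NaN.
local = parts.str[0]
domain = parts.str.get(1)  # equivalent to parts.str[1]

print(local)
print(domain)
```

Row D has only one part, so its entry in `domain` is missing rather than raising an error.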

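Pass expand=True to split into multiple columns and get the result as a pandas.DataFrame. The default is expand=False.

Rows that split into fewer parts than there are columns are padded with None.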

df = s_org.str.split('@', expand=True)

print(df)
print(type(df))
#      0        1
# A  aaa  xxx.com
# B  bbb  yyy.com
# C  ccc  zzz.com
# D  ddd     None
# <class 'pandas.core.frame.DataFrame'>
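The padded cells can be located with isna(), which treats None and NaN alike. This is a small sketch under that assumption, with `missing` as an illustrative name that does not appear in the original.

```python
import pandas as pd

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.com', 'ddd'],
                  index=['A', 'B', 'C', 'D'])
df = s_org.str.split('@', expand=True)

# True where the second column was padded because the delimiter was absent.
missing = df[1].isna()
print(missing)

# Keep only the rows that actually contained the delimiter.
print(df[~missing])
```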

The column names of the resulting pandas.DataFrame can be set through its columns attribute.

df.columns = ['local', 'domain']

print(df)
#   local   domain
# A   aaa  xxx.com
# B   bbb  yyy.com
# C   ccc  zzz.com
# D   ddd     None

pandas.DataFrame

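Updating a pandas.DataFrame by splitting one of its columns into several takes a few steps and is somewhat tedious. There may be a better way.

Take the pandas.DataFrame created above as an example.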

print(df)
#   local   domain
# A   aaa  xxx.com
# B   bbb  yyy.com
# C   ccc  zzz.com
# D   ddd     None

Calling str.split() on a specific column returns a pandas.DataFrame of the split parts.

print(df['domain'].str.split('.', expand=True))
#       0     1
# A   xxx   com
# B   yyy   com
# C   zzz   com
# D  None  None

Concatenate it with the original pandas.DataFrame using pd.concat(), then remove the original column with the drop() method.

df2 = pd.concat([df, df['domain'].str.split('.', expand=True)], axis=1).drop('domain', axis=1)

print(df2)
#   local     0     1
# A   aaa   xxx   com
# B   bbb   yyy   com
# C   ccc   zzz   com
# D   ddd  None  None

If only a few of the remaining columns are needed, you can select just those columns when concatenating with pd.concat().

df3 = pd.concat([df['local'], df['domain'].str.split('.', expand=True)], axis=1)

print(df3)
#   local     0     1
# A   aaa   xxx   com
# B   bbb   yyy   com
# C   ccc   zzz   com
# D   ddd  None  None

To rename specific columns, use the rename() method.

df3.rename(columns={0: 'second_LD', 1: 'TLD'}, inplace=True)

print(df3)
#   local second_LD   TLD
# A   aaa       xxx   com
# B   bbb       yyy   com
# C   ccc       zzz   com
# D   ddd      None  None
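One possible shortcut for the concat, drop, and rename steps is to assign the split result straight to new, already named columns. This is a sketch of a common idiom, not the article's own method; it assumes a reasonably recent pandas version in which a DataFrame can be assigned to a list of new column labels.

```python
import pandas as pd

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.com', 'ddd'],
                  index=['A', 'B', 'C', 'D'])
df = s_org.str.split('@', expand=True)
df.columns = ['local', 'domain']

# Assign the two split columns to new names in one step
# (assumes every domain splits into at most two parts).
df[['second_LD', 'TLD']] = df['domain'].str.split('.', expand=True)

# Drop the source column once its parts have been stored.
df = df.drop('domain', axis=1)
print(df)
```

The number of names on the left must match the number of columns that the split produces, so this form suits data whose shape is known in advance.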

Related article

01_Pandas.DataFrame: changing row and column names

str.extract(): split by regular expression

To split by regular expression, use the string method str.extract().

Take the following pandas.Series as an example.

import pandas as pd

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.com', 'ddd'],
                  index=['A', 'B', 'C', 'D'])

print(s_org)
# A    aaa@xxx.com
# B    bbb@yyy.com
# C    ccc@zzz.com
# D            ddd
# dtype: object

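Pass a regular expression as the first argument. The substring matched by each group enclosed in parentheses () is extracted into its own column.

When the pattern contains multiple groups, a pandas.DataFrame is returned regardless of the expand argument.

Rows that do not match are filled with NaN.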

# Raw strings (r'...') keep the backslash in \. from being read as a string escape.
df = s_org.str.extract(r'(.+)@(.+)\.(.+)', expand=True)

print(df)
#      0    1    2
# A  aaa  xxx  com
# B  bbb  yyy  com
# C  ccc  zzz  com
# D  NaN  NaN  NaN

df = s_org.str.extract(r'(.+)@(.+)\.(.+)', expand=False)

print(df)
#      0    1    2
# A  aaa  xxx  com
# B  bbb  yyy  com
# C  ccc  zzz  com
# D  NaN  NaN  NaN
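Note that (.+) is greedy, which matters once a domain contains more than one dot. The sketch below uses a hypothetical address, 'aaa@xxx.co.jp', that is not in the original data, and compares the pattern above with a variant whose second group is made lazy with (.+?).

```python
import pandas as pd

s = pd.Series(['aaa@xxx.co.jp'])  # hypothetical address with two dots

# Greedy second group: it runs up to the last dot.
greedy = s.str.extract(r'(.+)@(.+)\.(.+)', expand=True)

# Lazy second group: it stops at the first dot.
lazy = s.str.extract(r'(.+)@(.+?)\.(.+)', expand=True)

print(greedy)
print(lazy)
```

The greedy pattern yields 'xxx.co' and 'jp', while the lazy one yields 'xxx' and 'co.jp'. Which one is appropriate depends on what the columns are meant to hold.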

When the pattern contains only one group, expand=True returns a pandas.DataFrame and expand=False returns a pandas.Series.

df_single = s_org.str.extract(r'(\w+)', expand=True)

print(df_single)
print(type(df_single))
#      0
# A  aaa
# B  bbb
# C  ccc
# D  ddd
# <class 'pandas.core.frame.DataFrame'>

s = s_org.str.extract(r'(\w+)', expand=False)

print(s)
print(type(s))
# A    aaa
# B    bbb
# C    ccc
# D    ddd
# dtype: object
# <class 'pandas.core.series.Series'>

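expand=False is the default in version 0.22.0, which was current when this was written, and that version warns that expand=True will become the default in the future. The default did change to expand=True in pandas 0.23.0, so passing expand explicitly is the safest choice.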

FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)

If the regular expression pattern uses named groups, (?P<name>...), the group names are used as the column names as they are.

df_name = s_org.str.extract(r'(?P<local>.*)@(?P<second_LD>.*)\.(?P<TLD>.*)', expand=True)

print(df_name)
#   local second_LD  TLD
# A   aaa       xxx  com
# B   bbb       yyy  com
# C   ccc       zzz  com
# D   NaN       NaN  NaN

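To update a pandas.DataFrame by splitting one of its columns into several, follow the str.split() example above: concatenate the result with the original pandas.DataFrame using pd.concat(), then remove the original column with the drop() method.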
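Applied to str.extract(), that procedure might look like the following sketch. The DataFrame with an 'email' column and the names `parts` and `df_new` are assumptions made for illustration; named groups are used so that no rename() step is needed.

```python
import pandas as pd

# Hypothetical DataFrame holding the addresses in an 'email' column.
df = pd.DataFrame({'email': ['aaa@xxx.com', 'bbb@yyy.com', 'ccc@zzz.com', 'ddd']},
                  index=['A', 'B', 'C', 'D'])

# Named groups become the column names of the extracted DataFrame.
parts = df['email'].str.extract(
    r'(?P<local>.*)@(?P<second_LD>.*)\.(?P<TLD>.*)', expand=True)

# Join the new columns to the original and drop the source column.
df_new = pd.concat([df, parts], axis=1).drop('email', axis=1)
print(df_new)
```

Row D does not match the pattern, so all three of its new columns are NaN.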
