pandas: the sample function explained

2023-12-11 02:16

1. Function definition

2. Purpose

3. Examples

4. Parameters

1. Function definition

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

2. Purpose

Returns a random sample of items from the specified axis of the object, similar to Python's random.sample() function.

3. Examples

1. First define some data with the following structure:

    import pandas as pd

    # Define a small data set
    df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                       'num_wings': [2, 0, 0, 0],
                       'num_specimen_seen': [10, 2, 1, 8]},
                      index=['falcon', 'dog', 'spider', 'fish'])
    print(df)
    """-----------------Output-----------------"""
            num_legs  num_wings  num_specimen_seen
    falcon         2          2                 10
    dog            4          0                  2
    spider         8          0                  1
    fish           0          0                  8

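2. Draw 3 random elements from the Series df['num_legs']. Note that random_state (which plays the same role as the seed in the random library) is set so that the example is reproducible. The result is three values picked at random from the 'num_legs' column of the data above.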

    extract = df['num_legs'].sample(n=3, random_state=1)
    print(extract)
    """------------Output----------"""
    fish      0
    spider    8
    falcon    2
    Name: num_legs, dtype: int64

3. replace=True means sampling with replacement, and frac=0.5 draws a random 50% of the data. Rows are sampled by default. For example:

    extract2 = df.sample(frac=0.5, replace=True, random_state=1)
    print(extract2)
    """-----------Output-----------"""
          num_legs  num_wings  num_specimen_seen
    dog          4          0                  2
    fish         0          0                  8

4. An upsampling example with frac=2. Note that replace=True is required when frac > 1. Rows are sampled by default.

    extract3 = df.sample(frac=2, replace=True, random_state=1)
    print(extract3)
    """-----------Output-----------"""
            num_legs  num_wings  num_specimen_seen
    dog            4          0                  2
    fish           0          0                  8
    falcon         2          2                 10
    falcon         2          2                 10
    fish           0          0                  8
    dog            4          0                  2
    fish           0          0                  8
    dog            4          0                  2
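The requirement that replace=True accompany frac > 1 can be checked directly. The following is a minimal sketch that rebuilds the same df; the variable names upsample_error and upsampled are illustrative and not part of the original article.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Upsampling (frac > 1) without replacement is rejected with a ValueError.
try:
    df.sample(frac=2, random_state=1)
    upsample_error = None
except ValueError as exc:
    upsample_error = exc

print(type(upsample_error).__name__)

# With replace=True the call succeeds; frac=2 on 4 rows yields 8 rows.
upsampled = df.sample(frac=2, replace=True, random_state=1)
print(len(upsampled))
```

Because the frame only has four rows, the eight-row result necessarily repeats some of them.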

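5. Using the values of a column as sampling weights. Here the weights come from the num_specimen_seen column, so rows with larger values in that column are more likely to be drawn. The column holds [10, 2, 1, 8], so the rows with the values 10 and 8 (falcon and fish) are the most likely to be picked, and the sampled result is consistent with that.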

    extract4 = df.sample(n=2, weights='num_specimen_seen', random_state=1)
    print(extract4)
    """------------Output------------"""
            num_legs  num_wings  num_specimen_seen
    falcon         2          2                 10
    fish           0          0                  8

4. Parameters

n: int, optional

Number of items from the axis to return. Cannot be used together with frac. Defaults to 1 if frac is None.

frac: float, optional

Fraction of the axis items to return. Cannot be used together with n.

Note: if frac > 1, replace must be set to True.
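As a rough illustration of how n and frac interact, the sketch below shows the default of one item, a fractional sample, and the error raised when both are passed. The small Series s and the variable names are made up for this example.

```python
import pandas as pd

s = pd.Series(range(10))

# Neither n nor frac given: a single item is returned.
default_sample = s.sample(random_state=0)

# frac=0.3 on 10 items returns 3 items.
fraction_sample = s.sample(frac=0.3, random_state=0)

# Passing both n and frac is an error.
try:
    s.sample(n=2, frac=0.5)
    both_error = None
except ValueError as exc:
    both_error = exc

print(len(default_sample), len(fraction_sample), type(both_error).__name__)
```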

replace: bool, default False

Whether to sample with replacement, that is, whether the same row may be drawn more than once.
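The effect of replace is easiest to see on a tiny object. This is a small sketch on a hypothetical three-element Series: without replacement each label appears at most once and you cannot ask for more items than exist, while with replacement labels may repeat.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# replace=False (the default): every label appears at most once.
without_replacement = s.sample(n=3, random_state=0)

# Asking for more items than exist fails unless replace=True.
try:
    s.sample(n=5, random_state=0)
    oversample_error = None
except ValueError as exc:
    oversample_error = exc

# replace=True: 5 draws from 3 labels must repeat at least one label.
with_replacement = s.sample(n=5, replace=True, random_state=0)

print(without_replacement.index.has_duplicates)
print(type(oversample_error).__name__)
print(with_replacement.index.has_duplicates)
```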

weights: str or ndarray-like, optional

The default None gives every item the same probability. If a Series is passed, it is aligned with the target object on the index: index values in weights that are not found in the sampled object are ignored, and index values in the sampled object that are not in weights are assigned a weight of zero. When called on a DataFrame, the name of a column is accepted when axis = 0. Unless weights is a Series, it must have the same length as the axis being sampled. If the weights do not sum to 1, they are normalized to sum to 1. Missing values in the weights column are treated as zero. Infinite values are not allowed.
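The index alignment described for Series weights can be sketched as follows. The weights Series, including the 'unicorn' label that does not exist in df, is invented for illustration; replace=True is used here only so that many draws can be made from the two rows that end up with non-zero weight.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# 'unicorn' is not in df.index, so it is ignored.
# 'dog' and 'spider' are missing from weights, so they get weight zero.
# The remaining weights (3 and 1) do not sum to 1 and are normalized.
weights = pd.Series({'falcon': 3.0, 'fish': 1.0, 'unicorn': 5.0})

picked = df.sample(n=50, replace=True, weights=weights, random_state=0)

# Only rows with a non-zero weight can appear in the result.
print(sorted(set(picked.index)))
```

Rows with zero weight can never be drawn, so whatever the seed, the result contains only falcon and fish.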

random_state: int or numpy.random.RandomState, optional

Seed for the random number generator (if an int), or a numpy RandomState object.
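A quick way to see what random_state buys you is to sample twice with the same seed. This sketch uses an arbitrary seed of 42 and a made-up Series; it only demonstrates reproducibility and makes no claim about which items a given seed selects.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(100))

# The same integer seed gives the same sample on every call.
first = s.sample(n=5, random_state=42)
second = s.sample(n=5, random_state=42)
print(first.equals(second))

# Two RandomState objects created from the same seed behave the same way.
third = s.sample(n=5, random_state=np.random.RandomState(42))
fourth = s.sample(n=5, random_state=np.random.RandomState(42))
print(third.equals(fourth))
```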

axis: {0 or 'index', 1 or 'columns', None}, default None

Axis to sample. Accepts the axis number or name. The default is the stat axis for the given data type (0 for both Series and DataFrames).
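The examples earlier all sample rows. As a minimal sketch of the other direction, passing axis=1 (or axis='columns') samples columns instead and leaves the rows untouched; the seed value is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Pick 2 of the 3 columns at random; all 4 rows are kept in their original order.
cols_by_number = df.sample(n=2, axis=1, random_state=0)
cols_by_name = df.sample(n=2, axis='columns', random_state=0)

print(cols_by_number.shape)
print(cols_by_number.equals(cols_by_name))
```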

Returns: Series or DataFrame

A new object of the same type as the caller, containing n items randomly sampled from the caller object.

Reference (official documentation): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
