Pandas

2023-03-16 15:55 | Source: compiled from the web

Users having pandas code in production and maintainers of libraries with pandas as a dependency are strongly recommended to run their test suites with the release candidate, and report any breaking change to our issue tracker before the official 2.0.0 release.

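On 2023-02-22, pandas published the release candidate for version 2.0.0. You can install this pre-release and try it out with: pip install pandas==2.0.0rc0

For more on installation and configuration, see our earlier post "Pandas 2.0 First Look (Part 1)".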

This installment covers the following:

Installing dependencies

Speed test

Installing dependencies

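Let's start with the test code. The download link for the test data, data_for_test.csv, can be obtained by replying pandas2-测试数据 to our official account. The test code is as follows:

    import pandas as pd

    # Set the global dtype backend so that pandas returns pyarrow-backed dtypes
    pd.options.mode.dtype_backend = 'pyarrow'

    # Read our csv file of about 1.5 GB, explicitly selecting engine='pyarrow'
    temp_df = pd.read_csv("./data_for_test.csv", engine='pyarrow', use_nullable_dtypes=True)

    # Inspect the first rows
    print(temp_df.head())

    # Inspect the resulting dtypes
    print(temp_df.dtypes)

When we try to run this script in the environment we configured earlier, we hit: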

ImportError: Missing optional dependency 'pyarrow.csv'. Use pip or conda to install pyarrow.csv.
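The ImportError above can be anticipated before read_csv is called. As a minimal sketch, the standard-library function importlib.util.find_spec can report whether pyarrow is importable, and the engine can be chosen from that. The helper name pick_csv_engine is hypothetical and not part of pandas; "c" is the name of pandas' default read_csv engine.

```python
import importlib.util


def pick_csv_engine(preferred="pyarrow", fallback="c"):
    """Return `preferred` if its backing package can be imported, else `fallback`.

    Hypothetical helper for illustration; not part of pandas.
    """
    # find_spec returns None when the top-level package is not installed,
    # and it does not import the package itself.
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


engine = pick_csv_engine()
print(f"read_csv engine to use: {engine}")
```

The returned value could then be passed on as pd.read_csv(path, engine=engine), so the script falls back to the default engine instead of failing when pyarrow is absent.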

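The ImportError is raised because pyarrow is not installed yet, so any attempt to use the Arrow project's data types fails; running pip install pyarrow resolves it. The dependency is needed mainly because of engine='pyarrow': with that argument, read_csv parses the csv file with pyarrow instead of pandas' default engine. With the dtype backend set to pyarrow and use_nullable_dtypes=True, the result uses pyarrow-backed nullable dtypes (that is, ArrowDtype) rather than the NumPy-backed nullable dtypes, and missing values are represented as pd.NA. The same options can also be set locally through a context manager: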

    import pandas as pd

    # Read the csv file with mode.dtype_backend set to "pandas",
    # which uses pandas' default engine and NumPy-backed nullable dtypes
    with pd.option_context("mode.dtype_backend", "pandas"):
        temp_df = pd.read_csv("./data_for_test.csv", use_nullable_dtypes=True)
        # Inspect the resulting dtypes
        print(temp_df.dtypes)

    >>> temp_df.dtypes
    symbol                   Int64
    name            string[python]
    date            string[python]
    open                   Float64
    close                  Float64
    high                   Float64
    low                    Float64
    volume                   Int64
    amount                 Float64
    zf                     Float64
    zdf                    Float64
    zde                    Float64
    turnover               Float64
    created_time    string[python]
    dtype: object

    # Read the csv file with mode.dtype_backend set to "pyarrow",
    # which returns pyarrow-backed dtypes
    with pd.option_context("mode.dtype_backend", "pyarrow"):
        temp_df = pd.read_csv("./data_for_test.csv", use_nullable_dtypes=True)
        # Inspect the resulting dtypes
        print(temp_df.dtypes)

    >>> temp_df.dtypes
    symbol           int64[pyarrow]
    name            string[pyarrow]
    date            string[pyarrow]
    open            double[pyarrow]
    close           double[pyarrow]
    high            double[pyarrow]
    low             double[pyarrow]
    volume           int64[pyarrow]
    amount          double[pyarrow]
    zf              double[pyarrow]
    zdf             double[pyarrow]
    zde             double[pyarrow]
    turnover        double[pyarrow]
    created_time    string[pyarrow]
    dtype: object

Speed test

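This test is not rigorous and is meant only as a reference for everyday use. It relies on the commonly used time.perf_counter() for a simple comparison of file read speed. Reading the same csv file of about 1.5 GB took roughly 17 seconds with pandas' default engine and roughly 1 second with the pyarrow engine.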
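The perf_counter timing pattern can be wrapped in a small reusable helper. This is only a sketch: the name time_call is hypothetical, and the workload below is a stand-in sum rather than the 1.5 GB csv read, so it says nothing about pandas performance itself.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds).

    Hypothetical helper for illustration.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the article's setting this would be a pd.read_csv call.
total, seconds = time_call(sum, range(1_000_000))
print(f"sum={total}, elapsed={seconds:.4f}s")
```

A single run like this is noisy; repeating the call a few times and keeping the smallest elapsed value usually gives a steadier number.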

    import pandas as pd
    import time

    # pandas dtype backend with the default engine
    start_time = time.perf_counter()
    with pd.option_context("mode.dtype_backend", "pandas"):
        temp_df = pd.read_csv("./data_for_test.csv", use_nullable_dtypes=True)
    end_time = time.perf_counter()
    print(end_time - start_time)

    >>> 17.37284319999162

    # pyarrow dtype backend with engine="pyarrow"
    start_time = time.perf_counter()
    with pd.option_context("mode.dtype_backend", "pyarrow"):
        temp_df = pd.read_csv("./data_for_test.csv", engine="pyarrow", use_nullable_dtypes=True)
    end_time = time.perf_counter()
    print(end_time - start_time)

    >>> 1.085339599987492

Coming next

For more hands-on test code, stay tuned for the next installment, "Pandas 2.0 First Look (Part 3)", which will demonstrate how to use the Copy-on-Write feature in pandas 2.0.0 to improve the performance of data operations.

