Notes on Parsing doc and docx Files


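This post covers how to parse doc and docx files, and in particular how to extract the table data they contain.

Before looking into it, I had always assumed the two formats used the same encoding and simply needed different versions of Word to open them. After reading the documentation carefully, I realized how naive that was.

So how do doc and docx differ?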

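1. Different storage formats [1]: doc is a binary format, while docx is a packaged (ZIP) file (see the Zhihu question "Other than paid software or libraries, how can a doc-format Word file be parsed in C++ or C#?"). Its structure [2] looks roughly like this: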

├── [Content_Types].xml
├── _rels
│   └── .rels
├── docProps
│   ├── app.xml
│   └── core.xml
└── word
    ├── _rels
    │   ├── document.xml.rels
    │   └── footnotes.xml.rels
    ├── document.xml ----------------- main file holding the text
    ├── fontTable.xml
    ├── footnotes.xml
    ├── media ------------------------ images, audio, video, etc. embedded in the document
    ├── numbering.xml
    ├── settings.xml
    ├── styles.xml
    ├── theme
    │   └── theme1.xml
    └── webSettings.xml

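Some of the parts above are required and others are optional; see the relevant documentation for details.

Since a docx file is a package, we can parse it with Python. For example, to get document.xml: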

    import zipfile

    def parseZip(filepath):
        # A .docx file is a ZIP archive; read the main document part from it
        with zipfile.ZipFile(filepath) as zipf:
            return zipf.read("word/document.xml")
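The bytes returned by parseZip() are raw XML, so a natural next step is to pull the paragraph text out of them. Below is a minimal sketch using only the standard library. The function names `paragraphs_from_docx` and `make_minimal_docx` are my own, not from any library, and the in-memory package built for the demo contains only `word/document.xml`, so it is not a file Word itself would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_from_docx(source):
    """Return the text of every paragraph (w:p) found in word/document.xml.

    `source` may be a file path or a binary file-like object.
    Paragraphs inside tables are included, because iter() walks the whole tree.
    """
    with zipfile.ZipFile(source) as zipf:
        root = ET.fromstring(zipf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        # The visible text of a paragraph is spread over its w:t elements
        paragraphs.append("".join(t.text or "" for t in p.iter("{%s}t" % W_NS)))
    return paragraphs


def make_minimal_docx(texts):
    """Demo helper (hypothetical): build an in-memory ZIP with one document.xml.

    The texts are inserted without XML escaping, so keep them plain.
    """
    body = "".join("<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % t for t in texts)
    xml = '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zipf:
        zipf.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(paragraphs_from_docx(make_minimal_docx(["hello", "world"])))
```

For a real file, passing its path to `paragraphs_from_docx` should work the same way, although headers, footnotes, and other parts live in separate XML files and are not covered here.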

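2. docx is easier to use across platforms, because it is essentially a package of XML files.

3. docx documents generally take up less space.

4. docx handles complex objects such as formulas, tables, and images more gracefully, because they can be described through XML.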

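After all that, only the first point really matters here, so let's parse along those lines.

I found that plenty of packages can do this, such as python-docx, docx2txt, and pythondocx, but they all work on docx documents and are of no help with doc. That's not a problem, though: we can convert the doc document to docx.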

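    import os
    from win32com.client import Dispatch

    w = Dispatch('Word.Application')
    w.Visible = 0
    w.DisplayAlerts = 0
    # Word resolves relative paths against its own working directory, so pass absolute paths
    doc = w.Documents.Open(os.path.abspath("input.doc"))
    doc.SaveAs(os.path.abspath("output.docx"), FileFormat=12)  # 12 = wdFormatXMLDocument (.docx)
    doc.Close()
    w.Quit()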
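The COM approach only works on Windows with Word installed. As an alternative that is not part of the original workflow, LibreOffice can do the same conversion in headless mode. The sketch below assumes the `soffice` executable is on the PATH; the function names `build_convert_command` and `convert_doc_to_docx` are my own.

```python
import os
import subprocess


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts doc_path to .docx."""
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", out_dir,
        doc_path,
    ]


def convert_doc_to_docx(doc_path, out_dir, soffice="soffice"):
    """Run the conversion and return the expected path of the new .docx file.

    Assumes LibreOffice is installed; raises CalledProcessError if the
    command exits with a non-zero status.
    """
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    base = os.path.splitext(os.path.basename(doc_path))[0]
    return os.path.join(out_dir, base + ".docx")
```

The conversion fidelity may differ from Word's own, so it is worth spot-checking a few converted tables before relying on it.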

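Now all of our data is in the same format.

The next task is to parse the docx file and get at its tables. After comparing the options, I found python-docx the most convenient.

First, we determine the format of the input document with a function judgeType().

Next, we parse the document.

Finally, we write the extracted tables to an Excel file.

The code is as follows: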

    # coding: utf-8
    import os
    import sys

    import docx
    import xlwt
    from win32com.client import Dispatch


    # Determine the input document's format from its file extension
    def judgeType(file_path):
        return os.path.splitext(file_path)[1].lower()


    # Convert a .doc file to .docx through Word (Windows only)
    def convertFormat(file_path, out_path):
        w = Dispatch('Word.Application')
        w.Visible = 0
        w.DisplayAlerts = 0
        try:
            # Word resolves relative paths against its own working directory
            doc = w.Documents.Open(os.path.abspath(file_path))
            doc.SaveAs(os.path.abspath(out_path), FileFormat=12)  # 12 = .docx
            doc.Close()
        finally:
            w.Quit()


    # Write every table in the .docx file to its own worksheet
    def tablesToExcel(docx_path, xls_path):
        doc = docx.Document(docx_path)
        book = xlwt.Workbook()
        for index, table in enumerate(doc.tables):
            sheet = book.add_sheet("%dsheet" % (index + 1))
            for i_r, row in enumerate(table.rows):
                for i_c, cell in enumerate(row.cells):
                    cell_data = [p.text for p in cell.paragraphs]
                    sheet.write(i_r, i_c, "\n".join(cell_data))
        book.save(xls_path)


    def main():
        file_path = sys.argv[1]
        file_type = judgeType(file_path)
        if file_type == '.doc':
            convertFormat(file_path, "tmp.docx")
            try:
                tablesToExcel("tmp.docx", sys.argv[2])
            finally:
                os.remove("tmp.docx")
        elif file_type == '.docx':
            tablesToExcel(file_path, sys.argv[2])


    if __name__ == '__main__':
        main()
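judgeType() trusts the file extension, which fails for files that have been renamed. A possible complement, sketched here as my own addition, is to look at the first bytes of the file: legacy doc files are OLE2 compound files and docx files are ZIP archives, and both containers have well-known signatures. The names `container_from_header` and `sniff_word_format` are hypothetical. Note that the check identifies only the container, so an .xls or .xlsx file would be reported the same way.

```python
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (legacy .doc)
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header (.docx)


def container_from_header(head):
    """Map the leading bytes of a file to '.doc', '.docx', or None."""
    if head.startswith(OLE_MAGIC):
        return ".doc"
    if head.startswith(ZIP_MAGIC):
        return ".docx"
    return None


def sniff_word_format(file_path):
    """Read the first 8 bytes of file_path and guess its container format."""
    with open(file_path, "rb") as f:
        return container_from_header(f.read(8))
```

In main(), one way to use this would be to fall back to `sniff_word_format(file_path)` whenever the extension is neither `.doc` nor `.docx`.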

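With that, our method for extracting tables from Word documents is complete.

One last note: tables with merged or split cells need extra handling beyond what is shown above.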
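As a rough idea of what that adjustment could look like, here is a sketch that works on the raw table XML rather than on python-docx objects. In WordprocessingML, a horizontally merged cell carries `w:gridSpan`, and the cells that continue a vertical merge carry `w:vMerge` without `w:val="restart"`. The function `table_to_grid` and the sample table are my own illustration; the sketch ignores nested tables, `w:gridBefore`, and cells wrapped in content controls.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return "{%s}%s" % (W_NS, tag)


def table_to_grid(tbl):
    """Expand a w:tbl element into rows with one entry per grid column.

    A cell spanning N columns keeps its text in the first column and leaves
    "" in the other N - 1. Continuation cells of a vertical merge become "".
    """
    grid = []
    for tr in tbl.findall(w("tr")):
        row = []
        for tc in tr.findall(w("tc")):
            span = 1
            continuation = False
            tc_pr = tc.find(w("tcPr"))
            if tc_pr is not None:
                grid_span = tc_pr.find(w("gridSpan"))
                if grid_span is not None:
                    span = int(grid_span.get(w("val"), "1"))
                v_merge = tc_pr.find(w("vMerge"))
                # <w:vMerge/> with no val (or val="continue") continues a merge
                if v_merge is not None and v_merge.get(w("val")) != "restart":
                    continuation = True
            text = "\n".join(
                "".join(t.text or "" for t in p.iter(w("t")))
                for p in tc.findall(w("p"))
            )
            row.append("" if continuation else text)
            row.extend([""] * (span - 1))
        grid.append(row)
    return grid


# Sample: "Header" spans two columns; "Side" is merged down over two rows
SAMPLE_TBL = (
    '<w:tbl xmlns:w="%s">'
    '<w:tr>'
    '<w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr>'
    '<w:p><w:r><w:t>Header</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr>'
    '<w:p><w:r><w:t>Side</w:t></w:r></w:p></w:tc>'
    '</w:tr>'
    '<w:tr>'
    '<w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:p><w:r><w:t>b</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>'
    '</w:tr>'
    '</w:tbl>' % W_NS
)

if __name__ == "__main__":
    for row in table_to_grid(ET.fromstring(SAMPLE_TBL)):
        print(row)
```

Because every row comes out with the same number of columns, the grid can be written to a worksheet cell by cell, and the empty strings mark where a merged range would need to be recreated on the Excel side.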

References:

[1] Difference Between DOC and DOCX

[2] https://geddy.cn/blog/item/jie-xi-docx
