Python: how to fetch the content of a given URL and save it to an HTML file

2023-07-15 10:57 | Source: compiled from the web

1. Fetching the content of a URL uses the request module of the standard library's urllib package.

    import urllib.request

    url = 'http://www.baidu.com'
    response = urllib.request.urlopen(url)  # send a GET request
    string = response.read()                # response body as bytes
    html = string.decode('utf-8')           # bytes -> str
    print(html)

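urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

(This is the signature from older Python 3 releases. The cafile, capath and cadefault parameters were deprecated in Python 3.6 and removed in 3.13; use context instead.)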

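For HTTP and HTTPS URLs, urlopen() returns an http.client.HTTPResponse object, which is defined in the standard library's http package. That package is a lower-level one that the request module calls internally.

read() returns a bytes object. Bytes are the raw data the computer works with and are not readable as text, so they have to be converted into a str.

A bytes object is converted to a str object with the bytes.decode() method.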
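The fetch-and-decode steps above can be wrapped in a small helper. This is only a minimal sketch, not the original author's code: the name fetch_html, the 10-second timeout and the utf-8 fallback are assumptions made for illustration. It takes the charset from the Content-Type header when the server declares one. The demo call uses a data: URL, which urllib handles locally, so it runs without a network connection.

```python
import urllib.request


def fetch_html(url, timeout=10, fallback_encoding="utf-8"):
    """Fetch url and return its body as str (hypothetical helper)."""
    # The with-statement closes the response even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes
        # Prefer the charset declared in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset)  # bytes -> str


# A data: URL is resolved by urllib itself, so no request leaves the machine.
demo = fetch_html("data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E")
print(demo)
```

With a real address, the call would look like fetch_html('http://www.baidu.com').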

2. To save the str content to an HTML file, use the built-in function open().

    f = open('lc.html', 'w')
    f.write(html)
    f.close()

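open() returns a file object (in text mode, an io.TextIOWrapper).

write() writes str content to the file object.

Finally, the file object has to be closed.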
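The open, write and close steps can also be written with a with-statement, which closes the file automatically. The snippet below is a sketch under a few assumptions: the html string is a short stand-in for a fetched page, and the file goes into a temporary directory so the example leaves nothing behind.

```python
import os
import tempfile

html = "<html><body>hello</body></html>"  # stand-in for the fetched page

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lc.html")
    with open(path, "w") as f:
        written = f.write(html)  # write() returns the number of characters written
    closed_after_block = f.closed  # the with-block has already closed the file
    with open(path) as f:
        saved = f.read()

print(written, closed_after_block)
```

No explicit f.close() call is needed here, and the file is closed even if write() raises.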

3. Note: with url set to http://www.baidu.com, writing the file as shown above fails with this error:

　　UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 29531: illegal multibyte sequence

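Cause: open() was called without an encoding argument, so Python used the platform's default encoding. Opening the generated lc.html in Notepad shows its encoding as ANSI, which on a Simplified Chinese system means GB2312 (Python reports the codec as gbk). That codec cannot represent the character '\xbb'.

(Different countries and regions defined their own standards, which produced encodings such as GB2312, BIG5 and JIS. These extended encodings, which use 2 bytes to represent a Chinese or Japanese character, are collectively called ANSI encodings on Windows. On a Simplified Chinese system ANSI means GB2312; on a Japanese system it means JIS.)

How can lc.html be stored as UTF-8 instead?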

    f = open('lc.html', 'w', encoding='utf-8')
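The difference between the two codecs can be reproduced without fetching anything. The following sketch uses a short sample string containing '\xbb', the character named in the error message. The sample text and the temporary file location are assumptions for illustration.

```python
import os
import tempfile

text = "\xbb sample"  # contains the character from the error message

# gbk has no mapping for '\xbb', so encoding fails just as the write did.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

utf8_bytes = text.encode("utf-8")  # UTF-8 can represent this character

# Writing and reading back with an explicit encoding works on any platform.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lc.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

print(gbk_ok, round_trip == text)
```

Passing encoding explicitly makes the result independent of the operating system's locale.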

4. Note: if the URL above is changed to https://www.foxizy.com/v-5zoc-235f23.html, this error appears:

　　UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

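Cause: the server sends the content in compressed (gzip) form. A browser decompresses it automatically, but urlopen() does not.

Here are the headers of the HTTP response: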

    >>> response.getheaders()
    [('Server', 'nginx'), ('Date', 'Sun, 23 Jun 2019 00:25:46 GMT'),
     ('Content-Type', 'text/html; '), ('Transfer-Encoding', 'chunked'),
     ('Connection', 'close'), ('Cache-Control', 'public, max-age=252000'),
     ('Expires', 'Mon, 24 Jun 2019 07:14:39 GMT'),
     ('Last-Modified', 'Fri, 21 Jun 2019 09:14:39 GMT'),
     ('Content-Encoding', 'gzip'), ('N-Cache', 'HIT')]

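The ('Content-Encoding', 'gzip') entry shows that the server compressed the content with gzip, so it has to be decompressed.

To fix the problem, decompress the bytes before decoding them: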

    import zlib

    # MAX_WBITS | 16 makes zlib expect a gzip header and trailer
    string = zlib.decompress(string, zlib.MAX_WBITS | 16)
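The decompression step can be combined with decoding in one helper. This is a sketch, with decode_body as a hypothetical name. It simulates a gzip-compressed response body locally with gzip.compress instead of contacting a server.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress raw if it is gzip data, then decode it (hypothetical helper)."""
    # gzip streams start with the magic bytes 0x1f 0x8b; the 0x8b at
    # position 1 is the byte the UnicodeDecodeError complains about.
    if content_encoding == "gzip" or raw[:2] == b"\x1f\x8b":
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        raw = zlib.decompress(raw, zlib.MAX_WBITS | 16)
    return raw.decode(charset)


# Simulate a gzip-compressed response body locally.
compressed = gzip.compress("<p>你好</p>".encode("utf-8"))
print(decode_body(compressed, content_encoding="gzip"))
```

With a real response, the second argument would come from response.getheader('Content-Encoding').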

5. Note: if urlopen() is called with only a URL, the HTTP request is sent as a GET request.

How is a POST request made?

    from urllib import request, parse

    url = 'http://httpbin.org/post'
    d = {'name': '张三'}
    da = parse.urlencode(d)             # form-encode the dict as a str
    data = bytes(da, encoding='utf-8')  # urlopen() requires bytes
    response = request.urlopen(url, data=data)
    print(response.read().decode('utf-8'))

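Passing the second argument, data, turns the request into a POST. The data argument must be a bytes object.

6. Note: urlopen() has no parameter for request headers. To send custom headers, build a urllib.request.Request object and pass it to urlopen():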
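What urlencode() produces, and how the presence of data changes the request method, can be checked without sending anything. The sketch below only constructs Request objects. The variable names get_req and post_req are illustrative.

```python
from urllib import parse, request

url = "http://httpbin.org/post"
d = {"name": "张三"}

encoded = parse.urlencode(d)    # percent-encodes non-ASCII text as UTF-8 by default
data = encoded.encode("utf-8")  # same result as bytes(encoded, encoding='utf-8')

# Constructing a Request does not send anything over the network.
get_req = request.Request(url)
post_req = request.Request(url, data=data)

print(encoded)
print(get_req.get_method(), post_req.get_method())
```

The method defaults to GET when data is None and to POST otherwise.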

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

    from urllib import request, parse

    url = 'http://httpbin.org/post'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36',
        'Host': 'httpbin.org'
    }
    form = {'name': 'zhangsan'}  # form fields; avoids shadowing the built-in dict
    data = bytes(parse.urlencode(form), encoding='utf-8')
    # HTTP method names are case-sensitive, so use 'POST' rather than 'post'
    req = request.Request(url=url, data=data, headers=headers, method='POST')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))
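Before sending, a Request object can be inspected to confirm what will go out. The following sketch builds a request but never calls urlopen(). The shortened User-Agent value is an illustrative placeholder, not a real browser string.

```python
from urllib import parse, request

url = "http://httpbin.org/post"
headers = {"User-Agent": "Mozilla/5.0 (example)"}  # shortened, illustrative value
data = parse.urlencode({"name": "zhangsan"}).encode("utf-8")

req = request.Request(url=url, data=data, headers=headers, method="POST")

# Request stores header names via str.capitalize(), so the key to look up
# is 'User-agent' rather than 'User-Agent'.
print(req.get_method())
print(req.get_header("User-agent"))
print(req.data)
```

This kind of check helps when a server rejects a request and it is unclear which headers or body were actually prepared.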
