Web scraping with Python

2024-02-03 23:59

My advisor gave me a task: scrape user-behavior information from a web project he built.

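Until now I had only scraped some images from Baidu, which was fairly simple, and a quick search turns up plenty of templates for it. Working on this small task, though, I realized I had never studied scraping in any depth and had a lot of gaps: different content and different sites call for different methods.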

Talk is cheap, show me the code.

I'll paste the code first and then walk through it:

import json
import requests
from selenium import webdriver
import time
import pandas as pd
from bs4 import BeautifulSoup
from lxml import etree
from PIL import Image
import urllib.request
import re
import os
import html5lib
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


def loadPage(htmler, judge, page, part):
    soup = BeautifulSoup(htmler, "lxml")
    # Extract the <table>/<td> content
    tables = soup.find_all('table')
    # print(tables)
    # time_start = time.time()
    for i in range(len(tables)):
        df_tables = pd.read_html(str(tables[i]))
        for j in range(len(df_tables)):
            df = df_tables[j]
            # If the data needs further processing, don't write the CSV yet:
            # keep it in memory, then make one more pass at the end and write it out.
            if judge == 0:
                csv_name = os.path.join(
                    'table',
                    str(page) + '_' + str(part - 1) + '_' + str(i) + '_' + str(j) + '.csv')
            else:
                csv_name = os.path.join(
                    'table',
                    str(page) + '_' + str(part - 1) + '_' + str(i) + '_' + str(j) + 'gen' + '.csv')
            df.to_csv(csv_name, index=False, header=False)
    # time_end = time.time()
    # print('time to fetch the data and write the tables cost', time_end - time_start, 's')
    # Extract the <ul>/<li> content
    count = 0
    for li in soup.find_all(name='li'):
        count += 1
        if (count >= 2 and count
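To make the table-export step easier to follow on its own, here is a minimal sketch of the same idea using only the standard library. It is not the code above: it swaps `BeautifulSoup` and `pd.read_html` for `html.parser`, and the names `TableExtractor`, `csv_name`, `save_tables` and the `out_dir` parameter are hypothetical, made up for this illustration. It assumes tables are not nested, and it creates the output directory first, which `loadPage` does not do. The file naming (`<page>_<part-1>_<i>_<j>.csv`, with `gen` added before the extension when `judge` is non-zero) follows the listing above.

```python
import csv
import os
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every non-nested <table> as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables, each a list of rows
        self._rows = None   # rows of the table currently being parsed
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def csv_name(page, part, i, j, judge, out_dir="table"):
    """Build the output path with the same naming scheme as loadPage."""
    suffix = "" if judge == 0 else "gen"
    return os.path.join(out_dir, f"{page}_{part - 1}_{i}_{j}{suffix}.csv")


def save_tables(html, judge, page, part, out_dir="table"):
    """Write each table found in `html` to its own CSV; return the paths."""
    parser = TableExtractor()
    parser.feed(html)
    os.makedirs(out_dir, exist_ok=True)  # assumption: the directory may not exist yet
    written = []
    for i, rows in enumerate(parser.tables):
        # Each <table> yields exactly one grid here, so j is always 0.
        path = csv_name(page, part, i, 0, judge, out_dir)
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        written.append(path)
    return written
```

For example, `save_tables(html, 0, 1, 2)` on a page with two tables would write `table/1_1_0_0.csv` and `table/1_1_1_0.csv`. For real pages with messy markup, the `BeautifulSoup` plus `pandas` route in the listing above is likely the more robust choice.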
