Fetching and Parsing Web Pages with a C# Crawler


2024-07-05 09:51 | Source: compiled from the web | Views: 265

Notes on C# Web Crawlers

I. Introduction

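After learning how to write web crawlers in C#, I was struck by how powerful C# is and how much fun crawlers are, so I am writing down what I learned for future study and review. There are two programs here. The first is a simple page-fetching program that downloads a page and then parses it with regular expressions to extract the information we want. The second is an enhanced version that drives a headless browser through Selenium.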

II. A C# Web Crawler

There are several key points:

First: the source to crawl

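Before crawling a site, we need to know what its pages contain and which information matters. Once we have examined that, we can work out how many pages need to be fetched. There are two traversal strategies, depth-first and breadth-first, and either one eventually reaches every page we want.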
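The difference between the two strategies comes down to the data structure that holds the pending URLs. The following is a minimal sketch, assuming a hypothetical in-memory link graph (the `Links` dictionary and its page names are invented for illustration) in place of real HTTP requests:

```csharp
using System;
using System.Collections.Generic;

public static class CrawlOrder
{
    // Hypothetical link graph standing in for real pages: page -> links found on it.
    static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        ["/"] = new[] { "/a", "/b" },
        ["/a"] = new[] { "/a1" },
        ["/b"] = new[] { "/b1" },
        ["/a1"] = new string[0],
        ["/b1"] = new string[0],
    };

    // Breadth-first: a queue visits pages level by level.
    public static List<string> BreadthFirst(string start)
    {
        var visited = new HashSet<string> { start };
        var order = new List<string>();
        var queue = new Queue<string>();
        queue.Enqueue(start);
        while (queue.Count > 0)
        {
            var page = queue.Dequeue();
            order.Add(page);
            foreach (var link in Links[page])
                if (visited.Add(link)) queue.Enqueue(link);
        }
        return order;
    }

    // Depth-first: a stack follows one branch to its end before backtracking.
    public static List<string> DepthFirst(string start)
    {
        var visited = new HashSet<string>();
        var order = new List<string>();
        var stack = new Stack<string>();
        stack.Push(start);
        while (stack.Count > 0)
        {
            var page = stack.Pop();
            if (!visited.Add(page)) continue;
            order.Add(page);
            // Push in reverse so that the first link is explored first.
            for (int i = Links[page].Length - 1; i >= 0; i--)
                stack.Push(Links[page][i]);
        }
        return order;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", BreadthFirst("/"))); // / /a /b /a1 /b1
        Console.WriteLine(string.Join(" ", DepthFirst("/")));   // / /a /a1 /b /b1
    }
}
```

In a real crawler the dictionary lookup would be replaced by fetching the page and extracting its links; the visited set is what keeps either strategy from fetching the same page twice.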

Second: processing the source

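What we fetch is the whole page, and much of it is information we do not need. There are two ways to extract the useful parts: one is to use regular expressions, the other is to query the DOM (Document Object Model) structure with XPath expressions.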
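As an illustration of the regular-expression approach, here is a minimal sketch that pulls the URL and text out of anchor tags. The `LinkExtractor` class and the sample markup are invented for this example; the pattern is only adequate for simple, regular markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Matches <a ... href="...">text</a> and captures the URL and the link text.
    // Nested or malformed HTML needs a real parser instead.
    static readonly Regex Anchor = new Regex(
        @"<a[^>]+href=""(?<href>[^""]+)""[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns (href, text) pairs in document order.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Anchor.Matches(html))
            result.Add(new KeyValuePair<string, string>(
                m.Groups["href"].Value, m.Groups["text"].Value));
        return result;
    }

    public static void Main()
    {
        // Invented sample markup, for illustration only.
        var html = "<ul><li><a class=\"c\" href=\"/hotel/beijing1\">Beijing</a></li>" +
                   "<li><a href=\"/hotel/shanghai2\">Shanghai</a></li></ul>";
        foreach (var link in Extract(html))
            Console.WriteLine(link.Value + " -> " + link.Key);
    }
}
```

Named groups such as `(?<href>...)` keep the extraction code readable, because each captured piece is looked up by name rather than by position.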

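Third: threads are needed during processing

More precisely, we need asynchronous tasks, so we have to learn how they work and understand how the twin keywords async and await are used.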

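a) The await operator can only be used inside a method marked async.
b) The await operator applies to awaitable objects, most commonly Task and Task<T>.
c) Suppose method A calls method B without awaiting it, and B awaits method C in its body. If C performs an asynchronous operation, B suspends at the await and continues only after that operation has completed. A, however, carries on immediately and does not wait for B to finish.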

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class Program
    {
        static void Main(string[] args)
        {
            Test();                              // returns at its first await, so Main continues
            Console.WriteLine("Test End!");
            Console.ReadLine();
        }

        // async void is used here only so that Main does not wait; prefer async Task in real code.
        static async void Test()
        {
            await Test1();
            Console.WriteLine("Test1 End!");
        }

        static Task Test1()
        {
            Thread.Sleep(1000);                  // runs synchronously on the caller's thread
            Console.WriteLine("create task in test1");
            return Task.Run(() =>
            {
                Thread.Sleep(3000);
                Console.WriteLine("Test1");
            });
        }
    }

This is roughly equivalent to the following code:

    static void Main(string[] args)
    {
        Test();
        Console.WriteLine("Test End!");
        Console.ReadLine();
    }

    static void Test()
    {
        var test1 = Test1();
        // The continuation runs on another thread once the task has finished.
        Task.Run(() =>
        {
            test1.Wait();
            Console.WriteLine("Test1 End!");
        });
    }

    static Task Test1()
    {
        Thread.Sleep(1000);
        Console.WriteLine("create task in test1");
        return Task.Run(() =>
        {
            Thread.Sleep(3000);
            Console.WriteLine("Test1");
        });
    }

Fourth: lambda expressions appear everywhere in C#, so we need a solid understanding of them.

For example:

    cityCrawler.OnStart += (s, e) =>
    {
        Console.WriteLine("Crawler started fetching: " + e.Uri.ToString());
    };

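Here the lambda is an anonymous event handler: s is the sender and e is the event data. Only with a solid grasp of lambda expressions can we read and write code like this comfortably.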

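Fifth:

How does the crawler pose as a browser when it fetches pages? And if we fetch a large number of pages, the server will become suspicious, so we use proxies to get around this.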
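Both ideas, a browser-like User-Agent and an optional proxy, can be sketched with the standard HttpClient API. This is an illustrative sketch rather than the crawler built later in the post; the `DisguisedClient` class is hypothetical and the proxy address is a placeholder, not a working server.

```csharp
using System;
using System.Net;
using System.Net.Http;

public static class DisguisedClient
{
    // Builds an HttpClient that sends a browser-like User-Agent and,
    // optionally, routes its traffic through an HTTP proxy.
    public static HttpClient Create(string userAgent, string proxy = null)
    {
        var handler = new HttpClientHandler();
        if (proxy != null)
        {
            handler.Proxy = new WebProxy(proxy);
            handler.UseProxy = true;
        }
        var client = new HttpClient(handler);
        client.DefaultRequestHeaders.UserAgent.ParseAdd(userAgent);
        return client;
    }

    public static void Main()
    {
        // Placeholder proxy address; substitute a real proxy before sending requests.
        var client = Create(
            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
            "http://127.0.0.1:8888");
        Console.WriteLine(client.DefaultRequestHeaders.UserAgent.ToString());
    }
}
```

Rotating through several proxy addresses, one client per address, spreads the requests across different source IPs.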

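Sixth:

Understanding EventHandler. A handler for this delegate takes two parameters: the sender, which is the object that raised the event, and an event-data object. The event-data object is a class we define ourselves, and its type is supplied through the delegate's generic parameter, as in EventHandler<TEventArgs>.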
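A minimal sketch of that pattern follows. The names `PageEventArgs` and `MiniCrawler` are hypothetical and exist only for this example.

```csharp
using System;

// Custom event data: derives from EventArgs and carries what the handler needs.
public class PageEventArgs : EventArgs
{
    public Uri Uri { get; }
    public PageEventArgs(Uri uri) { Uri = uri; }
}

public class MiniCrawler
{
    // EventHandler<T> fixes the handler signature to (object sender, T e).
    public event EventHandler<PageEventArgs> OnStart;

    public void Start(Uri uri)
    {
        // sender = the object raising the event, e = the custom event data.
        OnStart?.Invoke(this, new PageEventArgs(uri));
    }
}

public static class EventDemo
{
    public static void Main()
    {
        var crawler = new MiniCrawler();
        crawler.OnStart += (sender, e) =>
            Console.WriteLine(sender.GetType().Name + " started: " + e.Uri);
        crawler.Start(new Uri("http://example.com/"));
    }
}
```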

Seventh: concurrent processing.

    Parallel.For(0, 2, (i) =>
    {
        var hotel = hotelList[i];
        hotelCrawler.Start(hotel.Uri);
    });

The For method is defined as follows:

    // Summary:
    //     Executes a for (For in Visual Basic) loop in which iterations may run in parallel.
    //
    // Parameters:
    //   fromInclusive:
    //     The start index, inclusive.
    //
    //   toExclusive:
    //     The end index, exclusive.
    //
    //   body:
    //     The delegate that is invoked once per iteration.
    //
    // Returns:
    //     A structure that contains information about which portion of the loop completed.
    public static ParallelLoopResult For(int fromInclusive, int toExclusive, Action<int> body);

Eighth: timing

    var watch = new Stopwatch();
    watch.Start();
    // ... the code being timed ...
    watch.Stop();
    var milliseconds = watch.ElapsedMilliseconds;

Ninth: posing as a browser

    public async Task<string> Start(Uri uri, string proxy = null)
    {
        return await Task.Run(() =>
        {
            var pageSource = string.Empty;
            try
            {
                if (this.OnStart != null) this.OnStart(this, new OnStartEventArgs(uri));
                var watch = new Stopwatch();
                watch.Start();
                var request = (HttpWebRequest)WebRequest.Create(uri);
                request.Accept = "*/*";
                request.ServicePoint.Expect100Continue = false; // speeds up loading
                request.ServicePoint.UseNagleAlgorithm = false; // disable the Nagle algorithm to speed up loading
                request.AllowWriteStreamBuffering = false; // disable write buffering to speed up loading
                request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate"); // accept gzip/deflate-compressed pages
                request.ContentType = "application/x-www-form-urlencoded"; // document type and encoding
                request.AllowAutoRedirect = false; // do not follow redirects automatically
                // Set the User-Agent to pose as Google Chrome
                request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36";
                request.Timeout = 5000; // request timeout of 5 seconds
                request.KeepAlive = true; // keep the connection alive
                request.Method = "GET"; // use the GET method
                if (proxy != null) request.Proxy = new WebProxy(proxy); // proxy server IP, masks the requesting address
                request.CookieContainer = this.CookiesContainer; // attach the cookie container
                request.ServicePoint.ConnectionLimit = int.MaxValue; // maximum number of connections
                using (var response = (HttpWebResponse)request.GetResponse()) // get the response
                {
                    // Add the cookies to the container to preserve the login state
                    foreach (Cookie cookie in response.Cookies) this.CookiesContainer.Add(cookie);
                    if (response.ContentEncoding.ToLower().Contains("gzip")) // gzip-compressed body
                    {
                        using (GZipStream stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress))
                        {
                            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                            {
                                pageSource = reader.ReadToEnd();
                            }
                        }
                    }
                    else if (response.ContentEncoding.ToLower().Contains("deflate")) // deflate-compressed body
                    {
                        using (DeflateStream stream = new DeflateStream(response.GetResponseStream(), CompressionMode.Decompress))
                        {
                            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                            {
                                pageSource = reader.ReadToEnd();
                            }
                        }
                    }
                    else
                    {
                        using (Stream stream = response.GetResponseStream()) // uncompressed body
                        {
                            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                            {
                                pageSource = reader.ReadToEnd();
                            }
                        }
                    }
                }
                request.Abort();
                watch.Stop();
                var threadId = System.Threading.Thread.CurrentThread.ManagedThreadId; // ID of the current task thread
                var milliseconds = watch.ElapsedMilliseconds; // time taken by the request
                if (this.OnCompleted != null) this.OnCompleted(this, new OnCompletedEventArgs(uri, threadId, milliseconds, pageSource));
            }
            catch (Exception ex)
            {
                if (this.OnError != null) this.OnError(this, new OnErrorEventArgs(uri, ex));
            }
            return pageSource;
        });
    }

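This is the core of the crawler. We pose as a browser, set the request parameters, send the request to the server, and read the response, decompressing it when necessary. The result is then handed to the event handlers, which parse and display it.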
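The manual gzip and deflate branches can also be delegated to the framework through the AutomaticDecompression property. The sketch below shows this with HttpClient; it is an alternative under stated assumptions, not the method used in this post. The `PageFetcher` class is hypothetical and the URL is a placeholder.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    // The handler decompresses gzip/deflate response bodies itself,
    // so no GZipStream/DeflateStream branches are needed.
    public static HttpClientHandler CreateHandler()
    {
        return new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
            AllowAutoRedirect = false,               // as in the listing above
            CookieContainer = new CookieContainer()  // keeps cookies between requests
        };
    }

    public static async Task<string> FetchAsync(HttpClient client, Uri uri)
    {
        using (var response = await client.GetAsync(uri))
        {
            response.EnsureSuccessStatusCode(); // throws on non-2xx, including redirects
            return await response.Content.ReadAsStringAsync();
        }
    }

    public static void Main()
    {
        using (var client = new HttpClient(CreateHandler()) { Timeout = TimeSpan.FromSeconds(5) })
        {
            try
            {
                // Placeholder URL; any reachable page works.
                var html = FetchAsync(client, new Uri("http://example.com/")).GetAwaiter().GetResult();
                Console.WriteLine("Fetched " + html.Length + " characters");
            }
            catch (Exception ex)
            {
                Console.WriteLine("Request failed: " + ex.Message);
            }
        }
    }
}
```

The same property also exists on HttpWebRequest, so the listing above could set request.AutomaticDecompression and keep only the final, uncompressed branch.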

Tenth: tracking the state of the task.

    // Constructor of the OnCompletedEventArgs class
    public OnCompletedEventArgs(Uri uri, int threadId, long milliseconds, string pageSource)
    {
        this.Uri = uri;
        this.ThreadId = threadId;
        this.Milliseconds = milliseconds;
        this.PageSource = pageSource;
    }

    // Constructor of the OnErrorEventArgs class
    public OnErrorEventArgs(Uri uri, Exception exception)
    {
        this.Uri = uri;
        this.Exception = exception;
    }

    // Constructor of the OnStartEventArgs class
    public OnStartEventArgs(Uri uri)
    {
        this.Uri = uri;
    }

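There are three states: started, completed, and failed. Each one is exposed as a delegate-based event that the rest of the program subscribes to, which keeps the crawler separate from the code that consumes its results.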

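    public event EventHandler<OnStartEventArgs> OnStart;         // raised when the crawler starts
    public event EventHandler<OnCompletedEventArgs> OnCompleted; // raised when the crawler completes
    public event EventHandler<OnErrorEventArgs> OnError;         // raised when the crawler fails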

    cityCrawler.OnStart += (s, e) =>
    {
        Console.WriteLine("Crawler started fetching: " + e.Uri.ToString());
    };

    cityCrawler.OnError += (s, e) =>
    {
        Console.WriteLine("Crawler error at: " + e.Uri.ToString() + ", exception message: " + e.Exception.Message);
    };

    cityCrawler.OnCompleted += (s, e) =>
    {
        // Use a regular expression to extract the data from the page source
        var links = Regex.Matches(e.PageSource, @"<a[^>]+href=""*(?<href>/hotel/[^>\s]+)""\s*[^>]*>(?<text>(?!.*img).*?)</a>", RegexOptions.IgnoreCase);
        foreach (Match match in links)
        {
            var city = new City
            {
                CityName = match.Groups["text"].Value,
                Uri = new Uri("http://hotels.ctrip.com" + match.Groups["href"].Value)
            };
            // Contains relies on City.Equals; without an override it compares references.
            if (!cityList.Contains(city)) cityList.Add(city); // add the item to the generic list
            Console.WriteLine(city.CityName + "|" + city.Uri); // print the city name and URL
        }
        Console.WriteLine("===============================================");
        Console.WriteLine("Crawl finished! " + links.Count + " cities in total.");
        Console.WriteLine("Elapsed: " + e.Milliseconds + " ms");
        Console.WriteLine("Thread: " + e.ThreadId);
        Console.WriteLine("URL: " + e.Uri.ToString());
    };

Eleventh: proxy servers and test servers.

    // Check whether the proxy IP is in effect: http://1212.ip138.com/ic.asp
    // Check the crawler's current User-Agent: http://www.whatismyuseragent.net

III. An Enhanced Web Crawler

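Building on the simple version, this time we no longer forge browser requests by hand. Instead, we use dedicated tools to load and parse the page for us.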

First, we need four DLLs:

Next, as before, we start by defining an interface:

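    public interface ICrawler
    {
        event EventHandler<OnStartEventArgs> OnStart;         // raised when the crawler starts
        event EventHandler<OnCompletedEventArgs> OnCompleted; // raised when the crawler completes
        event EventHandler<OnErrorEventArgs> OnError;         // raised when the crawler fails
        Task Start(Uri uri, Script script, Operation operation); // starts the crawler
    }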

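Then we need two tools, PhantomJS and Selenium. The former is a headless WebKit-based browser that renders the page. The latter is a browser automation framework, originally built for automated testing, that drives the page so that the server sees what looks like a real person operating it.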

    private PhantomJSOptions _options;       // PhantomJS engine options
    private PhantomJSDriverService _service; // Selenium driver service configuration

    public StrongCrawler(string proxy = null)
    {
        this._options = new PhantomJSOptions(); // PhantomJS options object
        // Initialize the Selenium service, passing the directory that contains phantomjs.exe
        this._service = PhantomJSDriverService.CreateDefaultService(Environment.CurrentDirectory);
        _service.IgnoreSslErrors = true;         // ignore certificate errors
        _service.WebSecurity = false;            // disable web security
        _service.HideCommandPromptWindow = true; // hide the console window
        _service.LoadImages = false;             // do not load images
        _service.LocalToRemoteUrlAccess = true;  // allow local resources to respond to remote URLs
        _options.AddAdditionalCapability(@"phantomjs.page.settings.userAgent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36");
        if (proxy != null)
        {
            _service.ProxyType = "HTTP"; // use an HTTP proxy
            _service.Proxy = proxy;      // proxy IP and port
        }
        else
        {
            _service.ProxyType = "none"; // no proxy
        }
    }

Next comes the main routine, an asynchronous Task operation:

    public async Task Start(Uri uri, Script script, Operation operation)
    {
        await Task.Run(() =>
        {
            if (OnStart != null) this.OnStart(this, new OnStartEventArgs(uri));
            var driver = new PhantomJSDriver(_service, _options); // instantiate the PhantomJS WebDriver
            try
            {
                var watch = Stopwatch.StartNew();
                driver.Navigate().GoToUrl(uri.ToString()); // request the URL
                if (script != null) driver.ExecuteScript(script.Code, script.Args); // run JavaScript code
                if (operation.Action != null) operation.Action.Invoke(driver);
                var driverWait = new WebDriverWait(driver, TimeSpan.FromMilliseconds(operation.Timeout)); // timeout in milliseconds
                if (operation.Condition != null) driverWait.Until(operation.Condition);
                var threadId = System.Threading.Thread.CurrentThread.ManagedThreadId; // ID of the current task thread
                var milliseconds = watch.ElapsedMilliseconds; // total time taken by the request
                var pageSource = driver.PageSource; // the page's DOM
                this.OnCompleted?.Invoke(this, new OnCompletedEventArgs(uri, threadId, milliseconds, pageSource, driver));
            }
            catch (Exception ex)
            {
                this.OnError?.Invoke(this, new OnErrorEventArgs(uri, ex));
            }
            finally
            {
                driver.Close();
                driver.Quit();
            }
        });
    }

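This differs from the simple version in several ways. First, the method returns no value. Second, we use PhantomJSDriver in place of our hand-built HTTP request, with some configuration of its own. Third, the events are raised through the Invoke() method. On a delegate, Invoke() simply calls the subscribed handlers synchronously on the current thread, exactly like calling the delegate directly; it does not hand the work over to the main thread. (Marshalling a call onto the UI thread is what Control.Invoke does in Windows Forms, which is a different method.)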
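The behavior of Invoke() on a delegate can be checked with a small experiment. The sketch below uses a hypothetical `InvokeDemo` class: it raises an event from a thread-pool thread and records which thread the handler ran on.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class InvokeDemo
{
    public static event EventHandler Completed;

    // Raises the event on a thread-pool thread and reports whether
    // the handler ran on that same thread.
    public static bool HandlerRunsOnRaisingThread()
    {
        int handlerThread = -1;
        EventHandler handler = (s, e) => handlerThread = Thread.CurrentThread.ManagedThreadId;
        Completed += handler;
        try
        {
            int raisingThread = Task.Run(() =>
            {
                // Same as Completed(null, EventArgs.Empty), plus a null check.
                Completed?.Invoke(null, EventArgs.Empty);
                return Thread.CurrentThread.ManagedThreadId;
            }).Result;
            return handlerThread == raisingThread;
        }
        finally
        {
            Completed -= handler;
        }
    }

    public static void Main()
    {
        Console.WriteLine(HandlerRunsOnRaisingThread()); // True
    }
}
```

Because handlers run on the crawler's worker thread, any handler that touches UI controls or shared collections has to do its own synchronization.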

PhantomJS is a headless WebKit that is scriptable through a JavaScript API. It supports a range of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.

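Selenium supports many drivers, counting both the official ones and those announced by third parties. Besides desktop browsers, there are also drivers for iPhone and Android.

The desktop drivers are all browser-based and fall into two categories: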

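The first category is drivers for real browsers.

For example, Safari and Firefox are driven through a browser plug-in, while IE and Chrome are driven through a separate binary. These drivers launch the browser directly and control it through its low-level interfaces, so they simulate real user scenarios most faithfully. They are mainly used for web compatibility testing.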

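The second category is drivers for pseudo-browsers.

The pseudo-browsers that Selenium supports include HtmlUnit and PhantomJS. Neither is a real browser and neither has a GUI; they are browser-like programs that can parse HTML and execute JavaScript. They do not display the page on screen, but they do support locating page elements, executing JS, and so on. Because nothing is painted to a GUI, they run much faster than a real browser, and they are mainly used for functional testing.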

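HtmlUnit is a browser-like program implemented in Java. It is included in Selenium Server, needs no separate driver, and can be instantiated directly. Its JavaScript engine is Rhino.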

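PhantomJS is a standalone third-party browser-like application that can process HTML, JS, CSS, and so on. Its driver, GhostDriver, is bundled into the main program in versions after 1.9.3, so only the main executable needs to be downloaded. Its JavaScript engine is JavaScriptCore, the engine that ships with WebKit, rather than Chrome's V8.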

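Now for the main function. Here we define an Operation class whose purpose is to describe the actions of a normal user so that Selenium can carry them out.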

    var operation = new Operation
    {
        Action = (x) =>
        {
            // Use the Selenium driver to click the "Hotel Reviews" tab on the page
            x.FindElement(By.XPath("//*[@id='commentTab']")).Click();
        },
        Condition = (x) =>
        {
            // Check whether the Ajax-loaded reviews have finished loading.
            // "点评载入中" is the "reviews loading" placeholder text shown by the site.
            return x.FindElement(By.XPath("//*[@id='commentList']")).Displayed
                && x.FindElement(By.XPath("//*[@id='hotel_info_comment']/div[@id='commentList']")).Displayed
                && !x.FindElement(By.XPath("//*[@id='hotel_info_comment']/div[@id='commentList']")).Text.Contains("点评载入中");
        },
        Timeout = 5000
    };

Finally, the parsing method:

    private static void HotelCrawler(OnCompletedEventArgs e)
    {
        //Console.WriteLine(e.PageSource);
        //File.WriteAllText(Environment.CurrentDirectory + "\\cc.html", e.PageSource, Encoding.UTF8);
        var hotelName = e.WebDriver.FindElement(By.XPath("//*[@id='J_htl_info']/div[@class='name']/h2[@class='cn_n']")).Text;
        var address = e.WebDriver.FindElement(By.XPath("//*[@id='J_htl_info']/div[@class='adress']")).Text;
        var price = e.WebDriver.FindElement(By.XPath("//*[@id='div_minprice']/p[1]")).Text;
        var score = e.WebDriver.FindElement(By.XPath("//*[@id='divCtripComment']/div[1]/div[1]/span[3]/span")).Text;
        var reviewCount = e.WebDriver.FindElement(By.XPath("//*[@id='commentTab']/a")).Text;
        var comments = e.WebDriver.FindElement(By.XPath("//*[@id='hotel_info_comment']/div[@id='commentList']/div[1]/div[1]/div[1]"));
        var currentPage = Convert.ToInt32(comments.FindElement(By.XPath("div[@class='c_page_box']/div[@class='c_page']/div[contains(@class,'c_page_list')]/a[@class='current']")).Text);
        var totalPage = Convert.ToInt32(comments.FindElement(By.XPath("div[@class='c_page_box']/div[@class='c_page']/div[contains(@class,'c_page_list')]/a[last()]")).Text);
        var messages = comments.FindElements(By.XPath("div[@class='comment_detail_list']/div"));
        var nextPage = Convert.ToInt32(comments.FindElement(By.XPath("div[@class='c_page_box']/div[@class='c_page']/div[contains(@class,'c_page_list')]/a[@class='current']/following-sibling::a[1]")).Text);
        Console.WriteLine();
        Console.WriteLine("Name: " + hotelName);
        Console.WriteLine("Address: " + address);
        Console.WriteLine("Price: " + price);
        Console.WriteLine("Score: " + score);
        Console.WriteLine("Reviews: " + reviewCount);
        Console.WriteLine("Pages: current (" + currentPage + ") next (" + nextPage + ") total (" + totalPage + ") per page (" + messages.Count + ")");
        Console.WriteLine();
        Console.WriteLine("===============================================");
        Console.WriteLine();
        Console.WriteLine("Review content:");
        foreach (var message in messages)
        {
            Console.WriteLine("Account: " + message.FindElement(By.XPath("div[contains(@class,'user_info')]/p[@class='name']")).Text);
            Console.WriteLine("Room type: " + message.FindElement(By.XPath("div[@class='comment_main']/p/a")).Text);
            var text = message.FindElement(By.XPath("div[@class='comment_main']/div[@class='comment_txt']/div[1]")).Text;
            // Show at most the first 50 characters; shorter reviews are printed in full.
            Console.WriteLine("Content: " + text.Substring(0, Math.Min(50, text.Length)) + "....");
            Console.WriteLine();
            Console.WriteLine();
        }
        Console.WriteLine();
        Console.WriteLine("===============================================");
        Console.WriteLine("URL: " + e.Uri.ToString());
        Console.WriteLine("Elapsed: " + e.Milliseconds + " ms");
    }

As we can see, we used PhantomJS together with Selenium to query the DOM and extract the data we were after.

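C# is convenient and easy to debug, but it has shortcomings too: the code is verbose, and fetching pages takes a lot of work. Next time we will use Python, a language well suited to scraping, to do the crawling.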
