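## Official documentation

[https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpproxy](https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpproxy)

The documentation says: **set the `proxy` key in a `Request` object's metadata to enable a proxy.**

That is:

```python
request.meta['proxy'] = 'http://127.0.0.1:7880'
```

So where can we set the proxy?

## Setting the proxy

### Method 1: set it directly on the request object

When constructing the `Request` object, put the proxy into its `meta`:

```python
import scrapy
from scrapy.http import TextResponse


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.ip.cn']
    start_urls = ['https://www.ip.cn/api/index?ip=&type=0']

    def start_requests(self):
        for url in self.start_urls:
            # Attach the proxy to this request through its meta dict
            yield scrapy.Request(url=url, method="GET", callback=self.parse_start_url,
                                 meta={'proxy': 'http://49.82.146.124:888'}, dont_filter=True)

    def parse_start_url(self, response):
        print(response.text)

    def parse(self, response: TextResponse):
        print(response.text)
```

### Method 2: set it automatically with a middleware

Scrapy's built-in `HttpProxyMiddleware` can be roughly understood as follows: it checks whether the `request` object asks for a proxy (that is, whether the request's `meta` has a `proxy` key). If it does, the middleware handles proxy authentication (when the proxy has a username and password) and applies the proxy.

The middleware we write therefore only has to tell the built-in middleware, "I need a proxy, please set one for me!"

When the request object passes through our custom middleware, we attach this "proxy requirement" to it. When the request then reaches the built-in `HttpProxyMiddleware`, the proxy is applied.

Code example:

```python
class ProxyMiddleware(object):

    def get_proxies(self) -> dict:
        """
        Get the proxies.

        :return: proxies for http and https
        :rtype: dict
        """
        # Placeholder value: replace with your own proxy source
        return {'http': 'http://127.0.0.1:7880', 'https': 'http://127.0.0.1:7880'}

    def process_request(self, request, spider):
        proxies = self.get_proxies()
        print(proxies)
        # Pick the proxy that matches the scheme of the request URL
        if request.url.startswith("http://"):
            request.meta['proxy'] = proxies['http']
        elif request.url.startswith("https://"):
            request.meta['proxy'] = proxies['https']
        return None
```

Finally, activate the middleware.

> Notes on middleware priority:
>
> For requests, the smaller the value, the earlier the middleware runs.
>
> For responses, the larger the value, the earlier the middleware runs.
>
> A value of `None` disables the middleware.

Priority values of the built-in middlewares:

```python
{'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
 'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
 'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500}
```

In the list above, the built-in proxy middleware has priority 750. As explained, our custom middleware first states the "proxy requirement", and the built-in middleware then handles authentication and applies the proxy accordingly. Our custom middleware must therefore run first on the request path, which means its priority value must be lower than 750.

Activate the middleware in `settings.py`:

```python
DOWNLOADER_MIDDLEWARES = {
    'origin.middlewares.ProxyMiddleware': 543,
}
```

Last modification: December 10th, 2020 at 06:25 pm

© Reposting permitted with proper attribution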
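
As a supplement to Method 2, here is a minimal sketch of the same idea that can be tried outside a crawl. The class name `SchemeProxyMiddleware`, the `PROXIES` table, and the `fake_request` stand-in are all hypothetical and not part of Scrapy or the original post; the sketch only assumes that a request exposes `url` and `meta`, as `scrapy.Request` does. Unlike the example above, it leaves a proxy already set by the spider (Method 1) untouched.

```python
from types import SimpleNamespace
from urllib.parse import urlsplit


class SchemeProxyMiddleware:
    """Hypothetical downloader middleware: picks a proxy by URL scheme."""

    # Placeholder proxy table (assumption); 127.0.0.1:7880 is the local
    # proxy address used at the top of this post.
    PROXIES = {
        "http": "http://127.0.0.1:7880",
        "https": "http://127.0.0.1:7880",
    }

    def process_request(self, request, spider):
        # Respect a proxy that was already set on this request.
        if request.meta.get("proxy"):
            return None
        # Look up the proxy for this URL's scheme; unknown schemes get none.
        scheme = urlsplit(request.url).scheme
        proxy = self.PROXIES.get(scheme)
        if proxy is not None:
            request.meta["proxy"] = proxy
        # Returning None lets the request continue down the middleware chain.
        return None


# Stand-in for scrapy.Request, used only to demonstrate the logic.
fake_request = SimpleNamespace(url="https://www.ip.cn/api/index?ip=&type=0", meta={})
SchemeProxyMiddleware().process_request(fake_request, spider=None)
print(fake_request.meta)
```

In a real project such a class would be registered in `DOWNLOADER_MIDDLEWARES` with a value below 750, for the same reason given above.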