1. Integrating Scrapy with Selenium

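Scrapy fetches pages much like the requests library does: both issue HTTP requests directly, so Scrapy on its own cannot scrape pages that are rendered dynamically by JavaScript. Earlier posts in this series covered two ways to scrape JavaScript-rendered pages. The first is to analyze the Ajax requests, find the underlying API, and scrape that; Scrapy can do this too. The second is to drive a real browser with Selenium, in which case we do not need to care about the requests made behind the page or about how it is rendered, only about the final result: what you can see, you can scrape. So if Scrapy is integrated with Selenium, it can handle JavaScript-rendered sites as well.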

1.1 Creating the project

First create a new project named scrapyseleniumtest.

scrapy startproject scrapyseleniumtest

Create a new Spider.

scrapy genspider jd www.jd.com

Set ROBOTSTXT_OBEY to False.

ROBOTSTXT_OBEY = False

[Image 1 from the original post, "Crawler (17): Scrapy Framework (4): Scraping JD product data with Selenium"]

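1.2 Defining the Item

We do not use an Item class here; the spider yields plain dicts instead.

Start with an initial implementation of the Spider's start_requests() method.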

# -*- coding: utf-8 -*-
from scrapy import Request,Spider
from urllib.parse import quote
from bs4 import BeautifulSoup

class JdSpider(Spider):
    name = 'jd'
    allowed_domains = ['www.jd.com']
    base_url = 'https://search.jd.com/Search?keyword='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(keyword)
                # dont_filter=True disables duplicate filtering, since every page shares the same URL
                yield Request(url=url, callback=self.parse, meta={'page': page}, dont_filter=True)

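We first define base_url, the URL of the product list. Appending a search keyword to it gives the JD search results page for that keyword.

The keywords are stored in KEYWORDS, defined as a list, and the maximum page number in MAX_PAGE. Both are defined in settings.py.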

KEYWORDS = ['iPad']
MAX_PAGE = 2

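In start_requests(), we loop over the keywords and the page numbers, then build and yield a Request for each combination. Because the search URL is the same for every page of a keyword, the page number is passed through the meta argument, and dont_filter is set so that the duplicate filter does not drop the repeated URLs. When the spider starts, it therefore generates one request for every page of every keyword's product list.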
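
As a quick check of the loop described above, the following sketch (plain Python, no Scrapy required) lists the URL and meta value each request would carry. The helper name build_search_requests is hypothetical and exists only for illustration; the base URL and the settings values are the ones used in this post.

```python
from urllib.parse import quote

BASE_URL = 'https://search.jd.com/Search?keyword='


def build_search_requests(keywords, max_page):
    """Return (url, meta) pairs in the order start_requests() would yield them.

    Hypothetical helper: it mirrors the spider's nested loop but returns
    plain tuples instead of scrapy.Request objects.
    """
    pairs = []
    for keyword in keywords:
        for page in range(1, max_page + 1):
            # Every page of one keyword shares the same URL;
            # only the page number stored in meta differs.
            pairs.append((BASE_URL + quote(keyword), {'page': page}))
    return pairs


if __name__ == '__main__':
    for url, meta in build_search_requests(['iPad'], 2):
        print(url, meta)
```

Both pairs for a keyword carry the same URL, which is why the spider has to disable duplicate filtering and why the middleware later relies on meta['page'] to decide which page to open.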

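1.3 Integrating Selenium

Next we need to handle the fetching of these requests. This time we fetch with Selenium, implemented as a Downloader Middleware. The middleware drives Selenium, takes the rendered page source, builds an HtmlResponse object and returns it directly to the spider for parsing and data extraction; the downloader no longer downloads the page itself.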

class SeleniumMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self, timeout=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.Chrome()
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        # quit() ends the whole browser session; close() only closes the current window
        self.browser.quit()

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to create the middleware; the timeout comes from settings.py
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'))

    def process_request(self, request, spider):
        '''
        Fetch the page with Selenium inside the downloader middleware, build an
        HtmlResponse from the rendered page source and return it directly to the
        spider for parsing and data extraction.
        The downloader itself no longer downloads the page.
        Documentation for the HtmlResponse object:
        :param request:
        :param spider:
        :return:
        '''

        self.logger.debug('Chrome is starting')
        page = request.meta.get('page', 1)
        # self.browser.set_page_load_timeout(30)
        # self.browser.set_script_timeout(30)
        try:
            self.browser.get(request.url)
            if page > 1:
                # Assign the jump-to-page input box to input; EC.presence_of_element_located checks that the box has been loaded
                input = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > input')))
                # Assign the confirm button of the page jump to submit; EC.element_to_be_clickable checks that the button is clickable
                submit = self.wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > a')))
                input.clear()
                input.send_keys(page)
                submit.click()  # click the button
                time.sleep(5)

                # Check that the highlighted page number matches the requested page; EC.text_to_be_present_in_element checks that the given text appears in the element
                self.wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.curr'), str(page)))
                # Wait for #J_goodsList, which holds the page data, to load before returning the page source
                self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList')))
            return HtmlResponse(url=request.url, body=self.browser.page_source, request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)

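First, __init__() initializes several objects, including the WebDriverWait, and sets the window size and the page-load timeout. In process_request(), we read the page number to crawl from the Request's meta attribute. For pages after the first, we locate the jump-to-page input box and the confirm button next to it, type the page number into the box, click the button, and wait for the page to load. We then return an HtmlResponse, a subclass of Response, built from the browser's page source, so the request never goes through Scrapy's downloader. (When a downloader middleware's process_request() returns a Response object, the remaining process_request() methods are not called and the download itself is skipped; Scrapy goes straight to the process_response() methods of the enabled middlewares. Nothing in this project's process_response() changes the response, so the result is handed to the Spider for parsing.)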
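
The short-circuit behaviour described above can be illustrated without Scrapy or a browser. The sketch below is a simplified model, not Scrapy's actual implementation: run_download, PassThroughMiddleware and FakeSeleniumMiddleware are hypothetical names, and requests and responses are plain strings. It shows that once one process_request() returns a response, the later process_request() methods and the downloader are skipped, while the process_response() methods still run.

```python
def run_download(middlewares, request, downloader):
    """Simplified model of a downloader middleware chain (hypothetical)."""
    response = None
    # process_request() hooks run in list order.
    for mw in middlewares:
        response = mw.process_request(request)
        if response is not None:
            break  # a response was produced: skip the rest and the downloader
    if response is None:
        response = downloader(request)
    # process_response() hooks run in reverse order for every middleware.
    for mw in reversed(middlewares):
        response = mw.process_response(request, response)
    return response


class PassThroughMiddleware:
    """Stand-in for an ordinary middleware that never builds a response."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request):
        self.log.append(self.name + '.process_request')
        return None

    def process_response(self, request, response):
        self.log.append(self.name + '.process_response')
        return response


class FakeSeleniumMiddleware(PassThroughMiddleware):
    """Stand-in for SeleniumMiddleware: always returns a 'rendered' page."""

    def process_request(self, request):
        self.log.append(self.name + '.process_request')
        return '<rendered html for ' + request + '>'


if __name__ == '__main__':
    log = []
    chain = [
        PassThroughMiddleware('first', log),
        FakeSeleniumMiddleware('selenium', log),
        PassThroughMiddleware('last', log),
    ]
    print(run_download(chain, 'https://search.jd.com/Search?keyword=iPad',
                       lambda request: 'raw download'))
    print(log)
```

In this model the hook 'last.process_request' never appears in the log and the downloader callable is never invoked, which mirrors why the Selenium-rendered page reaches the Spider without a second download.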

1.4 Parsing the page

The Response object is passed back to the Spider's callback for parsing, so the next step is to implement that callback and parse the page.

def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    lis = soup.find_all(name='li', class_="gl-item")
    for li in lis:
        proc_dict = {}
        dp = li.find(name='span', class_="J_im_icon")
        if dp:
            proc_dict['dp'] = dp.get_text().strip()
        else:
            continue
        id = li.attrs['data-sku']
        title = li.find(name='div', class_="p-name p-name-type-2")
        proc_dict['title'] = title.get_text().strip()
        price = li.find(name='strong', class_="J_" + id)
        proc_dict['price'] = price.get_text()
        comment = li.find(name='a', id="J_comment_" + id)
        # '条评论' means "reviews"; it is appended to the review count
        proc_dict['comment'] = comment.get_text() + '条评论'
        url = 'https://item.jd.com/' + id + '.html'
        proc_dict['url'] = url
        proc_dict['type'] = 'JINGDONG'
        yield proc_dict

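Here we parse with BeautifulSoup: we match all product entries, iterate over them, and extract each product's fields in turn.

1.5 Storing the results

Once the page data has been extracted, it is sent to the item pipeline for processing, cleaning, storage and so on, so we now need to define an item pipeline. Here we store the data in MongoDB.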

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

class MongoPipeline(object):

    def __init__(self,mongo_url,mongo_db,collection):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.collection = collection

    # from_crawler is a class method, marked with @classmethod. It is a form of dependency injection, and its argument is the crawler.
    # Through the crawler we can read every global setting, i.e. anything defined in settings.py.
    # So this method mainly serves to fetch configuration from settings.py.
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('COLLECTION')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # name = item.__class__.collection
        name = self.collection
        # insert_one() replaces Collection.insert(), which was deprecated in PyMongo 3 and removed in PyMongo 4
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self,spider):
        self.client.close()
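
To see the order in which the pipeline hooks above are called without a running MongoDB server, here is a minimal sketch with an in-memory stand-in. InMemoryPipeline and FakeCrawler are hypothetical names used only for illustration; in a real crawl Scrapy itself makes these calls, and the settings object is Scrapy's own rather than a plain dict.

```python
class FakeCrawler:
    """Hypothetical stand-in that exposes only the .settings attribute."""

    def __init__(self, settings):
        self.settings = settings


class InMemoryPipeline:
    """Same hooks as MongoPipeline, but items are kept in a dict of lists."""

    def __init__(self, collection):
        self.collection = collection
        self.store = None
        self.closed = False

    @classmethod
    def from_crawler(cls, crawler):
        # Read configuration from the crawler's settings, as MongoPipeline does.
        return cls(collection=crawler.settings.get('COLLECTION'))

    def open_spider(self, spider):
        # Called once when the spider starts: set up the storage.
        self.store = {}

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.store.setdefault(self.collection, []).append(dict(item))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources.
        self.closed = True


if __name__ == '__main__':
    pipeline = InMemoryPipeline.from_crawler(FakeCrawler({'COLLECTION': 'ProductItem'}))
    pipeline.open_spider(spider=None)
    pipeline.process_item({'title': 'iPad', 'type': 'JINGDONG'}, spider=None)
    pipeline.close_spider(spider=None)
    print(pipeline.store)
```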

1.6 Configuring the settings file

Every setting used by the project goes in settings.py. This project uses KEYWORDS, MAX_PAGE, SELENIUM_TIMEOUT (the page-load timeout, in seconds), MONGO_URL, MONGO_DB and COLLECTION.

KEYWORDS=['iPad']
MAX_PAGE=2

MONGO_URL = 'localhost'
MONGO_DB = 'test'
COLLECTION = 'ProductItem'

SELENIUM_TIMEOUT = 30

Also update the following settings to enable the downloader middleware and the item pipeline.

DOWNLOADER_MIDDLEWARES = {
   'scrapyseleniumtest.middlewares.SeleniumMiddleware': 543,
}

ITEM_PIPELINES = {
   'scrapyseleniumtest.pipelines.MongoPipeline': 300,
}

1.7 Results

With all the code and settings for the project in place, run it.

scrapy crawl jd

[Image 2 from the original post]

After the run, the data can be seen in MongoDB, which confirms that the crawl succeeded.

[Image 3 from the original post]

1.8 Complete code

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
from scrapy import Item,Field

class ProductItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # dp = Field()
    # title = Field()
    # price = Field()
    # comment = Field()
    # url = Field()
    # type = Field()
    pass

jd.py:

# -*- coding: utf-8 -*-
from scrapy import Request,Spider
from urllib.parse import quote
from bs4 import BeautifulSoup

class JdSpider(Spider):
    name = 'jd'
    allowed_domains = ['www.jd.com']
    base_url = 'https://search.jd.com/Search?keyword='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(keyword)
                # dont_filter=True disables duplicate filtering, since every page shares the same URL
                yield Request(url=url, callback=self.parse, meta={'page': page}, dont_filter=True)

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        lis = soup.find_all(name='li', class_="gl-item")
        for li in lis:
            proc_dict = {}
            dp = li.find(name='span', class_="J_im_icon")
            if dp:
                proc_dict['dp'] = dp.get_text().strip()
            else:
                continue
            id = li.attrs['data-sku']
            title = li.find(name='div', class_="p-name p-name-type-2")
            proc_dict['title'] = title.get_text().strip()
            price = li.find(name='strong', class_="J_" + id)
            proc_dict['price'] = price.get_text()
            comment = li.find(name='a', id="J_comment_" + id)
            # '条评论' means "reviews"; it is appended to the review count
            proc_dict['comment'] = comment.get_text() + '条评论'
            url = 'https://item.jd.com/' + id + '.html'
            proc_dict['url'] = url
            proc_dict['type'] = 'JINGDONG'
            yield proc_dict

middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from urllib.parse import urlencode
from scrapy.http import HtmlResponse
from logging import getLogger
from selenium.common.exceptions import TimeoutException
import time


class ScrapyseleniumtestSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class SeleniumMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self, timeout=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.Chrome()
        self.browser.set_window_size(1400, 700)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        # quit() ends the whole browser session; close() only closes the current window
        self.browser.quit()

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to create the middleware; the timeout comes from settings.py
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'))

    def process_request(self, request, spider):
        '''
        Fetch the page with Selenium inside the downloader middleware, build an
        HtmlResponse from the rendered page source and return it directly to the
        spider for parsing and data extraction.
        The downloader itself no longer downloads the page.
        Documentation for the HtmlResponse object:
        :param request:
        :param spider:
        :return:
        '''

        self.logger.debug('Chrome is starting')
        page = request.meta.get('page', 1)
        # self.browser.set_page_load_timeout(30)
        # self.browser.set_script_timeout(30)
        try:
            self.browser.get(request.url)
            if page > 1:
                # Assign the jump-to-page input box to input; EC.presence_of_element_located checks that the box has been loaded
                input = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > input')))
                # Assign the confirm button of the page jump to submit; EC.element_to_be_clickable checks that the button is clickable
                submit = self.wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > a')))
                input.clear()
                input.send_keys(page)
                submit.click()  # click the button
                time.sleep(5)

                # Check that the highlighted page number matches the requested page; EC.text_to_be_present_in_element checks that the given text appears in the element
                self.wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.curr'), str(page)))
                # Wait for #J_goodsList, which holds the page data, to load before returning the page source
                self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList')))
            return HtmlResponse(url=request.url, body=self.browser.page_source, request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)


    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

class MongoPipeline(object):

    def __init__(self,mongo_url,mongo_db,collection):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.collection = collection

    # from_crawler is a class method, marked with @classmethod. It is a form of dependency injection, and its argument is the crawler.
    # Through the crawler we can read every global setting, i.e. anything defined in settings.py.
    # So this method mainly serves to fetch configuration from settings.py.
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB'),
            collection=crawler.settings.get('COLLECTION')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # name = item.__class__.collection
        name = self.collection
        # insert_one() replaces Collection.insert(), which was deprecated in PyMongo 3 and removed in PyMongo 4
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self,spider):
        self.client.close()

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for scrapyseleniumtest project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapyseleniumtest'

SPIDER_MODULES = ['scrapyseleniumtest.spiders']
NEWSPIDER_MODULE = 'scrapyseleniumtest.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapyseleniumtest (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapyseleniumtest.middlewares.ScrapyseleniumtestSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'scrapyseleniumtest.middlewares.ScrapyseleniumtestDownloaderMiddleware': 543,
#}
DOWNLOADER_MIDDLEWARES = {
   'scrapyseleniumtest.middlewares.SeleniumMiddleware': 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'scrapyseleniumtest.pipelines.ScrapyseleniumtestPipeline': 300,
#}
ITEM_PIPELINES = {
   'scrapyseleniumtest.pipelines.MongoPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
KEYWORDS=['iPad']
MAX_PAGE=2

MONGO_URL = 'localhost'
MONGO_DB = 'test'
COLLECTION = 'ProductItem'

SELENIUM_TIMEOUT = 30