A brief introduction to persistent storage options in Scrapy

Anonymous, 2019-04-10, Essay

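By storage target there are two approaches: writing to a file on disk and writing to a database.

The original spider code is as follows: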

# -*- coding: utf-8 -*-
import scrapy


class FirstfileSpider(scrapy.Spider):
    name = 'firstfile'
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Use the framework's XPath interface
        list_div = response.xpath('//div[@id="content-left"]/div')
        for div in list_div:
            author = div.xpath("./div/a[2]/h2/text()").extract()[0]
            content = div.xpath("./a/div/span/text()").extract()[0]

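a. Disk files

1. Persistent storage via a terminal command

First make sure the parse method returns an iterable object that holds the parsed page content, then use a terminal command to write that data to the specified file on disk. The code is modified as follows: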

# -*- coding: utf-8 -*-
import scrapy


class FirstfileSpider(scrapy.Spider):
    name = 'firstfile'
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Use the framework's XPath interface
        list_div = response.xpath('//div[@id="content-left"]/div')
        # Holds the parsed page data
        data_list = []
        for div in list_div:
            author = div.xpath("./div/a[2]/h2/text()").extract()[0]
            content = div.xpath("./a/div/span/text()").extract()[0]

            res_dict = {
                "author": author,
                "content": content,
            }
            data_list.append(res_dict)
        return data_list

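In the terminal, run the following command, where firstfile is the spider's name: scrapy crawl firstfile -o test.csv

Other output formats work as well. Open test.csv to see the result.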
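
To make the shape of the exported file concrete, here is a minimal standard-library sketch that writes the same kind of list of dicts to CSV text. It only illustrates the output format and is not Scrapy's own exporter; the `rows_to_csv` helper and the sample rows are hypothetical.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data shaped like the dicts returned by parse()
sample_rows = [
    {"author": "alice", "content": "first joke"},
    {"author": "bob", "content": "second, with a comma"},
]
csv_text = rows_to_csv(sample_rows, ["author", "content"])
print(csv_text)
```

Each dict key becomes a column, and a field that contains a comma is quoted. Scrapy picks the export format from the file extension, so a name such as test.json should produce JSON instead.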

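2. Pipeline-based persistent storage

Much like models in Django, items.py defines the data template that gives the scraped data its structure, while pipelines.py handles persistence. In short, an item first holds the data parsed from the page, and the pipeline then carries out the storage operations.

The implementation flow:

1. Store the parsed page data in an item object

2. Use the yield keyword to submit the item to the pipeline file for processing

3. Write the storage code in the pipeline file

4. Enable the pipeline in the settings file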
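
The item flow in the steps above can be pictured with a toy, standard-library simulation. This is only a sketch of the order in which the pipeline hooks fire, not how Scrapy is implemented: `FilePipeline`, `fake_parse`, and `run` are hypothetical names, and plain dicts stand in for item objects.

```python
import os
import tempfile


class FilePipeline:
    """Toy pipeline with the same three hooks a Scrapy pipeline can define."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once, before any item arrives
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once per yielded item
        self.fp.write(f"{item['author']}:{item['content']}\n")
        return item

    def close_spider(self, spider):
        # Called once, after the last item
        self.fp.close()


def fake_parse():
    """Stand-in for a spider's parse(): yields one item at a time."""
    yield {"author": "alice", "content": "first joke"}
    yield {"author": "bob", "content": "second joke"}


def run(pipeline, items, spider=None):
    """Hypothetical driver that mimics the hook order: open, process, close."""
    pipeline.open_spider(spider)
    try:
        for item in items:
            pipeline.process_item(item, spider)
    finally:
        pipeline.close_spider(spider)


out_path = os.path.join(tempfile.mkdtemp(), "test.txt")
run(FilePipeline(out_path), fake_parse())
with open(out_path, encoding="utf-8") as fh:
    written = fh.read()
print(written)
```

Opening the file once and closing it once keeps the per-item work down to a single write, which is why the file handle lives on the pipeline object rather than inside process_item.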

firstproject.py (the spider file)

# -*- coding: utf-8 -*-
import scrapy
from firstproject.items import FirstprojectItem


class FirstfileSpider(scrapy.Spider):
    name = 'firstfile'
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        list_div = response.xpath('//div[@id="content-left"]/div')
        for div in list_div:
            author = div.xpath("./div/a[2]/h2/text()").extract()[0]
            content = div.xpath(".//div[@class='content']/span/text()").extract()[0]
            print(author)
            print(content)
            # 1. Store the data in an item object
            item = FirstprojectItem()
            item['author'] = author
            item['content'] = content
            # 2. Submit the item to the pipeline
            yield item

items.py

import scrapy

class FirstprojectItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()

pipelines.py

class FirstprojectPipeline(object):
    fp = None

    def open_spider(self, spider):
        # Called once, when the spider starts
        print("Spider started")
        self.fp = open("./test.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once for every item the spider submits to the pipeline
        # item is the item object that was received
        author = item["author"]
        content = item["content"]
        # Persist the record, one per line
        self.fp.write(author + ":" + content + "\n")
        return item

    def close_spider(self, spider):
        # Called once, when the spider finishes
        print("Spider finished")
        self.fp.close()

Add the following to settings.py:

ITEM_PIPELINES = {
   'firstproject.pipelines.FirstprojectPipeline': 300,
}

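b. Databases

Code flow: the same as pipeline-based persistence; the only difference is how the pipelines file persists the data.

1. MySQL-based persistent storage

Building on the above, only the pipelines file needs to change. Make sure the target table, with matching columns, has been created in MySQL beforehand.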

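import pymysql


class FirstprojectPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        # Scrapy passes the spider to this hook, so the parameter is required.
        # Also supply the password and database name for your own MySQL setup.
        self.conn = pymysql.connect(host="127.0.0.1", port=3306, user='root')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # item is the item object that was received
        author = item["author"]
        content = item["content"]
        # Let the driver escape the values instead of formatting them into the SQL
        sql = 'insert into qiubai values (%s, %s)'
        try:
            self.cursor.execute(sql, (author, content))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()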
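
Since a MySQL server may not be at hand, the following sketch uses the standard-library sqlite3 module as a stand-in to show the two things this step relies on: a table created ahead of time and an insert that commits on success and rolls back on failure. The column types and the `save_item` helper are assumptions, not part of the original project. A parameterized insert is used because a value containing quotes can break a statement built by string formatting.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# The table must exist before the pipeline runs; column types are assumed
cursor.execute("CREATE TABLE qiubai (author TEXT, content TEXT)")


def save_item(item):
    """Insert one item with a parameterized query; roll back on failure."""
    try:
        # sqlite3 uses ? placeholders; pymysql uses %s in the same position
        cursor.execute(
            "INSERT INTO qiubai VALUES (?, ?)",
            (item["author"], item["content"]),
        )
        conn.commit()
    except sqlite3.Error as exc:
        print(exc)
        conn.rollback()


# A value containing a double quote is stored unchanged
save_item({"author": 'al"ice', "content": "first joke"})
rows = cursor.execute("SELECT author, content FROM qiubai").fetchall()
print(rows)
conn.close()
```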

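2. Redis-based storage

Windows support for Redis is poor, so development here is done on Linux. For a non-distributed spider, the only difference from the MySQL version is the pipelines file.

First start the Redis service, then connect to it. Note that the Python redis module used here is version 2.10.6; install that version with pip3 install redis==2.10.6.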

import redis
class FirstprojectPipeline(object):
    conn = None

    def open_spider(self, spider):
        print("start spider")
        self.conn = redis.Redis(host="127.0.0.1", port=6379)

    def process_item(self, item, spider):
        my_dict = {
            "author": item['author'],
            'content': item['content'],
        }
        print(my_dict)
        try:
            self.conn.rpush("my_data", my_dict)
        except Exception as e:
            print(e)
        return item

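Then run redis-cli in the terminal to start the client. Because rpush appends to a list, enter lrange my_data 0 -1 to check that the data was written. The crawl may occasionally stall; if so, wait a moment for the network to recover.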
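
The pipeline above hands a dict straight to rpush. As far as I can tell, version 2.10.6 of the redis module stores such a value as its str() form, while later releases reject dict values, which is probably why the version is pinned. A possible alternative, sketched here with the standard library only, is to serialize each record to JSON before pushing it; the sample dict is hypothetical and no Redis connection is opened.

```python
import json

my_dict = {"author": "alice", "content": "first joke"}

# str() of a dict uses single quotes, so it is not valid JSON
as_str = str(my_dict)
try:
    json.loads(as_str)
    str_round_trips = True
except json.JSONDecodeError:
    str_round_trips = False

# json.dumps yields a string that can be pushed and decoded again later;
# ensure_ascii=False keeps Chinese text as-is instead of \u escapes
payload = json.dumps(my_dict, ensure_ascii=False)
restored = json.loads(payload)

print(str_round_trips)
print(restored == my_dict)
# With a live connection this would be: conn.rpush("my_data", payload)
```

Storing JSON strings keeps the list readable from any client and does not depend on how a particular version of the redis module converts Python objects.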