Scraping Maoyan Comments for the Movie Flipped
Assignment requirements source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3159
The previously saved data can be read back with pandas:
newsdf = pd.read_csv(r'F:\duym\gzccnews.csv')
1. Save the scraped content to a SQLite3 database
import sqlite3
# write the DataFrame into a local SQLite database
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    newsdf.to_sql('gzccnews', con=db)
# read it back to verify the write
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    df2 = pd.read_sql_query('SELECT * FROM gzccnews', con=db)
Save to a MySQL database
import pandas as pd
import pymysql
from sqlalchemy import create_engine

# connection string: user/passwd/host/port are placeholders for the real credentials
conInfo = "mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8"
engine = create_engine(conInfo, encoding='utf-8')
df = pd.DataFrame(allnews)
df.to_sql(name='news', con=engine, if_exists='append', index=False)
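To confirm the append worked, the table can be read back through the same engine. A minimal sketch; the table name news matches the to_sql call above, and allnews is assumed to be the list of scraped records:

# read the rows back through the same SQLAlchemy engine to verify the write
df_check = pd.read_sql_query('SELECT * FROM news', con=engine)
print(df_check.shape)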
2. Comprehensive web-scraping assignment
- Choose a hot topic or a subject you are interested in.
- Choose the object and scope of the scrape.
- Understand the target's restrictions and constraints.
- Scrape the relevant content.
- Perform data analysis and text analysis.
- Write an article with an explanation, the technical points, the data, a graphical presentation and discussion of the data analysis, and a graphical presentation and discussion of the text analysis.
- Publish the article publicly.
My topic: I recently rewatched the movie Flipped, so I scraped its comments.
Scraping target: Maoyan, http://m.maoyan.com/movie/46818?_v_=yes&channelId=4&cityId=20&$from=canary#
What we fetch is the comment data from the Maoyan app, as shown in the figure:
Analysis of the app's traffic shows that the Maoyan comment API endpoint is: http://m.maoyan.com/review/v2/comments.json?movieId=46818&userId=-1&offset=0&limit=15&ts=0&type=3
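Before writing the full scraper, the endpoint can be probed once to inspect the JSON layout. A minimal sketch, reusing the mobile User-Agent from the script below; the ts parameter is a millisecond timestamp (ts=0 returns the newest page):

import time
import requests

api = ('http://m.maoyan.com/review/v2/comments.json'
       '?movieId=46818&userId=-1&offset=0&limit=15&ts={}&type=3')
ua = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'}
resp = requests.get(api.format(int(time.time()) * 1000), headers=ua, timeout=10)
data = resp.json()
# each page carries up to 15 comment objects, with fields such as
# content, nick, score, time, gender, userLevel, userId, upCount, replyCount
print(data['data']['comments'][0])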
Code implementation:
First define a function that fetches data from a given URL; each call returns only the 15 comments preceding the given timestamp (a minimal sketch follows).
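A minimal sketch of such a function; the full 2019/5/5 script below inlines this logic, and headers, cookies, and url are the objects defined there:

def get_page(ts):
    """Fetch one page of up to 15 comments ending at millisecond timestamp ts."""
    page_url = url.format(ts)  # url is the ts-parameterised endpoint defined below
    res = requests.get(page_url, headers=headers, cookies=cookies, timeout=10)
    res.encoding = 'utf-8'
    return res.json()['data']['comments']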
Re-scraped on 2019/5/5 with the following code:
from bs4 import BeautifulSoup
import requests
import warnings
import re
import json
import random
import time
import datetime
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
    'Connection': 'keep-alive'}
cookies = {'cookie': '_lxsdk_cuid=168c325f322c8-0156d0257eb33d-10326653-13c680-168c325f323c8; uuid_n_v=v1; iuuid=30E9F9E02A1911E9947B6716B6E91453A6754AA9248F40F39FBA1FD0A2AD9B42; webp=true; ci=191%2C%E5%8F%B0%E5%B7%9E; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; __mta=49658649.1549462270794.1549465778684.1549548206227.3; _lxsdk=30E9F9E02A1911E9947B6716B6E91453A6754AA9248F40F39FBA1FD0A2AD9B42; _lxsdk_s=168c898414e-035-f0e-e6%7C%7C463'}

# ts-parameterised endpoint; offset stays 0 while ts pages backward through time
url = 'http://m.maoyan.com/review/v2/comments.json?movieId=46818&userId=-1&offset=0&limit=15&ts={}&type=3'

comment = []
nick = []
score = []
comment_time = []
gender = []
userlevel = []
userid = []
upcount = []
replycount = []
ji = 1
url_time = int(time.time()) * 1000  # current time in milliseconds, hence * 1000

for i in range(100):
    url_range = url.format(url_time)
    res = requests.get(url_range, headers=headers, cookies=cookies, timeout=10)
    res.encoding = 'utf-8'
    print('Scraping page ' + str(ji))
    content = json.loads(res.text)
    comments = content['data']['comments']
    count = 0
    for item in comments:
        comment.append(item['content'])
        nick.append(item['nick'])
        score.append(item['score'])
        comment_time.append(datetime.datetime.fromtimestamp(int(item['time'] / 1000)))
        gender.append(item['gender'])
        userlevel.append(item['userLevel'])
        userid.append(item['userId'])
        upcount.append(item['upCount'])
        replycount.append(item['replyCount'])
        count = count + 1
        if count == 15:
            # advance the ts cursor to the oldest comment on this page,
            # so the next request fetches the 15 comments before it
            url_time = item['time']
    ji += 1
    time.sleep(random.random())  # random short pause to avoid triggering rate limits

print('Scraping finished')
print(url_time)
print(comment)
pd.DataFrame(comment).to_csv('评论内容.txt', encoding='utf-8')
# result = {'用户id': userid, '用户昵称': nick, '评分': score, '评论内容': comment}
# results = pd.DataFrame(result)
# results.info()
# results.to_excel('猫眼_怦然心动.xlsx')
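With 评论内容.txt saved, a first pass at the text analysis the assignment asks for might look like the following. A minimal sketch; the jieba segmenter is an assumption beyond the original script:

import pandas as pd
import jieba  # assumed installed: pip install jieba
from collections import Counter

df = pd.read_csv('评论内容.txt', encoding='utf-8', index_col=0)
text = ' '.join(df.iloc[:, 0].astype(str))
words = [w for w in jieba.lcut(text) if len(w) > 1]  # drop single-character tokens
print(Counter(words).most_common(20))  # top-20 words, e.g. for a word cloud or bar chart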
