Crawling Post Data from the Hupu PUBG Board
Assignment requirements: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3159
The data saved earlier can be read back with pandas:
import pandas as pd

newsdf = pd.read_csv(r'E:\大三用的软件\PyCharm Community Edition 2018.3.5\homework\gzccnews.csv')
1. Save the crawled content to an SQLite3 database
import sqlite3

# write the DataFrame into a local SQLite database
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    newsdf.to_sql('gzccnews', con=db)

# read it back to confirm the data was stored
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    df2 = pd.read_sql_query('SELECT * FROM gzccnews', con=db)
Saving to a MySQL database:
import pandas as pd
import pymysql
from sqlalchemy import create_engine

# connection string: user/passwd/host/port are placeholders for the real credentials
conInfo = "mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8"
engine = create_engine(conInfo, encoding='utf-8')
df = pd.DataFrame(allnews)
df.to_sql(name='news', con=engine, if_exists='append', index=False)
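As a quick check (my own addition, not part of the original listing), the table can be read back through the same engine:

# read the rows back through the same SQLAlchemy engine to verify the write
news_check = pd.read_sql('SELECT * FROM news', con=engine)
print(news_check.shape)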
2. Comprehensive crawler project
- Choose a hot topic or a theme that interests you.
- Decide what to crawl and the scope of the crawl.
- Understand the restrictions and constraints of the crawl target (see the robots.txt sketch below).
- Crawl the corresponding content.
- Do data analysis and text analysis on it.
- Write it up as an article with an explanation, the technical key points, the data, graphical presentation and discussion of the data analysis, and graphical presentation and discussion of the text analysis.
- Publish the article publicly.
My topic of interest: a lot of people in my dorm have gotten into Hupu, so I just picked a topic from there to crawl. PUBG (绝地求生) seemed pretty popular, so I decided to crawl posts from its board.
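One of the requirements above is to understand the crawl target's restrictions and constraints. A minimal sketch of doing that for Hupu (my own addition; it assumes the site serves a robots.txt at the standard path) is to check robots.txt before crawling:

import urllib.robotparser

# fetch and parse the forum's robots.txt (assumes the standard /robots.txt path)
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://bbs.hupu.com/robots.txt')
rp.read()

# check whether a generic crawler may fetch the PUBG board
print(rp.can_fetch('*', 'https://bbs.hupu.com/pubg'))

On top of that, spacing requests out (a time.sleep variant is sketched in the reply-crawling step below) keeps the load on the forum low.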
First, import the packages the crawler needs:
from bs4 import BeautifulSoup
import requests
import chardet
import numpy as np
The function that creates the crawler (it fetches a page and returns a parsed BeautifulSoup object):
def creat_bs(url):
    result = requests.get(url)
    # set the encoding of the response to the webpage's detected encoding
    e = chardet.detect(result.content)['encoding']
    result.encoding = e
    c = result.content
    soup = BeautifulSoup(c, 'lxml')
    return soup
Next, build a function that assembles the set of page URLs to fetch:
def build_urls(prefix, suffix):
    urls = []
    for item in suffix:
        url = prefix + item
        urls.append(url)
    return urls
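For example (an illustration of my own, not part of the original run), the prefix and suffixes are simply concatenated:

demo_urls = build_urls('https://bbs.hupu.com/pubg', ['', '-2', '-3'])
# -> ['https://bbs.hupu.com/pubg', 'https://bbs.hupu.com/pubg-2', 'https://bbs.hupu.com/pubg-3']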
Now write the scraping functions:
def find_title_link(soup):
    titles = []
    links = []
    try:
        contanier = soup.find('div', {'class': 'container_padd'})
        ajaxtable = contanier.find('form', {'id': 'ajaxtable'})
        page_list = ajaxtable.find_all('li')
        for page in page_list:
            titlelink = page.find('a', {'class': 'truetit'})
            # fall back to the <b> tag when the link text is empty
            if not titlelink.text:
                title = titlelink.find('b').text
            else:
                title = titlelink.text
            # keep a random ~10% sample of posts to keep the crawl light
            if np.random.uniform(0, 1) > 0.90:
                link = titlelink.get('href')
                titles.append(title)
                links.append(link)
    except:
        print('have no value')
    return titles, links


def find_reply(soup):
    replys = []
    try:
        details = soup.find('div', {'class': 'hp-wrap details'})
        form = details.find('form')
        floors = form.find_all('div', {'class': 'floor'})
        for floor in floors:
            table = floor.find('table', {'class': 'case'})
            # the floor with id 'tpc' is the original post, so skip it
            if floor.get('id') != 'tpc':
                if table.find('p') is not None:
                    reply = table.find('p').text
                else:
                    reply = table.find('td').text
                replys.append(reply)
            else:
                continue
    except:
        return None
    return replys
With the functions built, the crawl can begin. First, create the set of list-page URLs; this time I chose the first 30 pages of the board to crawl:
url = 'https://bbs.hupu.com/pubg'
page_suffix = ['', '-2', '-3', '-4', '-5', '-6', '-7', '-8', '-9', '-10',
               '-11', '-12', '-13', '-14', '-15', '-16', '-17', '-18', '-19', '-20',
               '-21', '-22', '-23', '-24', '-25', '-26', '-27', '-28', '-29', '-30']
urls = build_urls(url, page_suffix)
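The suffix list could also be generated rather than typed out (an equivalent alternative, not the code I actually ran):

page_suffix = [''] + ['-{}'.format(i) for i in range(2, 31)]  # '', '-2', ..., '-30'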
Crawl the titles and their links:
title_group = []
link_group = []
for url in urls:
    soup = creat_bs(url)
    titles, links = find_title_link(soup)
    for title in titles:
        title_group.append(title)
    for link in links:
        link_group.append(link)
Next, crawl the first page of replies for each post:
reply_urls = build_urls('https://bbs.hupu.com', link_group)
reply_group = []
for url in reply_urls:
    soup = creat_bs(url)
    replys = find_reply(soup)
    if replys is not None:
        for reply in replys:
            reply_group.append(reply)
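Since this loop makes one request per sampled post, a slightly politer variant (my own variation, not the code I actually ran) would pause between requests:

import time

reply_group = []
for url in reply_urls:
    soup = creat_bs(url)
    replys = find_reply(soup)
    if replys is not None:
        reply_group.extend(replys)
    time.sleep(1)  # wait one second between requests to reduce load on the forum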
After the crawl finishes, combine all the data and save it:
wordlist = str()
for title in title_group:
    wordlist += title
for reply in reply_group:
    wordlist += reply

def savetxt(wordlist):
    f = open('wordlist.txt', 'wb')
    f.write(wordlist.encode('utf8'))
    f.close()

savetxt(wordlist)
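Besides the flat text file, the titles and replies could also be written out in structured form with pandas, the same way the news data was handled in the SQLite/MySQL section above (a sketch; the file and column names are my own choice):

import pandas as pd

# save titles and replies separately, since the two lists have different lengths
pd.DataFrame({'title': title_group}).to_csv('hupu_pubg_titles.csv', index=False, encoding='utf-8-sig')
pd.DataFrame({'reply': reply_group}).to_csv('hupu_pubg_replies.csv', index=False, encoding='utf-8-sig')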
Generating the word cloud:
import jieba
jieba.load_userdict('user_dict.txt')
wordlist_af_jieba = jieba.cut_for_search(wordlist)
wl_space_split = ' '.join(wordlist_af_jieba)

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# add custom stopwords (read the file as UTF-8 text)
stopwords = set(STOPWORDS)
fstop = open('stopwords.txt', 'r', encoding='utf-8')
for eachWord in fstop:
    stopwords.add(eachWord.strip())
fstop.close()

wc = WordCloud(font_path=r'C:\Windows\Fonts\STHUPO.ttf',
               background_color='black', max_words=200, width=700, height=1000,
               stopwords=stopwords, max_font_size=100, random_state=30)
wc.generate(wl_space_split)
wc.to_file('hupu_pubg2.png')
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
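The word cloud gives a qualitative picture; for a more quantitative view, the segmented words can also be counted and plotted (an extra analysis step I am sketching here, reusing the wordlist and stopword set from above):

from collections import Counter

# count segmented words, dropping single characters and stopwords
words = [w for w in jieba.cut(wordlist) if len(w) > 1 and w not in stopwords]
top20 = Counter(words).most_common(20)

# bar chart of the 20 most frequent words
plt.rcParams['font.sans-serif'] = ['SimHei']  # assumes a Chinese font such as SimHei is installed
plt.figure(figsize=(10, 6))
plt.bar([w for w, c in top20], [c for w, c in top20])
plt.xticks(rotation=60)
plt.title('Top 20 words in Hupu PUBG posts')
plt.tight_layout()
plt.show()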
The user_dict.txt I set up myself:
A partial screenshot of the crawled post replies:
The word cloud:
