Assignment requirements source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3159


The data saved earlier can be read back with pandas:

import pandas as pd
newsdf = pd.read_csv(r'E:\大三用的软件\PyCharm Community Edition 2018.3.5\homework\gzccnews.csv')

I. Save the scraped content to a SQLite3 database

import sqlite3

# Write the DataFrame into a local SQLite database
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    newsdf.to_sql('gzccnews', con=db)

# Read the table back out to verify
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    df2 = pd.read_sql_query('SELECT * FROM gzccnews', con=db)
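
As a quick sanity check (a minimal sketch, assuming the two blocks above have already been run), the data read back from SQLite can be inspected directly:

print(df2.shape)   # rows and columns read back from gzccnews
print(df2.head())  # first few news records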

[Image 1]

 

Save to a MySQL database

import pandas as pd
import pymysql
from sqlalchemy import create_engine

# Connection string: user, password, host and port are placeholders
conInfo = "mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8"
engine = create_engine(conInfo, encoding='utf-8')
# allnews is the list of news records scraped earlier
df = pd.DataFrame(allnews)
df.to_sql(name='news', con=engine, if_exists='append', index=False)
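
To verify the MySQL write, a minimal read-back sketch through the same engine (assuming the table name 'news' used above):

df_check = pd.read_sql_query('SELECT * FROM news', con=engine)
print(df_check.shape)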

[Image 2]

 

II. Comprehensive crawler project

  1. Pick a trending topic or one you are interested in.
  2. Choose the target and scope of the crawl.
  3. Understand the target's restrictions and constraints.
  4. Scrape the relevant content.
  5. Perform data analysis and text analysis.
  6. Write an article with an explanation, the technical points, the data, a graphical presentation of the data analysis with commentary, and a graphical presentation of the text analysis with commentary.
  7. Publish the article publicly.

My topic of interest: quite a few people in my dorm like browsing Hupu, so I picked a topic from there to scrape. PUBG ("chi ji") seemed pretty popular, so I decided to scrape its board's posts.

[Image 3]

First, import the packages the crawler needs:

from bs4 import BeautifulSoup
import requests
import chardet
import numpy as np

The function that fetches a page and returns its parsed soup:

def creat_bs(url):
    result = requests.get(url)
    # Detect the page's encoding and apply it to the response
    e = chardet.detect(result.content)['encoding']
    result.encoding = e
    c = result.content
    soup = BeautifulSoup(c, 'lxml')
    return soup
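
A quick usage sketch for creat_bs (assuming the Hupu PUBG board is reachable):

soup = creat_bs('https://bbs.hupu.com/pubg')
print(soup.title.text)  # board page title, just to confirm the request and parsing worked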

Next, a helper that builds the collection of page URLs to fetch:

def build_urls(prefix, suffix):
    urls = []
    for item in suffix:
        url = prefix + item
        urls.append(url)
    return urls
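
For example (made-up suffixes, just to show the result):

print(build_urls('https://bbs.hupu.com/pubg', ['', '-2', '-3']))
# ['https://bbs.hupu.com/pubg', 'https://bbs.hupu.com/pubg-2', 'https://bbs.hupu.com/pubg-3']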

Now the scraping functions:

def find_title_link(soup):
    titles = []
    links = []
    try:
        container = soup.find('div', {'class': 'container_padd'})
        ajaxtable = container.find('form', {'id': 'ajaxtable'})
        page_list = ajaxtable.find_all('li')
        for page in page_list:
            titlelink = page.find('a', {'class': 'truetit'})
            # Some titles are wrapped in a <b> tag instead of plain text
            if not titlelink.text:
                title = titlelink.find('b').text
            else:
                title = titlelink.text
            # Randomly keep roughly 10% of the threads on each page
            if np.random.uniform(0, 1) > 0.90:
                link = titlelink.get('href')
                titles.append(title)
                links.append(link)
    except:
        print('have no value')
    return titles, links
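
A minimal usage sketch for find_title_link on a single listing page (because of the random sampling, the titles returned will differ between runs):

soup = creat_bs('https://bbs.hupu.com/pubg')
titles, links = find_title_link(soup)
for t, l in zip(titles, links):
    print(t, l)   # sampled thread titles and their relative hrefs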

def find_reply(soup):
    replys = []
    try:
        details = soup.find('div', {'class': 'hp-wrap details'})
        form = details.find('form')
        floors = form.find_all('div', {'class': 'floor'})
        for floor in floors:
            table = floor.find('table', {'class': 'case'})
            # The floor whose id is 'tpc' is the original post, so skip it
            if floor.get('id') != 'tpc':
                if table.find('p') != None:
                    reply = table.find('p').text
                else:
                    reply = table.find('td').text
                replys.append(reply)
            elif floor.get('id') == 'tpc':
                continue
    except:
        return None
    return replys
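
And a sketch for find_reply, assuming `links` was returned by find_title_link above, so links[0] is a relative thread path:

if links:
    thread_soup = creat_bs('https://bbs.hupu.com' + links[0])
    replys = find_reply(thread_soup)
    print(len(replys) if replys else 0, 'replies on the first page')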

With the functions built, we can start scraping. First, create the set of listing-page URLs; this time I chose the first 30 pages of the board:

url = 'https://bbs.hupu.com/pubg'
# '' is the first page, '-2' through '-30' are the following pages
page_suffix = [''] + ['-' + str(i) for i in range(2, 31)]
urls = build_urls(url, page_suffix)

Scrape the titles and links:

title_group=[]
link_group=[]
for url in urls:
    soup = creat_bs(url)
    titles, links = find_title_link(soup)
    for title in titles:
        title_group.append(title)
    for link in links:
        link_group.append(link)
        

Next, scrape the first page of replies for each thread:

reply_urls=build_urls('https://bbs.hupu.com',link_group)
reply_group=[]
for url in reply_urls:
    soup=creat_bs(url)
    replys=find_reply(soup)
    if replys!=None:
        for reply in replys:
            reply_group.append(reply)
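
At this point the raw data could also be kept in tabular form with pandas, the same way as in part I (a sketch, with made-up file names):

import pandas as pd
pd.DataFrame({'title': title_group, 'link': link_group}).to_csv('hupu_pubg_titles.csv', index=False)
pd.DataFrame({'reply': reply_group}).to_csv('hupu_pubg_replies.csv', index=False)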

Once scraping is finished, combine all the text and save it:

wordlist=str()
for title in title_group:
    wordlist+=title

for reply in reply_group:
    wordlist+=reply

def savetxt(wordlist):
    f=open('wordlist.txt','wb')
    f.write(wordlist.encode('utf8'))
    f.close()
savetxt(wordlist)
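
To confirm the file was written correctly, it can be read straight back (a minimal check):

with open('wordlist.txt', 'rb') as f:
    print(len(f.read().decode('utf8')), 'characters saved')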

Making the word cloud:

import jieba
jieba.load_userdict('user_dict.txt')
wordlist_af_jieba=jieba.cut_for_search(wordlist)
wl_space_split=' '.join(wordlist_af_jieba)
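
cut_for_search splits the text into search-engine-style tokens, with longer words also broken into their shorter components. For example (a made-up sentence; the exact splits depend on the dictionaries loaded, including user_dict.txt):

print(' '.join(jieba.cut_for_search('宿舍一起玩绝地求生')))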

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stopwords = set(STOPWORDS)
# Load custom Chinese stopwords; open in text mode as utf-8 (no per-line decode needed in Python 3)
fstop = open('stopwords.txt', 'r', encoding='utf-8')
for eachWord in fstop:
    stopwords.add(eachWord.strip())
fstop.close()

wc = WordCloud(font_path=r'C:\Windows\Fonts\STHUPO.ttf', background_color='black',
               max_words=200, width=700, height=1000, stopwords=stopwords,
               max_font_size=100, random_state=30)
wc.generate(wl_space_split)
wc.to_file('hupu_pubg2.png')
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()  # display the figure when running as a script

My custom user_dict.txt:

[Image 4]

A partial screenshot of the scraped replies:

[Image 5]

The word cloud:

[Image 6]

 
