Python Course Project

Shared by Byrx.net on 2020-12-14 04:12:53, Comments (179)


Scraping Douban movie reviews with Python and generating a word cloud

1. Fetching the web page data

The first step is to request the web page. In Python this is done with the urllib library. The code is as follows:

from urllib import request

# Download the "now playing" page for Hangzhou and decode it as UTF-8
resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
html_data = resp.read().decode('utf-8')
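Some sites reject urllib's default User-Agent, and Douban may be one of them. That is an assumption, not something the original code handles. The sketch below shows one way to send a browser-like header with the same urllib library. The names build_request, fetch_html, and DEFAULT_UA are hypothetical helpers introduced only for illustration.

```python
from urllib import request

# Hypothetical User-Agent string; any browser-like value could be used instead.
DEFAULT_UA = "Mozilla/5.0 (compatible; course-project-crawler)"


def build_request(url, user_agent=DEFAULT_UA):
    """Wrap the URL in a Request object that carries a User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10, encoding="utf-8"):
    """Download the page and return its decoded text."""
    with request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode(encoding)
```

Calling fetch_html('https://movie.douban.com/nowplaying/hangzhou/') would then stand in for the two lines that call urlopen() and read() directly.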

The second step is to parse the HTML we downloaded and extract the data we need.

In Python, the BeautifulSoup library is used to parse HTML.

BeautifulSoup is called in the following form:

BeautifulSoup(html,"html.parser")

The first argument is the HTML to extract data from, and the second specifies the parser. After that, find_all() is used to read the contents of HTML tags.

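from bs4 import BeautifulSoup as bs

soup = bs(html_data, 'html.parser')
# The <div id="nowplaying"> block holds the movies currently showing
nowplaying_movie = soup.find_all('div', id='nowplaying')
# Each movie is an <li class="list-item"> inside that block
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')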

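The page's HTML shows that the data-subject attribute of each list item holds the movie's id, while the alt attribute of the img tag holds the movie's name, so we use these two attributes to get each movie's id and name. (Note: the id is needed later to open the movie's short-review page, which is why we parse it.) The code is as follows: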

nowplaying_list = []
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    # The movie id is stored on the <li> element itself
    nowplaying_dict['id'] = item['data-subject']
    # The movie name is the alt text of the poster image
    for tag_img_item in item.find_all('img'):
        nowplaying_dict['name'] = tag_img_item['alt']
        nowplaying_list.append(nowplaying_dict)
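To make the assumed page structure concrete, here is a small self-contained sketch that runs on made-up markup. Two things are assumptions: SAMPLE_HTML is hypothetical and only mirrors the attributes described above, and the parsing uses the standard library's html.parser.HTMLParser instead of BeautifulSoup so that it runs without installing anything. NowPlayingParser is a name invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, modelled on the attributes described above.
SAMPLE_HTML = """
<div id="nowplaying">
  <ul>
    <li class="list-item" data-subject="1000001">
      <img src="poster1.jpg" alt="Example Movie A">
    </li>
    <li class="list-item" data-subject="1000002">
      <img src="poster2.jpg" alt="Example Movie B">
    </li>
  </ul>
</div>
"""


class NowPlayingParser(HTMLParser):
    """Collect {'id', 'name'} dicts from <li class="list-item"> elements."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._current = None  # dict for the <li> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "list-item" in classes:
            # The movie id lives on the <li> itself
            self._current = {"id": attrs.get("data-subject")}
        elif tag == "img" and self._current is not None:
            # Keep the alt text of the first image inside the <li>
            self._current.setdefault("name", attrs.get("alt"))

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.movies.append(self._current)
            self._current = None


parser = NowPlayingParser()
parser.feed(SAMPLE_HTML)
movies = parser.movies
print(movies)
```

The resulting list has the same shape as nowplaying_list: one dict per movie with an 'id' and a 'name' key.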

2. Data cleaning

To make the data easier to clean, we concatenate the items in the comment list into a single string. The code is as follows:

comments = ''
for k in range(len(eachCommentList)):
    # Convert each comment to a string and trim surrounding whitespace
    comments = comments + (str(eachCommentList[k])).strip()
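The concatenation step and the regular-expression filter from the full listing can be combined into one small function. This is a minimal sketch: clean_comments and CHINESE_RUN are hypothetical names, and the sample comments are invented. The pattern itself is the one the full source uses to keep only Chinese characters.

```python
import re

# Matches runs of Chinese characters, as in the full source listing
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def clean_comments(comment_list):
    """Join the comments into one string and keep only Chinese characters."""
    joined = ''.join(str(c).strip() for c in comment_list)
    return ''.join(CHINESE_RUN.findall(joined))


# Invented sample comments containing punctuation, Latin text, and digits
sample = ['  剧情很好，演员也不错！ ', 'Great film 值得一看 10/10']
cleaned = clean_comments(sample)
print(cleaned)
```

Using ''.join() avoids building a new string on every loop iteration, which matters little here but scales better for long comment lists.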

3. Displaying the result as a word cloud

The code is as follows:

import matplotlib
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline
from wordcloud import WordCloud  # word cloud package

matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)

# Set the font, the background color, and the maximum font size
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)

# Map each of the 1000 most frequent words to its count
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
# Pass the word -> count dict directly, as the full listing does
wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)
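The word cloud only needs a mapping from each word to its count. The original builds it with pandas. As a swapped-in alternative, the sketch below produces the same kind of dict with collections.Counter from the standard library. word_frequencies is a hypothetical helper and the token list is invented.

```python
from collections import Counter


def word_frequencies(tokens, stopwords=(), top_n=1000):
    """Count tokens, skip stop words, and keep the top_n most frequent."""
    stop = set(stopwords)
    counts = Counter(t for t in tokens if t and t not in stop)
    return dict(counts.most_common(top_n))


# Invented tokens, standing in for the output of a word segmenter
tokens = ['电影', '好看', '的', '电影', '剧情', '的', '好看', '电影']
freqs = word_frequencies(tokens, stopwords=['的'])
print(freqs)
```

A dict like freqs should be usable wherever word_frequence is passed to fit_words(), assuming a wordcloud version that accepts a dict there.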




Full source code:

# -*- coding: utf-8 -*-
import warnings
warnings.filterwarnings("ignore")
import jieba  # Chinese word segmentation package
import numpy  # numerical computing package
import codecs  # codecs.open() lets you specify a file's encoding; text is converted to Unicode on read
import re
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from urllib import request
from bs4 import BeautifulSoup as bs
from wordcloud import WordCloud, ImageColorGenerator  # word cloud package
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)


# Parse the "now playing" page
def getNowPlayingMovie_list():
    resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
            nowplaying_list.append(nowplaying_dict)
    return nowplaying_list


# Scrape one page of short reviews for a movie
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return False
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_lits = soup.find_all('div', class_='comment')
    for item in comment_div_lits:
        if item.find_all('p')[0].string is not None:
            eachCommentList.append(item.find_all('p')[0].string)
    return eachCommentList


def main():
    # Fetch the first 10 pages of reviews for the first movie
    commentList = []
    NowPlayingMovie_list = getNowPlayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.append(commentList_temp)
    # Convert the data in the list to a single string
    comments = ''
    for k in range(len(commentList)):
        comments = comments + (str(commentList[k])).strip()
    # Keep only Chinese characters, which removes punctuation
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)
    # Segment the Chinese text into words with jieba
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})
    # Remove stop words
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'],
                            encoding='utf-8')  # quoting=3 means no quoting
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    # Count word frequencies (groupby().size() replaces the dict-style agg(),
    # which newer pandas versions reject)
    words_stat = words_df.groupby('segment').size().reset_index(name='count')
    words_stat = words_stat.sort_values(by=['count'], ascending=False)
    #  print(words_stat.head())
    bg_pic = numpy.array(Image.open("alice_mask.png"))
    # Display the result as a word cloud
    wordcloud = WordCloud(
        font_path="simhei.ttf",
        background_color="white",
        max_font_size=80,
        width=2000,
        height=1800,
        mask=bg_pic,
        mode="RGBA"
    )
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    # print(word_frequence)
    """
    word_frequence_list = []
    for key in word_frequence:
        temp = (key, word_frequence[key])
        word_frequence_list.append(temp)
        #print(word_frequence_list)
    """
    wordcloud = wordcloud.fit_words(word_frequence)
    image_colors = ImageColorGenerator(bg_pic)  # derive word colors from the image (not applied below)
    plt.imshow(wordcloud)  # show the word cloud image
    plt.axis("off")
    plt.show()
    wordcloud.to_file('show_Chinese.png')  # save the word cloud to a file


if __name__ == '__main__':
    main()
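The paging arithmetic in getCommentsById can be isolated and checked without touching the network. In this sketch, comment_page_url and COMMENTS_PER_PAGE are hypothetical names, and the movie id in the usage lines is made up. The URL layout and the page size of 20 come from the listing above. Raising ValueError for an invalid page number, instead of returning False, is a design choice of this sketch rather than the original behavior.

```python
from urllib.parse import urlencode

COMMENTS_PER_PAGE = 20  # page size used by the original listing


def comment_page_url(movie_id, page_num, per_page=COMMENTS_PER_PAGE):
    """Build the short-review URL for a 1-based page number."""
    if page_num < 1:
        raise ValueError("page_num must be 1 or greater")
    # Page 1 starts at offset 0, page 2 at offset 20, and so on
    query = urlencode({"start": (page_num - 1) * per_page, "limit": per_page})
    return f"https://movie.douban.com/subject/{movie_id}/comments?{query}"


# Made-up movie id, for illustration only
first_page = comment_page_url("1000001", 1)
third_page = comment_page_url("1000001", 3)
print(first_page)
print(third_page)
```

An exception makes a bad page number fail loudly at the call site, whereas a False return value would only surface later when the caller tries to use it as a list.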
