null - 程序员宅基地

Python小白逆袭大神七日打卡营飞桨paddlepaddle_path = 'work/' 'pics/' name '/-程序员宅基地

技术标签： paddlepaddle Python入门AI爬虫 python markdown 百度七日打卡营

这里写自定义目录标题

Python小白逆袭大神七日打卡营全纪录

day1-Python基础

作业一：输出 9*9 乘法口诀表(注意格式)
注意：提交作业时要有代码执行输出结果。

def table():
    #在这里写下您的乘法口诀表代码吧！
    for i in range(1,10): #大循环9次
        str_row = ""#每一行的字符串 9行
        for j in range(1,i+1):
            str_row += "{0}*{1}={2}".format(j,i,i*j)+"  "
        print(str_row)


if __name__ == '__main__':
    table()

11=1
12=2 22=4
13=3 23=6 33=9
14=4 24=8 34=12 44=16
15=5 25=10 35=15 45=20 55=25
16=6 26=12 36=18 46=24 56=30 66=36
17=7 27=14 37=21 47=28 57=35 67=42 77=49
18=8 28=16 38=24 48=32 58=40 68=48 78=56 88=64
19=9 29=18 39=27 49=36 59=45 69=54 79=63 89=72 9*9=81

作业二：查找特定名称文件
遍历”Day1-homework”目录下文件；
找到文件名包含“2020”的文件；
将文件名保存到数组result中；
按照序号、文件名分行打印输出。
注意：提交作业时要有代码执行输出结果。

#导入OS模块
import os
#待搜索的目录路径
path = "Day1-homework"
#待搜索的名称
filename = "2020"
#定义保存结果的数组
result = []

def findfiles():
    #在这里写下您的查找文件代码吧！
    i = 1    #要求文件的序号
    for dirpath,dirnames,sub_filenames in os.walk(path):
        #对文件有”2020“进行删选
        for sub_filename in sub_filenames:
            str_sub_filename = str(sub_filename)
            if(str_sub_filename.find(filename,0,len(str_sub_filename))!=-1):
                result.append(sub_filename)    
                #将指定文件加入result
                print('{}, \''.format(i)+dirpath+sub_filename+'\'')
                i = i+1    #序号递增

    

if __name__ == '__main__':
    findfiles()

1, ‘Day1-homework/4/2204:22:2020.txt’
2, ‘Day1-homework/26/26new2020.txt’
3, ‘Day1-homework/18182020.doc’

day2-《青春有你2》选手信息爬取

度学习一般过程:
收集数据，尤其是有标签、高质量的数据是一件昂贵的工作。
爬虫的过程，就是模仿浏览器的行为，往目标站点发送请求，接收服务器的响应数据，提取需要的信息，并进行保存的过程。
Python为爬虫的实现提供了工具:requests模块、BeautifulSoup库

任务描述

本次实践使用Python来爬取百度百科中《青春有你2》所有参赛选手的信息。
数据获取：https://baike.baidu.com/item/青春有你第二季
上网的全过程:
普通用户:
打开浏览器 --> 往目标站点发送请求 --> 接收响应数据 --> 渲染到页面上。
爬虫程序:
模拟浏览器 --> 往目标站点发送请求 --> 接收响应数据 --> 提取有用的数据 --> 保存到本地/数据库。
爬虫的过程：
1.发送请求（requests模块）
2.获取响应数据（服务器返回）
3.解析并提取数据（BeautifulSoup查找或者re正则）
4.保存数据

本实践中将会使用以下两个模块，首先对这两个模块简单了解以下：
request模块：
requests是python实现的简单易用的HTTP库，官网地址：http://cn.python-requests.org/zh_CN/latest/
requests.get(url)可以发送一个http get请求，返回服务器响应内容。

BeautifulSoup库：
BeautifulSoup 是一个可以从HTML或XML文件中提取数据的Python库。网址：https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
BeautifulSoup支持Python标准库中的HTML解析器,还支持一些第三方的解析器,其中一个是 lxml。
BeautifulSoup(markup, “html.parser”)或者BeautifulSoup(markup, “lxml”)，推荐使用lxml作为解析器,因为效率更高。

！！！作业说明！！！
1.请在下方提示位置，补充代码，完成《青春有你2》选手图片爬取，将爬取图片进行保存，保证代码正常运行
2.打印爬取的所有图片的绝对路径，以及爬取的图片总数，此部分已经给出代码。请在提交前，一定要保证有打印结果

一、爬取百度百科中《青春有你2》中所有参赛选手信息，返回页面数据

import json
import re
import requests
import datetime
from bs4 import BeautifulSoup
import os

#获取当天的日期,并进行格式化,用于后面文件命名，格式:20200420
today = datetime.date.today().strftime('%Y%m%d')    

def crawl_wiki_data():
    """
    爬取百度百科中《青春有你2》中参赛选手信息，返回html
    """
    headers = {
     
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    url='https://baike.baidu.com/item/青春有你第二季'                         

    try:
        response = requests.get(url,headers=headers)
        print(response.status_code)

        #将一段文档传入BeautifulSoup的构造方法,就能得到一个文档的对象, 可以传入一段字符串
        soup = BeautifulSoup(response.text,'lxml')
        
        #返回的是class为table-view log-set-param的<table>所有标签
        tables = soup.find_all('table',{
    'class':'table-view log-set-param'})

        crawl_table_title = "参赛学员"

        for table in  tables:           
            #对当前节点前面的标签和字符串进行查找
            table_titles = table.find_previous('div').find_all('h3')
            for title in table_titles:
                if(crawl_table_title in title):
                    return table       
    except Exception as e:
        print(e)

二、对爬取的页面数据进行解析，并保存为JSON文件

def parse_wiki_data(table_html):
    '''
    从百度百科返回的html中解析得到选手信息，以当前日期作为文件名，存JSON文件,保存到work目录下
    '''
    bs = BeautifulSoup(str(table_html),'lxml')
    all_trs = bs.find_all('tr')

    error_list = ['\'','\"']

    stars = []

    for tr in all_trs[1:]:
         all_tds = tr.find_all('td')

         star = {
    }

         #姓名
         star["name"]=all_tds[0].text
         #个人百度百科链接
         star["link"]= 'https://baike.baidu.com' + all_tds[0].find('a').get('href')
         #籍贯
         star["zone"]=all_tds[1].text
         #星座
         star["constellation"]=all_tds[2].text
         #身高
         star["height"]=all_tds[3].text
         #体重
         star["weight"]= all_tds[4].text

         #花语,去除掉花语中的单引号或双引号
         flower_word = all_tds[5].text
         for c in flower_word:
             if  c in error_list:
                 flower_word=flower_word.replace(c,'')
         star["flower_word"]=flower_word 
         
         #公司
         if not all_tds[6].find('a') is  None:
             star["company"]= all_tds[6].find('a').text
         else:
             star["company"]= all_tds[6].text  

         stars.append(star)

    json_data = json.loads(str(stars).replace("\'","\""))   
    with open('work/' + today + '.json', 'w', encoding='UTF-8') as f:
        json.dump(json_data, f, ensure_ascii=False)

三、爬取每个选手的百度百科图片，并进行保存

def crawl_pic_urls():
    '''
    爬取每个选手的百度百科图片，并保存
    ''' 
    with open('work/'+ today + '.json', 'r', encoding='UTF-8') as file:
         json_array = json.loads(file.read())

    headers = {
     
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' 
     }

    for star in json_array:

        name = star['name']
        link = star['link']

        #！！！请在以下完成对每个选手图片的爬取，将所有图片url存储在一个列表pic_urls中！！！

        #向选手个人百度百科发送http get请求
        response = requests.get(link,headers=headers)

        #将一段文档传入beautifulsoup构造，得到一个文档对象
        bs = BeautifulSoup(response.text,'lxml')

        #从个人百度百科页解析链接，指向选手图片列表页
        pic_list_url = bs.select('.summary-pic a')[0].get('href')
        pic_list_url = 'https://baike.baidu.com' + pic_list_url

        #向选手图片列表页发送http get请求
        pic_list_response = requests.get(pic_list_url,headers=headers)

        #对选手图片列表页面发送http get请求
        bs = BeautifulSoup(pic_list_response.text,'lxml')
        pic_list_html = bs.select('.pic-list img ')

        pic_urls = []
        for pic_html in pic_list_html:
            pic_url = pic_html.get('src')
            pic_urls.append(pic_url)     
        


        #！！！根据图片链接列表pic_urls, 下载所有图片，保存在以name命名的文件夹中！！！
        down_pic(name,pic_urls)


def down_pic(name,pic_urls):
    '''
    根据图片链接列表pic_urls, 下载所有图片，保存在以name命名的文件夹中,
    '''
    path = 'work/'+'pics/'+name+'/'

    if not os.path.exists(path):
      os.makedirs(path)

    for i, pic_url in enumerate(pic_urls):
        try:
            pic = requests.get(pic_url, timeout=15)
            string = str(i + 1) + '.jpg'
            with open(path+string, 'wb') as f:
                f.write(pic.content)
                print('成功下载第%s张图片: %s' % (str(i + 1), str(pic_url)))
        except Exception as e:
            print('下载第%s张图片时失败: %s' % (str(i + 1), str(pic_url)))
            print(e)
            continue

四、打印爬取的所有图片的路径

def show_pic_path(path):
    '''
    遍历所爬取的每张图片，并打印所有图片的绝对路径
    '''
    pic_num = 0
    for (dirpath,dirnames,filenames) in os.walk(path):
        for filename in filenames:
           pic_num += 1
           print("第%d张照片：%s" % (pic_num,os.path.join(dirpath,filename)))           
    print("共爬取《青春有你2》选手的%d照片" % pic_num)

if __name__ == '__main__':

     #爬取百度百科中《青春有你2》中参赛选手信息，返回html
     html = crawl_wiki_data()

     #解析html,得到选手信息，保存为json文件
     parse_wiki_data(html)

     #从每个选手的百度百科页面上爬取图片,并保存
     crawl_pic_urls()

     #打印所爬取的选手图片路径
     show_pic_path('/home/aistudio/work/pics/')

     print("所有信息爬取完成！")

200
成功下载第1张图片: https://bkimg.cdn.bcebos.com/pic/faf2b2119313b07eca80d4dd909f862397dda0442687?x-bce-process=image/resize,m_lfit,h_160,limit_1
……
第481张照片：/home/aistudio/work/pics/申洁/1.jpg
第482张照片：/home/aistudio/work/pics/魏辰/1.jpg
共爬取《青春有你2》选手的482照片
所有信息爬取完成！

day3-《青春有你2》选手数据分析

任务描述：
基于第二天实践使用Python来爬去百度百科中《青春有你2》所有参赛选手的信息，进行数据可视化分析。

# 如果需要进行持久化安装, 需要使用持久化路径, 如下方代码示例:
#!mkdir /home/aistudio/external-libraries
#!pip install matplotlib -t /home/aistudio/external-libraries

# 同时添加如下代码, 这样每次环境(kernel)启动的时候只要运行下方代码即可:
# Also add the following code, so that every time the environment (kernel) starts, just run the following code:
import sys
sys.path.append('/home/aistudio/external-libraries')

# 下载中文字体
!wget https://mydueros.cdn.bcebos.com/font/simhei.ttf
# 将字体文件复制到matplotlib字体路径
!cp simhei.ttf /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/matplotlib/mpl-data/fonts/ttf/
# 一般只需要将字体文件复制到系统字体目录下即可，但是在aistudio上该路径没有写权限，所以此方法不能用
# !cp simhei.ttf /usr/share/fonts/

# 创建系统字体文件路径
!mkdir .fonts
# 复制文件到该路径
!cp simhei.ttf .fonts/
!rm -rf .cache/matplotlib

import matplotlib.pyplot as plt
import numpy as np 
import json
import matplotlib.font_manager as font_manager

#显示matplotlib生成的图形
%matplotlib inline

with open('data/data31557/20200422.json', 'r', encoding='UTF-8') as file:
         json_array = json.loads(file.read())

#绘制小姐姐区域分布柱状图,x轴为地区，y轴为该区域的小姐姐数量

zones = []
for star in json_array:
    zone = star['zone']
    zones.append(zone)
print(len(zones))
print(zones)


zone_list = []
count_list = []

for zone in zones:
    if zone not in zone_list:
        count = zones.count(zone)
        zone_list.append(zone)
        count_list.append(count)

print(zone_list)
print(count_list)

# 设置显示中文
plt.rcParams['font.sans-serif'] = ['SimHei'] # 指定默认字体

plt.figure(figsize=(20,15))

plt.bar(range(len(count_list)), count_list,color='r',tick_label=zone_list,facecolor='#9999ff',edgecolor='white')

# 这里是调节横坐标的倾斜度，rotation是度数，以及设置刻度字体大小
plt.xticks(rotation=45,fontsize=20)
plt.yticks(fontsize=20)

plt.legend()
plt.title('''《青春有你2》参赛选手''',fontsize = 24)
plt.savefig('/home/aistudio/work/result/bar_result.jpg')
plt.show()

在这里插入图片描述

import matplotlib.pyplot as plt
import numpy as np 
import json
import matplotlib.font_manager as font_manager
import pandas as pd

#显示matplotlib生成的图形
%matplotlib inline


df = pd.read_json('data/data31557/20200422.json')
#print(df)

grouped=df['name'].groupby(df['zone'])
s = grouped.count()

zone_list = s.index
count_list = s.values


# 设置显示中文
plt.rcParams['font.sans-serif'] = ['SimHei'] # 指定默认字体

plt.figure(figsize=(20,15))

plt.bar(range(len(count_list)), count_list,color='r',tick_label=zone_list,facecolor='#9999ff',edgecolor='white')

# 这里是调节横坐标的倾斜度，rotation是度数，以及设置刻度字体大小
plt.xticks(rotation=45,fontsize=20)
plt.yticks(fontsize=20)

plt.legend()
plt.title('''《青春有你2》参赛选手''',fontsize = 24)
plt.savefig('/home/aistudio/work/result/bar_result02.jpg')
plt.show()

在这里插入图片描述

import matplotlib.pyplot as plt 
import numpy as np 
import json
import matplotlib.font_manager as font_manager
#显示matplotlib图形
%matplotlib inline

with open('data/data31557/20200422.json','r',encoding='UTF-8') as file:
         json_array = json.loads(file.read())

#绘制小姐姐籍贯直方图，x为籍贯，y数量

weights = []
counts = []

for star in json_array:
    weight = float(star['weight'].replace('kg',''))
    weights.append(weight)
print(weights)

size_list = []
count_list = []

size1 = 0
size2 = 0
size3 = 0
size4 = 0

for weight in weights:
    if weight <=45:
        size1 += 1
    elif 45 < weight <= 50:
        size2 += 1
    elif 50 < weight <= 55:
        size3 += 1
    else:
        size4 += 1

labels = '<=45kg', '45~50kg', '50~55kg', '>55kg'

sizes = [size1, size2, size3, size4]
explode = (0.1, 0.1, 0, 0)

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode = explode, labels = labels, autopct = '%1.1f%%',
        shadow = True, startangle = 90)
ax1.axis('equal')
plt.savefig('/home/aistudio/work/result/pie_result01.jpg')
plt.show()

在这里插入图片描述

import matplotlib.pyplot as plt 
import numpy as np 
import json
import matplotlib.font_manager as font_manager
#显示matplotlib图形
%matplotlib inline

df = pd.read_json('data/data31557/20200422.json')
#print(df)

weights = df['weight']
arrs = weights.values

for i in range(len(arrs)):
    #print(float(arrs[i][0:-2]))
    arrs[i] = float(arrs[i][0:-2])
#print(arrs)

#pandas.cut分割一组数据成离散区间。
bin = [0,45,50,55,100]
se1 = pd.cut(arrs,bin)

#pandas的value_counts()函数可以计数Series里每个值并排序。
pd.value_counts(se1)

labels = '<=45kg', '45~50kg', '50~55kg', '>55kg'
sizes = pd.value_counts(se1)
explode = (0.1, 0.1, 0, 0)

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode = explode, labels = labels, autopct = '%1.1f%%',
        shadow = True, startangle = 90)
ax1.axis('equal')
plt.savefig('/home/aistudio/work/result/pie_result02.jpg')
plt.show()

在这里插入图片描述

day4-《青春有你2》选手识别

PaddleHub之《青春有你2》作业：五人识别
一、任务简介
图像分类是计算机视觉的重要领域，它的目标是将图像分类到预定义的标签。近期，许多研究者提出很多不同种类的神经网络，并且极大的提升了分类算法的性能。本文以自己创建的数据集：青春有你2中选手识别为例子，介绍如何使用PaddleHub进行图像分类任务。
在这里插入图片描述

#CPU环境启动请务必执行该指令
%set_env CPU_NUM=1

#安装paddlehub
!pip install paddlehub==1.6.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

!unzip -o file.zip -d ./dataset/

import paddlehub as hub

module = hub.Module(name="resnet_v2_50_imagenet")

from paddlehub.dataset.base_cv_dataset import BaseCVDataset
   
class DemoDataset(BaseCVDataset):	
   def __init__(self):	
       # 数据集存放位置
       
       self.dataset_dir = "/home/aistudio"
       super(DemoDataset, self).__init__(
           base_path=self.dataset_dir,
           train_list_file="dataset/train_list.txt",
           validate_list_file="dataset/validate_list.txt",
           test_list_file="dataset/test_list.txt",
           label_list_file="dataset/label_list.txt",
           )
dataset = DemoDataset()
print(dataset)

data_reader = hub.reader.ImageClassificationReader(
    image_width=module.get_expected_image_width(),
    image_height=module.get_expected_image_height(),
    images_mean=module.get_pretrained_images_mean(),
    images_std=module.get_pretrained_images_std(),
    dataset=dataset)

config = hub.RunConfig(
    use_cuda=True,                              #是否使用GPU训练，默认为False；
    num_epoch=3,                                #Fine-tune的轮数；
    checkpoint_dir="cv_finetune_turtorial_demo",#模型checkpoint保存路径, 若用户没有指定，程序会自动生成；
    batch_size=3,                              #训练的批大小，如果使用GPU，请根据实际情况调整batch_size；
    eval_interval=10,                           #模型评估的间隔，默认每100个step评估一次验证集；
    strategy=hub.finetune.strategy.DefaultFinetuneStrategy())  #Fine-tune优化策略；

input_dict, output_dict, program = module.context(trainable=True)
img = input_dict["image"]
feature_map = output_dict["feature_map"]
feed_list = [img.name]

task = hub.ImageClassifierTask(
    data_reader=data_reader,
    feed_list=feed_list,
    feature=feature_map,
    num_classes=dataset.num_labels,
    config=config)

run_states = task.finetune_and_eval()

import numpy as np
import matplotlib.pyplot as plt 
import matplotlib.image as mpimg

with open("dataset/test_list.txt","r") as f:
    filepath = f.readlines()

data = [filepath[0].split(" ")[0],filepath[1].split(" ")[0],filepath[2].split(" ")[0],filepath[3].split(" ")[0],filepath[4].split(" ")[0]]

label_map = dataset.label_dict()
index = 0
run_states = task.predict(data=data)
results = [run_state.run_results for run_state in run_states]

for batch_result in results:
    print(batch_result)
    batch_result = np.argmax(batch_result, axis=2)[0]
    print(batch_result)
    for result in batch_result:
        index += 1
        result = label_map[result]
        print("input %i is %s, and the predict result is %s" %
              (index, data[index - 1], result))

在这里插入图片描述

day5-综合大作业

第一步：爱奇艺《青春有你2》评论数据爬取(参考链接：https://www.iqiyi.com/v_19ryfkiv8w.html#curid=15068699100_9f9bab7e0d1e30c494622af777f4ba39)

爬取任意一期正片视频下评论
评论条数不少于1000条

第二步：词频统计并可视化展示

数据预处理：清理清洗评论中特殊字符（如：@#￥%、emoji表情符）,清洗后结果存储为txt文档
中文分词：添加新增词（如：青你、奥利给、冲鸭），去除停用词（如：哦、因此、不然、也好、但是）
统计top10高频词
可视化展示高频词

第三步：绘制词云

根据词频生成词云
可选项-添加背景图片，根据背景图片轮廓生成词云

第四步：结合PaddleHub，对评论进行内容审核

需要的配置和准备

中文分词需要jieba
词云绘制需要wordcloud
可视化展示中需要的中文字体
网上公开资源中找一个中文停用词表
根据分词结果自己制作新增词表
准备一张词云背景图（附加项，不做要求，可用hub抠图实现）
paddlehub配置

!pip install jieba
!pip install wordcloud

# Linux系统默认字体文件路径
# !ls /usr/share/fonts/
# 查看系统可用的ttf格式中文字体
!fc-list :lang=zh | grep ".ttf"

!wget https://mydueros.cdn.bcebos.com/font/simhei.ttf # 下载中文字体
# #创建字体目录fonts
!mkdir .fonts
# # 复制字体文件到该路径
!cp simhei.ttf .fonts/

#安装模型
!hub install porn_detection_lstm==1.1.0
!pip install --upgrade paddlehub

from __future__ import print_function
import requests
import json
import re #正则匹配
import time #时间处理模块
import jieba #中文分词
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
from PIL import Image
from wordcloud import WordCloud  #绘制词云模块
import paddlehub as hub

#请求爱奇艺评论接口，返回response信息
def getMoveinfo(url):
    '''
    请求爱奇艺评论接口，返回response信息
    参数  url: 评论的url
    :return: response信息
    '''
    session = requests.Session()
    headers = {
    
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
        "Accept": "application/json",
        "Referer": "http://m.iqiyi.com/v_19rqriflzg.html",
        "Origin": "http://m.iqiyi.com",
        "Host": "sns-comment.iqiyi.com",
        "Connection": "keep-alive",
        "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6",
        "Accept-Encoding": "gzip, deflate"
    }
    response = session.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None

#解析json数据，获取评论
    '''
    解析json数据，获取评论
    参数  lastId:最后一条评论ID  arr:存放文本的list
    :return: 新的lastId
    '''
def saveMovieInfoToFile(lastId, arr):
    url='https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&content_id=15068699100&page=&page_size=10&types=time&last_id='
    url+=str(lastId)
    responseTxt = getMoveinfo(url)
    responseJson=json.loads(responseTxt)
    comments=responseJson['data']['comments']
    for val in comments:
        # print(val.keys())
        if 'content' in val.keys():
            print(val['content'])
            arr.append(val['content'])
        lastId = str(val['id'])
    return lastId

#去除文本中特殊字符
def clear_special_char(content):
    '''
    正则处理特殊字符
    参数 content:原文本
    return: 清除后的文本
    '''
    comp = re.compile('[^A-Z^a-z^0-9^\u4e00-\u9fa5]')
    return comp.sub('', content)

text_zh = '$你好！我是个程%序^猿，标注!!#码农￥'
print(clear_special_char(text_zh))

def fenci(text):
    '''
    利用jieba进行分词
    参数 text:需要分词的句子或文本
    return：分词结果
    '''
    # 添加自定义字典 add_words.txt
    # jieba.load_userdict('') 
    jieba.load_userdict('add_words.txt')
    seg=jieba.lcut(text)
    return seg

def stopwordslist(file_path):
    '''
    创建停用词表
    参数 file_path:停用词文本路径
    return：停用词list
    '''
    # f= open(file_path, 'r') 
    # my_data = [i.strip('\n') for i in f]

    stopwords= [line.strip() for line in open(file_path,encoding='UTF-8').readline()]
    return stopwords

def movestopwords(sentence,stopwords,counts):
    '''
    去除停用词,统计词频
    参数 file_path:停用词文本路径 stopwords:停用词list counts: 词频统计结果
    return：None
    '''
    out = []
    for word in sentence:
        if word not in stopwords:
            if len(word) != 1:
                counts[word] = counts.get(word, 0) + 1
    return None

def drawcounts(counts, num):
    '''
    绘制词频统计表
    参数 counts: 词频统计结果 num:绘制topN
    return：none
    '''
    x_aixs=[]
    y_aixs=[]
    c_order=sorted(counts.items(), key=lambda x:x[1],reverse=True)
    for c in c_order[:num]:
        x_aixs.append(c[0])
        y_aixs.append(c[1])
    
    matplotlib.rcParams['font.sans-serif']=['SimHei']
    matplotlib.rcParams['axes.unicode_minus']=False
    plt.bar(x_aixs, y_aixs)
    plt.title('词频统计结果')
    plt.show()

def drawcloud(word_f):
    '''
    根据词频绘制词云图
    参数 word_f:统计出的词频结果
    return：none
    '''    
    cloud_mask=np.array(Image.open('cloud.jpg'))
    st=set(['东西', '这是'])
    wc=WordCloud(background_color='white',
    mask=cloud_mask,
    max_words=150,
    font_path='simhei.ttf',
    min_font_size=10,
    max_font_size=100,
    width=400,
    relative_scaling=0.3,
    stopwords=st)
    wc.fit_words(word_f)
    wc.to_file('pic.png')

def text_detection(text, file_path):
    '''
    使用hub对评论进行内容分析
    return：分析结果

    '''
    porn_detection_lstm=hub.Module(name='porn_detection_lstm')
    f=open('aqy.txt', 'r', encoding='utf-8')
    for line in f:
        if len(line.strip())==1:
            continue
        else:
            test_text.append(line)
    f.close()

    input_dict={
    'text':test_text}
    results=porn_detection_lstm.detection(data=input_dict,use_gpu=True,batch_size=1)
    for index, item in enumerate(results):
        if item['porn_detection_key'] =='porn':
            print(item['text'],':', item['porn_probs'])

#评论是多分页的，得多次请求爱奇艺的评论接口才能获取多页评论,有些评论含有表情、特殊字符之类的
#num 是页数，一页10条评论，假如爬取1000条评论，设置num=100
if __name__ == '__main__':
    num=110
    lastId='0'
    arr=[]
    with open('aqy.txt', 'a', encoding='utf-8') as f:
        for i in range(num):
            lastId=saveMovieInfoToFile(lastId, arr)
            time.sleep(0.5)
        for item in arr:
            item=clear_special_char(item)
            if item.strip()!='':
                try:
                    f.write(item+'\n')
                except  e:
                    print('含有特殊字符')
    print("共获取评论：", len(arr))
    f=open('aqy.txt', 'r', encoding='utf-8')
    counts={
    }
    for line in f:
        words=fenci(line)
        stopwords=stopwordslist(r'./stopwords/中文停用词表.txt')
        movestopwords(words, stopwords, counts)
    drawcounts(counts, 10)
    drawcloud(counts)
    f.close()

    file_path='aqy.txt'
    test_text=[]
    text_detection(test_text, file_path)

在这里插入图片描述

display(Image.open('pic.png')) #显示生成的词云图像

在这里插入图片描述

本文链接：https://blog.csdn.net/weixin_47278555/article/details/105827017

原作者删帖不实内容删帖广告或垃圾文章投诉

智能推荐

使用nginx解决浏览器跨域问题_nginx不停的xhr-程序员宅基地

文章浏览阅读1k次。通过使用ajax方法跨域请求是浏览器所不允许的，浏览器出于安全考虑是禁止的。警告信息如下：不过jQuery对跨域问题也有解决方案，使用jsonp的方式解决，方法如下：$.ajax({ async:false, url: 'http://www.mysite.com/demo.do', // 跨域URL ty..._nginx不停的xhr

在 Oracle 中配置 extproc 以访问 ST_Geometry-程序员宅基地

文章浏览阅读2k次。关于在 Oracle 中配置 extproc 以访问 ST_Geometry，也就是我们所说的使用空间SQL 的方法，官方文档链接如下。http://desktop.arcgis.com/zh-cn/arcmap/latest/manage-data/gdbs-in-oracle/configure-oracle-extproc.htm其实简单总结一下，主要就分为以下几个步骤。..._extproc

Linux C++ gbk转为utf-8_linux c++ gbk->utf8-程序员宅基地

文章浏览阅读1.5w次。linux下没有上面的两个函数，需要使用函数 mbstowcs和wcstombsmbstowcs将多字节编码转换为宽字节编码wcstombs将宽字节编码转换为多字节编码这两个函数，转换过程中受到系统编码类型的影响，需要通过设置来设定转换前和转换后的编码类型。通过函数setlocale进行系统编码的设置。linux下输入命名locale -a查看系统支持的编码_linux c++ gbk->utf8

IMP-00009: 导出文件异常结束-程序员宅基地

文章浏览阅读750次。今天准备从生产库向测试库进行数据导入，结果在imp导入的时候遇到“ IMP-00009:导出文件异常结束” 错误，google一下，发现可能有如下原因导致imp的数据太大，没有写buffer和commit两个数据库字符集不同从低版本exp的dmp文件，向高版本imp导出的dmp文件出错传输dmp文件时，文件损坏解决办法：imp时指定..._imp-00009导出文件异常结束

python程序员需要深入掌握的技能_Python用数据说明程序员需要掌握的技能-程序员宅基地

文章浏览阅读143次。当下是一个大数据的时代，各个行业都离不开数据的支持。因此，网络爬虫就应运而生。网络爬虫当下最为火热的是Python，Python开发爬虫相对简单，而且功能库相当完善，力压众多开发语言。本次教程我们爬取前程无忧的招聘信息来分析Python程序员需要掌握那些编程技术。首先在谷歌浏览器打开前程无忧的首页，按F12打开浏览器的开发者工具。浏览器开发者工具是用于捕捉网站的请求信息，通过分析请求信息可以了解请..._初级python程序员能力要求

Spring @Service生成bean名称的规则（当类的名字是以两个或以上的大写字母开头的话，bean的名字会与类名保持一致）_@service beanname-程序员宅基地

文章浏览阅读7.6k次，点赞2次，收藏6次。@Service标注的bean，类名：ABDemoService查看源码后发现，原来是经过一个特殊处理：当类的名字是以两个或以上的大写字母开头的话，bean的名字会与类名保持一致public class AnnotationBeanNameGenerator implements BeanNameGenerator { private static final String C..._@service beanname

随便推点

二叉树的各种创建方法_二叉树的建立-程序员宅基地

文章浏览阅读6.9w次，点赞73次，收藏463次。1.前序创建#include<stdio.h>#include<string.h>#include<stdlib.h>#include<malloc.h>#include<iostream>#include<stack>#include<queue>using namespace std;typed_二叉树的建立

解决asp.net导出excel时中文文件名乱码_asp.net utf8 导出中文字符乱码-程序员宅基地

文章浏览阅读7.1k次。在Asp.net上使用Excel导出功能，如果文件名出现中文，便会以乱码视之。解决方法： fileName = HttpUtility.UrlEncode(fileName, System.Text.Encoding.UTF8);_asp.net utf8 导出中文字符乱码

笔记-编译原理-实验一-词法分析器设计_对pl/0作以下修改扩充。增加单词-程序员宅基地

文章浏览阅读2.1k次，点赞4次，收藏23次。第一次实验词法分析实验报告设计思想词法分析的主要任务是根据文法的词汇表以及对应约定的编码进行一定的识别，找出文件中所有的合法的单词，并给出一定的信息作为最后的结果，用于后续语法分析程序的使用；本实验针对 PL/0 语言的文法、词汇表编写一个词法分析程序，对于每个单词根据词汇表输出： (单词种类, 单词的值) 二元对。词汇表：种别编码单词符号助记符0beginb..._对pl/0作以下修改扩充。增加单词

android adb shell 权限,android adb shell权限被拒绝-程序员宅基地

文章浏览阅读773次。我在使用adb.exe时遇到了麻烦.我想使用与bash相同的adb.exe shell提示符,所以我决定更改默认的bash二进制文件(当然二进制文件是交叉编译的,一切都很完美)更改bash二进制文件遵循以下顺序> adb remount> adb push bash / system / bin /> adb shell> cd / system / bin> chm..._adb shell mv 权限

投影仪-相机标定_相机-投影仪标定-程序员宅基地

文章浏览阅读6.8k次，点赞12次，收藏125次。1. 单目相机标定引言相机标定已经研究多年，标定的算法可以分为基于摄影测量的标定和自标定。其中，应用最为广泛的还是张正友标定法。这是一种简单灵活、高鲁棒性、低成本的相机标定算法。仅需要一台相机和一块平面标定板构建相机标定系统，在标定过程中，相机拍摄多个角度下（至少两个角度，推荐10~20个角度）的标定板图像（相机和标定板都可以移动），即可对相机的内外参数进行标定。下面介绍张氏标定法（以下也这么称呼）的原理。原理相机模型和单应矩阵相机标定，就是对相机的内外参数进行计算的过程，从而得到物体到图像的投影_相机-投影仪标定

Wayland架构、渲染、硬件支持-程序员宅基地

文章浏览阅读2.2k次。文章目录Wayland 架构Wayland 渲染Wayland的硬件支持简述：　翻译一篇关于和 wayland 有关的技术文章, 其英文标题为Wayland Architecture .Wayland 架构若是想要更好的理解 Wayland 架构及其与 X (X11 or X Window System) 结构；一种很好的方法是将事件从输入设备就开始跟踪, 查看期间所有的屏幕上出现的变化。这就是我们现在对 X 的理解。内核是从一个输入设备中获取一个事件，并通过 evdev 输入_wayland

Python小白逆袭大神七日打卡营飞桨paddlepaddle_path = 'work/' 'pics/' name '/-程序员宅基地

这里写自定义目录标题

Python小白逆袭大神七日打卡营全纪录

day1-Python基础

day2-《青春有你2》选手信息爬取

day3-《青春有你2》选手数据分析

day4-《青春有你2》选手识别

day5-综合大作业

智能推荐

使用nginx解决浏览器跨域问题_nginx不停的xhr-程序员宅基地

在 Oracle 中配置 extproc 以访问 ST_Geometry-程序员宅基地

Linux C++ gbk转为utf-8_linux c++ gbk->utf8-程序员宅基地

IMP-00009: 导出文件异常结束-程序员宅基地

python程序员需要深入掌握的技能_Python用数据说明程序员需要掌握的技能-程序员宅基地

Spring @Service生成bean名称的规则（当类的名字是以两个或以上的大写字母开头的话，bean的名字会与类名保持一致）_@service beanname-程序员宅基地

随便推点

二叉树的各种创建方法_二叉树的建立-程序员宅基地

解决asp.net导出excel时中文文件名乱码_asp.net utf8 导出中文字符乱码-程序员宅基地

笔记-编译原理-实验一-词法分析器设计_对pl/0作以下修改扩充。增加单词-程序员宅基地

android adb shell 权限,android adb shell权限被拒绝-程序员宅基地

投影仪-相机标定_相机-投影仪标定-程序员宅基地

Wayland架构、渲染、硬件支持-程序员宅基地

推荐文章

热门文章

相关标签