Implementing Sensitive-Word Filtering in Python

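Some words simply should not appear in certain settings, and the usual fix is to mask them with *, for example 尼玛 -> **. Profanity and politically sensitive terms have no place in public forums, so we need some way to screen them out. Below I introduce a few simple approaches to masking sensitive words.

(I have tried to turn the profanity into images wherever possible, otherwise this article could not be published.)

Method 1: replace filtering

replace is the simplest form of string substitution: when a string may contain a sensitive word, we call the string's replace method to substitute * for it.

Drawback:

It works fine when the text is short and the word list is small, but every word needs its own pass over the text, so it becomes inefficient as either grows.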

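import datetime

now = datetime.datetime.now()
sentence = "this is a badword"  # placeholder text; the original sample is not included here
filter_sentence = sentence.replace("badword", "*")  # "badword" is a placeholder term
print(filter_sentence, " | ", now)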

For several sensitive words, put them in a list and replace them one by one:

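dirty = ["badword", "curse"]  # placeholder word list; the original sample is not included here
speak = "a badword and a curse"  # placeholder text
for i in dirty:
    speak = speak.replace(i, '*')
print(speak, " | ", now)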
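The loop above swaps each word for a single *, whereas the ** example in the introduction keeps one mask character per original character. A minimal sketch of that length-preserving variant follows; `mask_words` and the sample words are hypothetical names chosen for illustration, not part of the original article.

```python
def mask_words(text, words, mask_char="*"):
    """Replace every occurrence of each word with a same-length run of mask_char."""
    # Longer words go first, so a short word cannot break up a longer one containing it.
    for word in sorted(set(words), key=len, reverse=True):
        if word:  # replacing an empty string would insert masks between all characters
            text = text.replace(word, mask_char * len(word))
    return text


# Placeholder input: "badword" contains "bad", which is why ordering matters.
print(mask_words("you are a badword, really bad", ["bad", "badword"]))
```

Each word still costs one full pass over the text, so the drawback noted above is unchanged.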

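Method 2: regular-expression filtering

Regular expressions are a decent matching tool. They come up in almost every day-to-day search task, web crawlers included. Here we mainly rely on "|", which matches any one of several alternative target strings. A simple example: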

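import re

def sentence_filter(keywords, text):
    # re.escape keeps characters such as "." or "*" inside a keyword literal
    return re.sub("|".join(re.escape(k) for k in keywords), "***", text)

dirty = ["badword", "curse"]  # placeholder word list
speak = "a badword and a curse"  # placeholder text
print(sentence_filter(dirty, speak))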
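When the same word list is applied to many texts, it may help to compile the pattern once and to mask each match with its own length rather than a fixed "***". The sketch below assumes the keywords are literal strings; `build_pattern` and `regex_filter` are hypothetical helper names, and the sample words are placeholders.

```python
import re


def build_pattern(keywords):
    """Compile one alternation pattern from literal keywords."""
    # re.escape keeps characters such as '.' literal. Sorting longest-first makes
    # the alternation prefer 'badword' over 'bad' when both start at the same spot.
    ordered = sorted({k for k in keywords if k}, key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in ordered))


def regex_filter(pattern, text, mask_char="*"):
    """Mask each match with as many mask characters as the match is long."""
    return pattern.sub(lambda m: mask_char * len(m.group(0)), text)


pattern = build_pattern(["bad", "badword", "a.b"])
print(regex_filter(pattern, "a badword, a.b, axb"))
```

Without the escaping step, the keyword "a.b" would also match "axb", because "." is a wildcard in a regular expression.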

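Method 3: DFA filtering

DFA stands for Deterministic Finite Automaton. The basic idea is to look up sensitive words through state transitions, so that a single scan of the text is enough to check it against every sensitive word. (See the code comments for the implementation.)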
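Before the full implementation, the state-transition idea can be shown in a few lines. This is a simplified sketch and not the author's class: `build_states`, `match_at` and `mask` are hypothetical names, the words are placeholders, and special-character skipping is left out. Like the implementation below, it restarts the walk at every text position.

```python
END = "is_end"  # marker key; same flag name as in the implementation below


def build_states(words):
    """Build the nested-dict state table: each character leads to the next state."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True  # reaching this state means a whole word has been read
    return root


def match_at(root, text, start):
    """Return the length of the shortest word starting at text[start], or 0."""
    node = root
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            return 0  # no transition for this character
        if node.get(END):
            return i - start + 1
    return 0


def mask(root, text, mask_char="*"):
    """Mask every matched word, moving past each match once it is found."""
    out, i = [], 0
    while i < len(text):
        n = match_at(root, text, i)
        out.append(mask_char * n if n else text[i])
        i += n or 1
    return "".join(out)


states = build_states(["bad", "ugly"])  # placeholder words
print(states)
print(mask(states, "a bad, ugly day"))
```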

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Time: 2020/4/15 11:40
# @Software: PyCharm
# article_add: https://www.cnblogs.com/JentZhang/p/12718092.html
__author__ = "JentZhang"
import json

MinMatchType = 1  # shortest-match rule
MaxMatchType = 2  # longest-match rule


class DFAUtils(object):
    """
    DFA algorithm
    """

    def __init__(self, word_warehouse):
        """
        Initialise the algorithm
        :param word_warehouse: word library
        """
        # Word library
        self.root = dict()
        # Meaningless characters to skip during detection (ideally maintained in one dedicated place, such as a database or other storage)
        self.skip_root = [' ', '&', '!', '！', '@', '#', '$', '￥', '*', '^', '%', '?', '？', '<', '>', "《", '》']
        # Initialise the word library
        for word in word_warehouse:
            self.add_word(word)

    def add_word(self, word):
        """
        Add a word to the library
        :param word: the word to add
        :return:
        """
        now_node = self.root
        word_count = len(word)
        for i in range(word_count):
            char_str = word[i]
            if char_str in now_node:
                # The key already exists: step into it for the next iteration
                now_node = now_node[char_str]
            else:
                # Otherwise build a new dict node
                new_node = dict()
                new_node['is_end'] = False
                now_node[char_str] = new_node
                now_node = new_node
            if i == word_count - 1:
                # Last character: mark the end of a word here, without clearing
                # an end flag that a shorter word set earlier on the same path
                now_node['is_end'] = True

    def check_match_word(self, txt, begin_index, match_type=MinMatchType):
        """
        Check whether a library word starts at the given position
        :param txt: the text to check
        :param begin_index: index in txt at which matching starts
        :param match_type: matching rule, 1: shortest match, 2: longest match
        :return: the length of the match (skipped characters included), or 0 if there is none
        """
        now_map = self.root
        tmp_flag = 0  # characters consumed so far, skipped special characters included
        matched_length = 0  # value of tmp_flag at the last complete word

        for i in range(begin_index, len(txt)):
            word = txt[i]

            # Skip special characters, but only once the first character of a word has matched
            if word in self.skip_root and now_map is not self.root:
                tmp_flag += 1
                continue

            # Follow the transition for this character
            now_map = now_map.get(word)
            if now_map:
                tmp_flag += 1
                if now_map.get("is_end"):
                    # A complete word ends here
                    matched_length = tmp_flag
                    # Shortest match returns at once; longest match keeps looking
                    if match_type == MinMatchType:
                        break
            else:  # no transition: stop
                break

        if matched_length < 2:  # a match must span at least two characters
            matched_length = 0
        return matched_length

    def get_match_word(self, txt, match_type=MinMatchType):
        """
        Collect the matched words
        :param txt: the text to check
        :param match_type: matching rule, 1: shortest match, 2: longest match
        :return: the words in the text that match the library
        """
        matched_word_list = list()
        # Every start position is tried, so overlapping matches are all reported
        for i in range(len(txt)):
            length = self.check_match_word(txt, i, match_type)
            if length > 0:
                word = txt[i:i + length]
                matched_word_list.append(word)
        return matched_word_list

    def is_contain(self, txt, match_type=MinMatchType):
        """
        Check whether the text contains any sensitive word
        :param txt: the text to check
        :param match_type: matching rule, 1: shortest match, 2: longest match
        :return: True if a sensitive word is found, otherwise False
        """
        for i in range(len(txt)):
            if self.check_match_word(txt, i, match_type) > 0:
                return True  # one hit is enough, no need to scan the rest
        return False

    def replace_match_word(self, txt, replace_char='*', match_type=MinMatchType):
        """
        Replace matched words
        :param txt: the text to check
        :param replace_char: replacement character; each character of a matched word is replaced, e.g. text "你是大王八" with sensitive word "王八" and replacement * gives "你是大**"
        :param match_type: matching rule, 1: shortest match, 2: longest match
        :return: the text with sensitive words replaced
        """
        word_set = self.get_match_word(txt, match_type)
        # Each matched word is replaced wherever it occurs; with no matches the text is returned unchanged
        for word in word_set:
            txt = txt.replace(word, len(word) * replace_char)
        return txt


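if __name__ == '__main__':
    word_warehouse = ["badword", "curse"]  # placeholder library; the original sample is not included here
    dfa = DFAUtils(word_warehouse=word_warehouse)
    print('Library structure:', json.dumps(dfa.root, ensure_ascii=False))
    # Text to check (placeholder)
    msg = "a bad*word and a curse"
    print('Contains a match:', dfa.is_contain(msg))
    print('Matched words:', dfa.get_match_word(msg))
    print('After replacement:', dfa.replace_match_word(msg))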

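Method 4: AC automaton

The AC (Aho-Corasick) automaton needs some background first: the Trie. In brief, a Trie, also called a prefix tree or dictionary tree, is a structure for fast string processing that lets you look up information about a set of strings quickly.

For details, see: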

https://www.luogu.com.cn/blog/juruohyfhaha/trie-xue-xi-zong-jie
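As a quick illustration of the Trie described above, here is a minimal sketch with insertion, whole-word lookup and prefix lookup. The class and method names are hypothetical and the sample words are placeholders.

```python
class Trie:
    """One node of a prefix tree; the root node represents the empty prefix."""

    def __init__(self):
        self.children = {}    # character -> child node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, prefix):
        """Follow prefix from this node; return the node reached, or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for w in ["he", "her", "she"]:
    trie.insert(w)
print(trie.contains("he"), trie.contains("h"), trie.starts_with("sh"))
```

Words that share a prefix share nodes, so a lookup costs time proportional to the length of the query rather than to the number of stored words.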

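An AC automaton adds a fail pointer to each node of the Trie. When matching fails at the current node, the search moves to the node the fail pointer refers to, so it can keep matching straight through the text without backtracking.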
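To make the fail pointer concrete, the sketch below computes it for a tiny word set, naming every Trie node by the prefix it represents ("" is the root). The fail pointer of a node targets the longest proper suffix of its prefix that is also a node in the Trie. `build_fail_links` is a hypothetical helper written for this illustration only.

```python
from collections import deque


def build_fail_links(words):
    """Map every trie node (named by its prefix) to the node its fail pointer targets."""
    # children[prefix] maps a character to the child prefix; "" is the root.
    children = {"": {}}
    for word in words:
        prefix = ""
        for ch in word:
            nxt = prefix + ch
            if ch not in children[prefix]:
                children[prefix][ch] = nxt
                children[nxt] = {}
            prefix = nxt

    fail = {"": ""}
    queue = deque([""])  # breadth-first: shallower nodes are finished first
    while queue:
        node = queue.popleft()
        for ch, child in children[node].items():
            # Walk the parent's fail chain until a node has an edge labelled ch.
            f = fail[node]
            while f and ch not in children[f]:
                f = fail[f]
            if node and ch in children[f]:
                fail[child] = children[f][ch]
            else:
                fail[child] = ""  # depth-1 nodes and dead ends fall back to the root
            queue.append(child)
    return fail


links = build_fail_links(["she", "he", "hers"])
print(links["sh"], links["she"], links["hers"])
```

For this word set, the node "she" fails to "he": after reading "she", the automaton has also just read "he", so it can continue from there instead of rescanning the text.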

I won't go into the detailed matching mechanism here; for more on the AC automaton, see this article:

https://blog.csdn.net/bestsort/article/details/82947639

In Python, the ahocorasick module (installed as pyahocorasick) gives a quick implementation:

# python3 -m pip install pyahocorasick
import ahocorasick

def build_actree(wordlist):
    actree = ahocorasick.Automaton()
    for index, word in enumerate(wordlist):
        actree.add_word(word, (index, word))
    actree.make_automaton()
    return actree

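if __name__ == '__main__':
    wordlist = ["badword", "curse"]  # placeholder word list; the original sample is not included here
    sent = "a badword and a curse"  # placeholder text
    actree = build_actree(wordlist=wordlist)
    sent_cp = sent
    # iter() yields (end_index, value); value is the (index, word) tuple stored by add_word
    for i in actree.iter(sent):
        sent_cp = sent_cp.replace(i[1][1], "**")
        print("Masked word:", i[1][1])
    print("Masked result:", sent_cp)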

Of course, we can also write an AC automaton by hand; for reference:

class TrieNode(object):
    __slots__ = ['value', 'next', 'fail', 'emit']

    def __init__(self, value):
        self.value = value
        self.next = dict()
        self.fail = None
        self.emit = None


class AhoCorasic(object):
    __slots__ = ['_root']

    def __init__(self, words):
        self._root = AhoCorasic._build_trie(words)

    @staticmethod
    def _build_trie(words):
        assert isinstance(words, list) and words
        root = TrieNode('root')
        for word in words:
            node = root
            for c in word:
                if c not in node.next:
                    node.next[c] = TrieNode(c)
                node = node.next[c]
            if not node.emit:
                node.emit = {word}
            else:
                node.emit.add(word)
        queue = []
        queue.insert(0, (root, None))
        while len(queue) > 0:
            node_parent = queue.pop()
            curr, parent = node_parent[0], node_parent[1]
            for sub in curr.next.values():  # Python 3: dict.itervalues() no longer exists
                queue.insert(0, (sub, curr))
            if parent is None:
                continue
            elif parent is root:
                curr.fail = root
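            else:
                fail = parent.fail
                while fail and curr.value not in fail.next:
                    fail = fail.fail
                if fail:
                    curr.fail = fail.next[curr.value]
                else:
                    curr.fail = root
                # Inherit the words that end at the fail target, so a word that is a
                # suffix of the current path (e.g. "bc" inside "abc") is still reported
                if curr.fail.emit:
                    curr.emit = (curr.emit or set()) | curr.fail.emit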
        return root

    def search(self, s):
        seq_list = []
        node = self._root
        for i, c in enumerate(s):
            matched = True
            while c not in node.next:
                if not node.fail:
                    matched = False
                    node = self._root
                    break
                node = node.fail
            if not matched:
                continue
            node = node.next[c]
            if node.emit:
                for _ in node.emit:
                    from_index = i + 1 - len(_)
                    match_info = (from_index, _)
                    seq_list.append(match_info)
                node = self._root
        return seq_list


if __name__ == '__main__':
    aho = AhoCorasic(['foo', 'bar'])
    print(aho.search('barfoothefoobarman'))
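The search method above returns (from_index, word) pairs rather than a filtered string. One way to turn such pairs into masked text is sketched below; `mask_spans` is a hypothetical helper, and the sample matches are written out by hand in the shape that search returns, so the block runs on its own.

```python
def mask_spans(text, matches, mask_char="*"):
    """Mask text given (start_index, word) pairs such as those returned by search()."""
    chars = list(text)
    for start, word in matches:
        # Clamp to the text length so a bad span cannot raise IndexError.
        for i in range(start, min(start + len(word), len(chars))):
            chars[i] = mask_char
    return "".join(chars)


# Hand-written sample input in the (from_index, word) format.
matches = [(0, "bar"), (3, "foo"), (9, "foo"), (12, "bar")]
print(mask_spans("barfoothefoobarman", matches))
```

Masking by position leaves untouched any other occurrence of the same characters that the automaton did not report, which a blanket str.replace cannot guarantee.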

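Those are four ways to implement sensitive-word filtering in Python. The first two are simple; the last two lean on algorithms, so understand how each algorithm works first and the code becomes easy to follow. (DFA is a commonly used filtering technique, so it is worth learning.)

Finally, here is a sensitive-word list: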

https://github.com/qloog/sensitive_words




Original article: https://blog.csdn.net/tongtongjing1765/article/details/105963611
