Java Sensitive Word Filtering Algorithms

2017-2-7 13:41 | Posted by: zhangjf | Views: 479 | Comments: 0

1. DFA algorithm

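For the theory behind the DFA algorithm, see here. In short, nested Maps are used to build a tree of sensitive words. Every path from the root to a node flagged as a word end (in the simplest case, a leaf) spells out one sensitive word, for example:

[Figure: example sensitive-word tree]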

A simple implementation looks like this:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class TextFilterUtil {
        // Logger
        private static final Logger LOG = LoggerFactory.getLogger(TextFilterUtil.class);
        // Sensitive-word tree: Character keys lead to child maps, the "isEnd" key flags a word end
        private static Map<Object, Object> sensitiveWordMap = null;
        // Encoding of the word-list file
        private static final String ENCODING = "gbk";
        // Classpath location of the word-list file
        private static final String KEY_WORDS_PATH = "sensitive/keyWords.txt";

        /**
         * Initializes the sensitive-word tree.
         */
        private static void init() {
            // Read the word list
            Set<String> keyWords = readSensitiveWords();
            // Build the tree
            sensitiveWordMap = new HashMap<>(keyWords.size());
            for (String keyWord : keyWords) {
                createKeyWord(keyWord);
            }
        }

        /**
         * Adds one sensitive word to the tree.
         *
         * @param keyWord the word to add
         */
        @SuppressWarnings("unchecked")
        private static void createKeyWord(String keyWord) {
            if (sensitiveWordMap == null) {
                LOG.error("sensitiveWordMap has not been initialized!");
                return;
            }
            Map<Object, Object> nowMap = sensitiveWordMap;
            for (char c : keyWord.toCharArray()) {
                Object obj = nowMap.get(c);
                if (obj == null) {
                    // No node for this character yet: create one and descend into it
                    Map<Object, Object> childMap = new HashMap<>();
                    childMap.put("isEnd", "false");
                    nowMap.put(c, childMap);
                    nowMap = childMap;
                } else {
                    nowMap = (Map<Object, Object>) obj;
                }
            }
            // The node reached by the last character ends a word
            nowMap.put("isEnd", "true");
        }

        /**
         * Reads the sensitive-word file, one word per line.
         *
         * @return the set of words (empty if the file is missing or unreadable)
         */
        private static Set<String> readSensitiveWords() {
            Set<String> keyWords = new HashSet<>();
            // getResourceAsStream returns null when the resource does not exist
            InputStream in = TextFilterUtil.class.getClassLoader().getResourceAsStream(KEY_WORDS_PATH);
            if (in == null) {
                LOG.error("Sensitive-word file does not exist: {}", KEY_WORDS_PATH);
                return keyWords;
            }
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, ENCODING))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String word = line.trim();
                    // Skip blank lines; an empty word would flag the root as a word end
                    if (!word.isEmpty()) {
                        keyWords.add(word);
                    }
                }
            } catch (UnsupportedEncodingException e) {
                LOG.error("Failed to decode the sensitive-word file!", e);
            } catch (IOException e) {
                LOG.error("Failed to read the sensitive-word file!", e);
            }
            return keyWords;
        }

        /**
         * Finds the sensitive words contained in a text.
         *
         * @param text the text to check
         * @return every sensitive word found, in order of start position
         */
        @SuppressWarnings("unchecked")
        public static List<String> checkSensitiveWord(String text) {
            if (sensitiveWordMap == null) {
                init();
            }
            List<String> sensitiveWords = new ArrayList<>();
            for (int i = 0; i < text.length(); i++) {
                Map<Object, Object> nowMap = sensitiveWordMap;
                // Walk down the tree from position i for as long as the characters match
                for (int j = i; j < text.length(); j++) {
                    Object obj = nowMap.get(text.charAt(j));
                    if (obj == null) {
                        break;
                    }
                    nowMap = (Map<Object, Object>) obj;
                    // Check the flag after each step so a word ending at the end of the text is not missed
                    if ("true".equals(nowMap.get("isEnd"))) {
                        sensitiveWords.add(text.substring(i, j + 1));
                    }
                }
            }
            return sensitiveWords;
        }
    }
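The tree-of-Maps idea can also be written with a small typed node class, which avoids mixing Character keys and the "isEnd" string key in one map. The following is only a minimal sketch under assumptions: the words come from an in-memory list rather than keyWords.txt, and the names DfaSketch, Node, addWord and findAll are hypothetical, not part of the original code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch class, not the article's TextFilterUtil
final class DfaSketch {
    /** One tree node: outgoing edges keyed by character, plus a word-end flag. */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    private final Node root = new Node();

    /** Adds one word, creating the nodes along its path as needed. */
    void addWord(String word) {
        if (word.isEmpty()) {
            return; // an empty word would flag the root itself
        }
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isEnd = true;
    }

    /** Returns every stored word that occurs in text, in order of start position. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.children.get(text.charAt(j));
                if (node == null) {
                    break; // no stored word continues with this character
                }
                if (node.isEnd) {
                    hits.add(text.substring(i, j + 1));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        DfaSketch dfa = new DfaSketch();
        dfa.addWord("bad");
        dfa.addWord("badge");
        dfa.addWord("evil");
        System.out.println(dfa.findAll("a badge for evil")); // prints [bad, badge, evil]
    }
}
```

Note that "bad" ends at an inner node on the path to "badge", which is why the word-end flag is needed in addition to the tree shape.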

2. TTMP algorithm

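The TTMP algorithm was originally devised by a member of the online community; for its origin, see here. Its idea is to break each sensitive word down into a sequence of "dirty characters": a candidate substring is looked up in the word list only if it consists entirely of dirty characters, which reduces the number of comparisons. A simple implementation looks like this: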

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class TextFilterUtil {
        // Logger
        private static final Logger LOG = LoggerFactory.getLogger(TextFilterUtil.class);
        // Encoding of the word-list file
        private static final String ENCODING = "gbk";
        // Classpath location of the word-list file
        private static final String KEY_WORDS_PATH = "sensitive/keyWords.txt";
        // Dirty-character set: every character that occurs in some sensitive word
        private static Set<Character> sensitiveCharSet = null;
        // Sensitive-word set
        private static Set<String> sensitiveWordSet = null;

        /**
         * Initializes the word set and the dirty-character set.
         */
        private static void init() {
            // Create the containers
            sensitiveCharSet = new HashSet<>();
            sensitiveWordSet = new HashSet<>();
            // Read the file and fill both sets
            readSensitiveWords();
        }

        /**
         * Reads the local sensitive-word file, one word per line.
         */
        private static void readSensitiveWords() {
            // getResourceAsStream returns null when the resource does not exist
            InputStream in = TextFilterUtil.class.getClassLoader().getResourceAsStream(KEY_WORDS_PATH);
            if (in == null) {
                LOG.error("Sensitive-word file does not exist: {}", KEY_WORDS_PATH);
                return;
            }
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, ENCODING))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String word = line.trim();
                    if (word.isEmpty()) {
                        continue; // skip blank lines
                    }
                    sensitiveWordSet.add(word);
                    for (char c : word.toCharArray()) {
                        sensitiveCharSet.add(c);
                    }
                }
            } catch (UnsupportedEncodingException e) {
                LOG.error("Failed to decode the sensitive-word file!", e);
            } catch (IOException e) {
                LOG.error("Failed to read the sensitive-word file!", e);
            }
        }

        /**
         * Finds the sensitive words contained in a text.
         *
         * @param text the text to check
         * @return every sensitive word found, in order of start position
         */
        public static List<String> checkSensitiveWord(String text) {
            if (sensitiveWordSet == null || sensitiveCharSet == null) {
                init();
            }
            List<String> sensitiveWords = new ArrayList<>();
            for (int i = 0; i < text.length(); i++) {
                // A word can only start at a dirty character
                if (!sensitiveCharSet.contains(text.charAt(i))) {
                    continue;
                }
                for (int j = i; j < text.length(); j++) {
                    // Stop at the first clean character: test the character at j, not the one at i
                    if (!sensitiveCharSet.contains(text.charAt(j))) {
                        break;
                    }
                    String key = text.substring(i, j + 1);
                    if (sensitiveWordSet.contains(key)) {
                        sensitiveWords.add(key);
                    }
                }
            }
            return sensitiveWords;
        }
    }
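To see the dirty-character pre-filter in isolation, here is a minimal sketch of the same scan without file I/O or logging. It assumes the word list is passed in directly; the names DirtyCharSketch and findAll are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch class, not the article's TextFilterUtil
final class DirtyCharSketch {
    private final Set<String> words = new HashSet<>();
    private final Set<Character> dirtyChars = new HashSet<>();

    DirtyCharSketch(List<String> wordList) {
        for (String w : wordList) {
            if (w.isEmpty()) {
                continue;
            }
            words.add(w);
            // Every character of every word counts as "dirty"
            for (char c : w.toCharArray()) {
                dirtyChars.add(c);
            }
        }
    }

    /** Returns every stored word that occurs in text, in order of start position. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i; j < text.length(); j++) {
                // A run is extended only while it consists of dirty characters
                if (!dirtyChars.contains(text.charAt(j))) {
                    break;
                }
                String key = text.substring(i, j + 1);
                if (words.contains(key)) {
                    hits.add(key);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        DirtyCharSketch filter = new DirtyCharSketch(Arrays.asList("bad", "evil"));
        System.out.println(filter.findAll("a bad devil")); // prints [bad, evil]
    }
}
```

In the sample text, the space characters are not dirty, so each run is cut off there and no substring spanning a space is ever looked up in the word set.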

Note: the code above is only meant to illustrate the ideas; in real use there is still plenty of room for optimization.
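As one possible optimization of the dirty-character scan (an assumption on my part, not something the original post spells out), the inner loop can stop once the candidate is longer than the longest sensitive word, so a long run of dirty characters no longer produces ever-longer substrings. The names BoundedScanSketch, maxLen and findAll below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch class illustrating a length cap on candidates
final class BoundedScanSketch {
    private final Set<String> words = new HashSet<>();
    private final Set<Character> dirtyChars = new HashSet<>();
    private int maxLen = 0; // length of the longest stored word

    BoundedScanSketch(List<String> wordList) {
        for (String w : wordList) {
            if (w.isEmpty()) {
                continue;
            }
            words.add(w);
            maxLen = Math.max(maxLen, w.length());
            for (char c : w.toCharArray()) {
                dirtyChars.add(c);
            }
        }
    }

    /** Same result as the unbounded scan, but at most maxLen lookups per start position. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            // No stored word is longer than maxLen, so longer candidates cannot match
            int limit = Math.min(text.length(), i + maxLen);
            for (int j = i; j < limit; j++) {
                if (!dirtyChars.contains(text.charAt(j))) {
                    break;
                }
                String key = text.substring(i, j + 1);
                if (words.contains(key)) {
                    hits.add(key);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        BoundedScanSketch filter = new BoundedScanSketch(Arrays.asList("bad", "evil"));
        System.out.println(filter.findAll("badbadbad")); // prints [bad, bad, bad]
    }
}
```

With the cap, the work per start position is bounded by the longest word rather than by the length of the dirty run.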
