使用贝叶斯统计进行垃圾邮件过滤

首页 > C/C++, 算法 > 使用贝叶斯统计进行垃圾邮件过滤

使用贝叶斯统计进行垃圾邮件过滤

2013年10月17日 kulv 发表评论阅读评论 8719次阅读

占个坑，贝叶斯进行垃圾邮件过滤方法简单，效果也挺明显的。基本能满足一般需求。具体算法不多说，google即可。

贝叶斯在线上使用的准确性其实严重依靠人工的调整的，需要进行巧妙的调整才能极大的提高降低误判率。

下面贴个代码，从其他地方考过来改吧改吧的。

#!/usr/bin/env python
#coding:utf-8

import json
import mmseg

class Bayes():

	def __init__(self):
		self.spam_words_count = 0;
		self.healthy_words_count = 0;
		self.words_value = {};

	def learn(self, sentence, is_spam):
		words = mmseg.seg_txt(sentence)
		words = list(words)
		print json.dumps(words, encoding="UTF-8", ensure_ascii=False)
		for flag, word in enumerate(words):
			#print str(flag) + " -- " + word
			pword = False
			if word in self.words_value:
				pword = self.words_value[word]
			if is_spam :
				if not pword:
					self.words_value[word] = '1:0'
				else:
					pword = pword.split(':')
					cs = pword[0]
					ch = pword[1]
					self.words_value[word] = str(int(cs) + 1) + ':' +ch
			else:
				if not pword:
					self.words_value[word] = '0:1'
				else:
					pword = pword.split(':')
					cs = pword[0]
					ch = pword[1]
					self.words_value[word] = cs + ':' +str(int(ch) + 1);

		if is_spam:
			self.spam_words_count = self.spam_words_count + 1
		else:
			self.healthy_words_count = self.healthy_words_count + 1

	def cal(self, sentence):
		total_s = int(self.spam_words_count)
		total_h = int(self.healthy_words_count)
		ps = ph = 0.5   # 在判定之前这个邮件是垃圾邮件的概率
		words_percent = []
		words = mmseg.seg_txt(sentence)
		words = set(words)
		pws = None
		pwh = None

		for flag, word in enumerate(words):
            #每篇文章包含某词则加一，而不是如果有这个词就加一
			value = False
			if word in self.words_value:
				value = self.words_value[word]
			if value:
				tvalue = value.split(':')
				cs = int(tvalue[0])
				ch = int(tvalue[1])
				pws = float(cs) / total_s  # 所有文章中，出现这个为spam的概率
				pwh = float(ch) / total_h  # 所有文章中，出现这个为healthy的概率
			else:
				pws = pwh = 0.01
			if pws == 0:
				pws = 0.01
			if pwh == 0:
				pwh = 0.01
			psw = ps * pws / (ps * pws + pwh * ph)
			words_percent.append(psw)

		j = 1
		for i in words_percent:
			j = j * i
			x = 1
		for i in words_percent:
			x = (1 - i) * x
		judge_value = j / (j + x)
		return judge_value

	def learnFile(self, path, isspam ):
		olines = []
		ofp = open(path, "r")
		try:
			olines = set(ofp.readlines())
		except Exception as e:
			print "read file error!", "file: " + path
			print e
		finally:
			ofp.close();
		for l in olines:
			self.learn( l, isspam)
		return ;

bayes = Bayes()

def main():
    test()

def test():
	testwords = ['想买魔声耳机的加微信gcy_809718332', '不错，好听', '微信不错']
	goodfile = "good.txt"
	badfile = "bad.txt"

	bayes.learnFile(goodfile, False)
	bayes.learnFile(badfile, True)

	for w in testwords:
		pct = bayes.cal(w)
		print w + "  -- " + str(pct)

if __name__ == "__main__":
	main()

下面写几点实际应用中可能需要注意的问题。

1. 垃圾邮件数据的选取，这里需要人工选取足够的垃圾邮件进去，尽量保持客观；

2.正常邮件的样本选取，这个也是得客观选取一些。

上面2点是废话，值得注意的是，很可能大部分的垃圾邮件里面都含有"QQ", "微信" 等，这样可能会导致正常的含有QQ的邮件被标记为垃圾邮件，这个可以通过人工加入足够多的QQ关键词样本邮件，从而变相的条件QQ的概率。

或者也可以通过加白名单过滤的策略，比如移除助词等。

另外，为了提高准确度，其实还可以考虑条件概率，这样做更准确。比如“代理”出现在“广告”之后的话，更像是垃圾邮件。这种方法会更加准确。改天弄一个试试。

分类: C/C++, 算法标签: 垃圾邮件, 贝叶斯

评论 (3) Trackbacks (0) 发表评论 Trackback

你哥

2013年10月17日23:19 | #1

回复 | 引用

要点是一个随机指纹产生器，一个是映射器。思路很清晰的，但是没实际用过。@kulv
kulv

2013年10月17日11:34 | #2

回复 | 引用

@你哥
good，回去看看哈。这个做哈希命中判断的吧
你哥

2013年10月17日01:23 | #3

回复 | 引用

布隆过滤器@《数学之美》

本文目前尚无任何 trackbacks 和 pingbacks.

Redis源码学习-AOF数据持久化原理分析(1) Redis 2.8版部分同步功能源码浅析-Replication Partial Resynchronization

趁着年轻

使用贝叶斯统计进行垃圾邮件过滤

近期文章

近期评论

文章归档

分类目录

我的地址

推荐博客

功能

趁着年轻

使用贝叶斯统计进行垃圾邮件过滤

近期文章

我的标签

近期评论

文章归档

分类目录

我的地址

推荐博客

功能