Channel: R and Python in Data Science (数据科学中的R和Python)

Spam SMS Detection with Naive Bayes Text Classification in Python

1. Load the data; type is the message category and text is the message content

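In [17]:
# IPython magic: load NumPy and matplotlib into the interactive namespace
%pylab inline
import pandas as pd
import numpy as np
# sms_spam.csv has two columns: type (ham/spam) and text
df = pd.read_csv('sms_spam.csv')
df.head()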

Populating the interactive namespace from numpy and matplotlib

Out[17]:
   type  text
0  ham   Hope you are having a good week. Just checking in
1  ham   K..give back my thanks.
2  ham   Am also doing in cbe only. But have to pay.
3  spam  complimentary 4 STAR Ibiza Holiday or £10,000 ...
4  spam  okmail: Dear Dave this is your final notice to...

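2. Use the sklearn package to convert the text into structured data; the resulting matrix is later split into a training set and a test set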

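CountVectorizer converts a collection of documents into a document-term count matrix. Its most important parameters are:

  • ngram_range: the range of n-gram sizes to extract; set it to something like (1, 2) if phrases need to be recognized
  • stop_words: the list of stop words to drop
  • token_pattern: the regular expression that defines a token; the default keeps runs of two or more alphanumeric characters and treats punctuation as a separator
  • max_df: upper bound on document frequency; terms that appear in more documents than this are not used as features, which filters out very common words
  • min_df: lower bound on document frequency; terms that appear in fewer documents than this are not used as features
  • max_features: keep only the terms with the highest frequency across the corpus as features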
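In [18]:
from sklearn.feature_extraction.text import CountVectorizer
# Unigrams only, English stop words removed, text lowercased, no minimum document frequency
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words='english', lowercase=True, min_df=1)
X = vectorizer.fit_transform(df.text)
# Encode the label as 1 for spam and 0 for ham
y = (df.type == 'spam').values.astype(int)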
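To make the document-frequency filters more concrete, here is a minimal pure-Python sketch of a document-term count matrix with min_df and max_df filtering. It is an illustration only, not scikit-learn's implementation: the helper names build_vocabulary and count_matrix are hypothetical, tokens are split on whitespace for simplicity, and the three messages are made up.

```python
from collections import Counter

def build_vocabulary(docs, min_df=1, max_df=1.0):
    """Return the sorted vocabulary after document-frequency filtering.

    Assumption for this sketch: min_df is an absolute document count and
    max_df is a proportion of documents.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency counts each term once per document it appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    return sorted(
        term for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df
    )

def count_matrix(docs, vocab):
    """Build one row of term counts per document, one column per vocabulary term."""
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.lower().split():
            if token in index:
                row[index[token]] += 1
        rows.append(row)
    return rows

# Made-up toy messages
docs = ["free prize call now", "call me now", "free free holiday prize"]
vocab = build_vocabulary(docs, min_df=2)   # drop terms seen in only one message
matrix = count_matrix(docs, vocab)
print(vocab)
print(matrix)
```

Note that "free" occurs twice in the last message but still has a document frequency of 2, because document frequency counts messages, not occurrences.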

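TfidfVectorizer computes TF-IDF weights instead of raw term counts. Note that the cell below overwrites X, so the later steps work on the TF-IDF matrix.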

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Same settings as above, but each count is reweighted by inverse document frequency
vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words='english', lowercase=True, min_df=1)
X = vectorizer.fit_transform(df.text)
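As a rough illustration of what a TF-IDF weight is, the sketch below computes it by hand. It assumes the formula documented for scikit-learn's default settings (smoothed idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The function name tfidf, the whitespace tokenizer, and the toy messages are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where each row holds L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger weights
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Made-up toy messages
vocab, rows = tfidf(["call now", "call me", "free prize now"])
for row in rows:
    print([round(w, 3) for w in row])
```

In the second message, "me" appears in only one document while "call" appears in two, so "me" ends up with the larger weight.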

3. Split the data into training and test sets

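In [20]:
# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# By default 25% of the rows are held out as the test set
xtrain, xtest, ytrain, ytest = train_test_split(X, y)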
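For readers who want to see what such a split amounts to, here is a minimal sketch: shuffle the row indices, then hold out a fraction of them. The function simple_train_test_split and its seed parameter are hypothetical names for this example; it is not scikit-learn's implementation.

```python
import math
import random

def simple_train_test_split(X, y, test_size=0.25, seed=None):
    """Shuffle indices and hold out a test_size fraction as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Made-up data: eight messages with alternating labels
messages = ["msg %d" % i for i in range(8)]
labels = [i % 2 for i in range(8)]
x_tr, x_te, y_tr, y_te = simple_train_test_split(messages, labels, seed=42)
print(len(x_tr), len(x_te))
```

Fixing the seed makes the split reproducible, which helps when comparing accuracy figures across runs.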

4. Train a naive Bayes classifier

  • The key parameter alpha sets the additive (Laplace/Lidstone) smoothing coefficient
In [21]:
from sklearn.naive_bayes import MultinomialNB
# alpha=1 is classic Laplace smoothing
clf = MultinomialNB(alpha=1).fit(xtrain, ytrain)
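The role of alpha is easier to see in a hand-written multinomial naive Bayes. The sketch below is a simplified illustration on made-up messages, assuming whitespace tokens and raw counts; train_nb and predict_nb are hypothetical helper names, not part of scikit-learn.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class (alpha must be > 0 here)."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())
    model = {}
    for label in class_docs:
        total = sum(term_counts[label].values())
        log_prior = math.log(class_docs[label] / len(docs))
        # Additive smoothing: every term gets alpha pseudo-counts, so unseen terms never get probability 0
        log_lik = {
            t: math.log((term_counts[label][t] + alpha) / (total + alpha * len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; out-of-vocabulary terms are ignored."""
    scores = {
        label: log_prior + sum(log_lik[t] for t in doc.split() if t in log_lik)
        for label, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

# Made-up toy training set
docs = ["free prize now", "win free holiday", "see you tomorrow", "call me tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels, alpha=1.0)
print(predict_nb(model, "free holiday now"))
print(predict_nb(model, "see you tomorrow"))
```

Without smoothing, a single word that never occurred in spam would drive the spam likelihood of a whole message to zero; a larger alpha pulls all term probabilities closer to uniform.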

5. Evaluate the classifier

In [22]:
# score() returns the mean accuracy on the given data
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)
print("Training accuracy: {:.2f}".format(training_accuracy))
print("Test accuracy: {:.2f}".format(test_accuracy))

Training accuracy: 0.98
Test accuracy: 0.97
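Accuracy is the share of messages labeled correctly. As a small aside, the sketch below shows how accuracy can be computed by hand next to precision and recall for the spam class, which can be informative when spam is much rarer than ham. The function spam_metrics and the label vectors are made up for illustration; they are not results from this dataset.

```python
def spam_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) treating `positive` as the spam label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels: 3 spam and 7 ham messages, only one spam caught
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
accuracy, precision, recall = spam_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

In this toy case accuracy looks respectable even though two of the three spam messages are missed, which is what recall exposes.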

6. Use cross-validation (CV) to select the best parameter; the best alpha turns out to be 0.2

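In [23]:
# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV
nb = MultinomialNB()
# Candidate alpha values 0, 0.1, ..., 10; alpha=0 means no smoothing and may trigger a warning
parameters = {'alpha': np.linspace(0, 10, 101)}
clf = GridSearchCV(nb, parameters)
clf.fit(X, y)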
Out[23]:

GridSearchCV(cv=None,
estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
fit_params={}, iid=True, loss_func=None, n_jobs=1,
param_grid={'alpha': array([ 0. , 0.1, ..., 9.9, 10. ])},
pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
verbose=0)
In [100]:
print("Best alpha: {:.3f}".format(clf.best_params_['alpha']))
print("Best accuracy: {:.3f}".format(clf.best_score_))

Best alpha: 0.200
Best accuracy: 0.984
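The idea behind a cross-validated grid search can be sketched in a few lines: for each candidate value, average a score over k train/test folds and keep the value with the best average. The sketch below is an assumption-laden illustration, not scikit-learn's implementation: k_fold_indices and simple_grid_search are hypothetical names, the folds are plain contiguous blocks (no shuffling or stratification), and toy_score is an artificial stand-in for "fit a model on the training fold and score it on the test fold".

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; return (train, test) index lists."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds

def simple_grid_search(fit_and_score, param_values, n_samples, k=3):
    """Return (best_value, mean_scores) where mean_scores maps each value to its mean CV score."""
    mean_scores = {}
    for value in param_values:
        scores = [fit_and_score(value, train, test)
                  for train, test in k_fold_indices(n_samples, k)]
        mean_scores[value] = sum(scores) / len(scores)
    best_value = max(mean_scores, key=mean_scores.get)
    return best_value, mean_scores

def toy_score(alpha, train_idx, test_idx):
    # Artificial stand-in: a real version would train on train_idx and evaluate on test_idx
    return 1.0 - abs(alpha - 0.2)

best_alpha, mean_scores = simple_grid_search(toy_score, [0.1, 0.2, 0.5], n_samples=7, k=3)
print(best_alpha)
```

Because every candidate is scored on data it was not trained on, the selected value is less likely to be an artifact of one particular split.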

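In [94]:
import matplotlib.pyplot as plt
# cv_results_ (on sklearn.model_selection.GridSearchCV) replaces the removed grid_scores_ attribute
accuracy = clf.cv_results_['mean_test_score']
para = [p['alpha'] for p in clf.cv_results_['params']]
# Plot mean cross-validated accuracy against alpha
plt.plot(para, accuracy, lw=3)
Out[94]:

[Figure: mean cross-validated accuracy plotted against alpha]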
