Question:
The NLTK package provides a method show_most_informative_features()
to find the most important features for both classes, with output like:
contains(outstanding) = True pos : neg = 11.1 : 1.0
contains(seagal) = True neg : pos = 7.7 : 1.0
contains(wonderfully) = True pos : neg = 6.8 : 1.0
contains(damon) = True pos : neg = 5.9 : 1.0
contains(wasted) = True neg : pos = 5.8 : 1.0
As answered in this question, How to get most informative features for scikit-learn classifiers?, this can also be done in scikit-learn. However, for a binary classifier, the answer in that question only outputs the best features themselves.
So my question is: how can I identify each feature's associated class, as in the example above (outstanding is most informative for the pos class, and seagal is most informative for the neg class)?
EDIT: What I actually want is a list of the most informative words for each class. How can I do that? Thanks!
Answer #1:
In the case of binary classification, it seems like the coefficient array has been flattened.
Let’s try to relabel our data with only two labels:
import codecs, re, time
from itertools import chain
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = 'train.txt'

# Vectorizing data: each line of the training file is one document.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'bs', 'pt']

# Training NB
mnb = MultinomialNB()
mnb.fit(trainset, tags)

print(mnb.classes_)
print(mnb.coef_[0])
print(mnb.coef_[1])  # raises an IndexError, see the traceback below
[out]:
['bs' 'pt']
[-5.55682806 -4.86368088 -4.86368088 -5.55682806 -5.55682806 -5.55682806
-4.86368088 -4.86368088 -5.55682806 -5.55682806 -4.86368088 -4.86368088
-4.1705337 -5.55682806 -4.86368088 -5.55682806 -4.86368088 -5.55682806
-5.55682806 -5.55682806 -4.86368088 -4.45821577 -4.86368088 -4.86368088
-4.86368088 -4.86368088 -5.55682806 -4.86368088 -5.55682806 -4.86368088
-4.86368088 -4.86368088 -4.86368088 -4.86368088 -5.55682806 -5.55682806
-5.55682806 -5.55682806 -5.55682806 -4.45821577 -4.86368088 -4.86368088
-4.86368088 -4.86368088 -4.86368088 -5.55682806 -5.55682806 -4.86368088
-4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
-4.86368088 -5.55682806 -5.55682806 -5.55682806 -5.55682806 -5.55682806
-5.55682806 -5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.86368088
-4.86368088 -5.55682806 -5.55682806 -4.86368088 -5.55682806 -4.86368088
-5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.45821577 -4.86368088
-4.86368088 -4.45821577 -4.86368088 -4.86368088 -4.86368088 -5.55682806
-4.86368088 -5.55682806 -5.55682806 -4.86368088 -5.55682806 -5.55682806
-4.86368088 -5.55682806 -4.86368088 -4.86368088 -4.86368088 -5.55682806
-5.55682806 -5.55682806 -4.86368088 -4.86368088 -5.55682806 -4.86368088
-5.55682806 -4.86368088 -5.55682806 -4.86368088 -5.55682806 -5.55682806
-5.55682806 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
-4.86368088 -4.1705337 -4.86368088 -4.86368088 -5.55682806 -4.86368088
-4.86368088 -4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088
-4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088 -4.86368088
-4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
-5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.86368088 -4.86368088
-4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088 -5.55682806
-4.86368088 -4.45821577 -4.86368088 -4.86368088]
Traceback (most recent call last):
  File "test.py", line 24, in <module>
    print(mnb.coef_[1])
IndexError: index 1 is out of bounds for axis 0 with size 1
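The error already hints at what is going on: with two classes, coef_ has only one row, while feature_log_prob_ keeps one row per class. As a quick check, and as a more direct way to get the per-class word lists the question asks for, something along these lines should work (a sketch assuming the mnb and word_vectorizer fitted above):
# Shapes: coef_ collapses to a single row in the binary case,
# while feature_log_prob_ keeps one row per class.
print(mnb.coef_.shape)               # (1, n_features)
print(mnb.feature_log_prob_.shape)   # (2, n_features)

# Top N words per class, ranked by how much more likely each word is in that
# class than in the other (log-probability difference, similar in spirit to
# NLTK's pos:neg ratio).
feature_names = word_vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
N = 10
for i, label in enumerate(mnb.classes_):
    diff = mnb.feature_log_prob_[i] - mnb.feature_log_prob_[1 - i]
    top = np.argsort(diff)[::-1][:N]
    print(label, [feature_names[j] for j in top])
The log-probability difference plays the same role as NLTK's likelihood ratio, just on a log scale.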
Now, back to the flattened coef_ array, let's do some diagnostics:
print mnb.feature_count_
print mnb.coef_[0]
[out]:
[[ 1. 0. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1.
1. 1. 2. 2. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 2. 1.
1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0.
0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0.
1. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 1. 1.
0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1.
1. 0. 0. 1. 0. 0. 0. 4. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0.
0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0.
0. 0. 1. 0. 0. 1. 0. 0. 0. 0.]
[ 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 3. 0. 1. 0. 1. 0.
0. 0. 1. 2. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 0.
0. 0. 0. 2. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1.
1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1. 0. 1.
0. 0. 1. 1. 2. 1. 1. 2. 1. 1. 1. 0. 1. 0. 0. 1. 0. 0.
1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0.
0. 1. 1. 0. 1. 1. 1. 3. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1.
1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1. 1. 1. 1.
1. 1. 0. 1. 1. 0. 1. 2. 1. 1.]]
[-5.55682806 -4.86368088 -4.86368088 -5.55682806 -5.55682806 -5.55682806
-4.86368088 -4.86368088 -5.55682806 -5.55682806 -4.86368088 -4.86368088
-4.1705337 -5.55682806 -4.86368088 -5.55682806 -4.86368088 -5.55682806
-5.55682806 -5.55682806 -4.86368088 -4.45821577 -4.86368088 -4.86368088
-4.86368088 -4.86368088 -5.55682806 -4.86368088 -5.55682806 -4.86368088
-4.86368088 -4.86368088 -4.86368088 -4.86368088 -5.55682806 -5.55682806
-5.55682806 -5.55682806 -5.55682806 -4.45821577 -4.86368088 -4.86368088
-4.86368088 -4.86368088 -4.86368088 -5.55682806 -5.55682806 -4.86368088
-4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
-4.86368088 -5.55682806 -5.55682806 -5.55682806 -5.55682806 -5.55682806
-5.55682806 -5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.86368088
-4.86368088 -5.55682806 -5.55682806 -4.86368088 -5.55682806 -4.86368088
-5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.45821577 -4.86368088
-4.86368088 -4.45821577 -4.86368088 -4.86368088 -4.86368088 -5.55682806
-4.86368088 -5.55682806 -5.55682806 -4.86368088 -5.55682806 -5.55682806
-4.86368088 -5.55682806 -4.86368088 -4.86368088 -4.86368088 -5.55682806
-5.55682806 -5.55682806 -4.86368088 -4.86368088 -5.55682806 -4.86368088
-5.55682806 -4.86368088 -5.55682806 -4.86368088 -5.55682806 -5.55682806
-5.55682806 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
-4.86368088 -4.1705337 -4.86368088 -4.86368088 -5.55682806 -4.86368088
-4.86368088 -4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088
-4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088 -4.86368088
-4.86368088 -4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088
-5.55682806 -5.55682806 -4.86368088 -4.86368088 -4.86368088 -4.86368088
-4.86368088 -4.86368088 -5.55682806 -4.86368088 -4.86368088 -5.55682806
-4.86368088 -4.45821577 -4.86368088 -4.86368088]
It seems the features are counted per class, and the coefficients were then flattened into a single row to save memory, so let's pair each coefficient with its feature name and its count in each class:
# Pair each feature with its flattened coefficient and its raw count in each class.
coef_features_c1_c2 = []
for index, (feat, c1, c2) in enumerate(zip(word_vectorizer.get_feature_names(),
                                           mnb.feature_count_[0],
                                           mnb.feature_count_[1])):
    coef_features_c1_c2.append((mnb.coef_[0][index], feat, c1, c2))

for i in sorted(coef_features_c1_c2):
    print(i)
[out]:
(-5.5568280616995374, u'acuerdo', 1.0, 0.0)
(-5.5568280616995374, u'al', 1.0, 0.0)
(-5.5568280616995374, u'alex', 1.0, 0.0)
(-5.5568280616995374, u'algo', 1.0, 0.0)
(-5.5568280616995374, u'andaba', 1.0, 0.0)
(-5.5568280616995374, u'andrea', 1.0, 0.0)
(-5.5568280616995374, u'bien', 1.0, 0.0)
(-5.5568280616995374, u'buscando', 1.0, 0.0)
(-5.5568280616995374, u'como', 1.0, 0.0)
(-5.5568280616995374, u'con', 1.0, 0.0)
(-5.5568280616995374, u'conseguido', 1.0, 0.0)
(-5.5568280616995374, u'distancia', 1.0, 0.0)
(-5.5568280616995374, u'doprinese', 1.0, 0.0)
(-5.5568280616995374, u'es', 2.0, 0.0)
(-5.5568280616995374, u'estxe1', 1.0, 0.0)
(-5.5568280616995374, u'eulex', 1.0, 0.0)
(-5.5568280616995374, u'excusa', 1.0, 0.0)
(-5.5568280616995374, u'fama', 1.0, 0.0)
(-5.5568280616995374, u'guasch', 1.0, 0.0)
(-5.5568280616995374, u'ha', 1.0, 0.0)
(-5.5568280616995374, u'incident', 1.0, 0.0)
(-5.5568280616995374, u'ispit',