Question :
I have a machine learning classification problem where 80% of the variables are categorical. Must I use one-hot encoding if I want to use a classifier for the classification? Can I pass the data to a classifier without the encoding?
I am trying to do the following for feature selection:
- I read the train file:

  num_rows_to_read = 10000
  train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)

- I change the type of the categorical features to 'category':

  non_categorial_features = ['orig_destination_distance', 'srch_adults_cnt',
                             'srch_children_cnt', 'srch_rm_cnt', 'cnt']
  for categorical_feature in list(train_small.columns):
      if categorical_feature not in non_categorial_features:
          train_small[categorical_feature] = train_small[categorical_feature].astype('category')

- I use one-hot encoding:

  train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
The problem is that the third step often gets stuck, even though I am using a strong machine.
Without the one-hot encoding I cannot do any feature selection to determine the importance of the features.
What do you recommend?
Answer #1:
Approach 1: You can use pandas' pd.get_dummies.
Example 1:
import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]:
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
Example 2:
The following will transform a given column into one-hot. Use the prefix argument to keep the dummy columns distinguishable when you encode more than one column; a one-step variant is sketched after this example.
import pandas as pd
df = pd.DataFrame({
'A':['a','b','a'],
'B':['b','a','c']
})
df
Out[]:
A B
0 a b
1 b a
2 a c
# Get one-hot encoding of column B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df
Out[]:
A a b c
0 a 0 1 0
1 b 1 0 0
2 a 0 0 1
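If you would rather not drop and join by hand, get_dummies can also take the whole frame plus a columns argument, using each source column name as the prefix by default. A minimal sketch of the same transformation (the dummy dtype, uint8 or bool, depends on your pandas version):

import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c']})

# Encode column B in one step; dummies are named B_a, B_b, B_c
# and the original B column is dropped automatically
df = pd.get_dummies(df, columns=['B'])

df now has columns A, B_a, B_b, B_c, matching the result above.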
Approach 2: Use scikit-learn

Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform on some other data using the same instance. We also have handle_unknown to further control what the encoder does with unseen data.
Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]])
Here is the link for this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
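Note that this snippet reflects a pre-0.22 scikit-learn API: n_values_ and feature_indices_ have since been removed. A rough modern equivalent (sketched against scikit-learn >= 0.22, so check your version) uses categories_ and handle_unknown:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder(handle_unknown='ignore')  # unseen values encode as all zeros
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
>>> enc.transform([[0, 1, 1]]).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0., 0.]])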
Answer #2:
It is much easier to use pandas for basic one-hot encoding. If you're looking for more options, you can use scikit-learn.
For basic one-hot encoding with Pandas you pass your data frame into the get_dummies function.
For example, if I have a dataframe called imdb_movies:
…and I want to one-hot encode the Rated column, I do this:
pd.get_dummies(imdb_movies.Rated)
This returns a new dataframe with a column for every "level" of rating that exists, along with either a 1 or 0 specifying the presence of that rating for a given observation.

Usually, we want this to be part of the original dataframe. In this case, we attach our new dummy-coded frame onto the original frame using "column-binding".
We can column-bind by using Pandas concat function:
rated_dummies = pd.get_dummies(imdb_movies.Rated)
pd.concat([imdb_movies, rated_dummies], axis=1)
We can now run an analysis on our full dataframe.
SIMPLE UTILITY FUNCTION
I would recommend making yourself a utility function to do this quickly:
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return res
Usage:
encode_and_bind(imdb_movies, 'Rated')
Result:
Also, as per @pmalbu's comment, if you would like the function to remove the original feature_to_encode, then use this version:
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res
You can encode multiple features at the same time as follows (note the reassignment of train_set, so each pass builds on the previous result; a one-call alternative is sketched below):

features_to_encode = ['feature_1', 'feature_2', 'feature_3',
                      'feature_4']
for feature in features_to_encode:
    train_set = encode_and_bind(train_set, feature)
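Alternatively, pd.get_dummies can encode all of these columns in one call through its columns argument; the frame below is a made-up stand-in for train_set:

import pandas as pd

# Hypothetical stand-in for train_set
train_set = pd.DataFrame({'feature_1': ['a', 'b'], 'feature_2': ['x', 'y'],
                          'feature_3': ['p', 'q'], 'feature_4': ['m', 'n'],
                          'target': [0, 1]})
features_to_encode = ['feature_1', 'feature_2', 'feature_3', 'feature_4']

# One call instead of a loop; the original columns are dropped automatically
train_set = pd.get_dummies(train_set, columns=features_to_encode)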
Answer #3:
You can do it with numpy.eye and the array element selection mechanism:
import numpy as np

nb_classes = 6
data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]
The return value of indices_to_one_hot(data, nb_classes) is now

array([[ 0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.]])
The .reshape(-1) is there to make sure you have the right label format (you might also have [[2], [3], [4], [0]]).
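A quick check that both layouts come out the same under the reshape(-1):

import numpy as np

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]

# Flat list and column-vector list produce the identical 4 x 6 array
flat = indices_to_one_hot([2, 3, 4, 0], 6)
nested = indices_to_one_hot([[2], [3], [4], [0]], 6)
print(np.array_equal(flat, nested))  # True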
Answer #4:
First, the easiest way to one-hot encode: use scikit-learn.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
Second, I don't think using pandas for one-hot encoding is that simple (though I have not confirmed this):
Creating dummy variables in pandas for python
Lastly, is it necessary for you to one-hot encode? One-hot encoding greatly increases the number of features, drastically increasing the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead you can use label (integer) coding, which this answer calls dummy coding.
Label coding usually works well, for much less run time and complexity. A wise prof once told me, "Less is More".
Here’s the code for my custom encoding function if you want.
from sklearn.preprocessing import LabelEncoder

# Auto-encodes any dataframe column of type category or object.
def dummyEncode(df):
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            print('Error encoding ' + feature)
    return df
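Usage is a single call; here with a small made-up frame, since the function picks up every object or category column on its own (LabelEncoder assigns integer codes in sorted label order):

import pandas as pd

df = pd.DataFrame({'animal': ['dog', 'cat', 'mouse'], 'weight': [30.0, 4.0, 0.02]})
df = dummyEncode(df)
# 'animal' is now integer-coded (cat=0, dog=1, mouse=2); 'weight' is untouched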
EDIT: A comparison to make this clearer:

One-hot encoding: convert n levels to n-1 columns (here the first level, dog, is dropped as the baseline).
Index Animal Index cat mouse
1 dog 1 0 0
2 cat --> 2 1 0
3 mouse 3 0 1
You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.
Label (integer) coding, which this answer calls dummy coding:
Index Animal Index Animal
1 dog 1 0
2 cat --> 2 1
3 mouse 3 2
This converts each level to a numerical representation instead. It greatly saves feature space, at the cost of a bit of accuracy, since the integer codes impose an artificial ordering on the levels.
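To put rough numbers on the trade-off, here is a small sketch comparing the memory footprint of one-hot columns against integer codes for a single high-cardinality feature (exact sizes vary with your pandas version):

import numpy as np
import pandas as pd

# One column with 1,000 distinct levels, 10,000 rows
s = pd.Series(np.random.randint(0, 1000, size=10_000).astype(str))

one_hot_bytes = pd.get_dummies(s).memory_usage(deep=True).sum()  # ~1,000 columns
label_bytes = s.astype('category').cat.codes.memory_usage(deep=True)  # 1 column
print(one_hot_bytes, label_bytes)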
Answer #5:
One hot encoding with pandas is very easy:
def one_hot(df, cols):
    """
    @param df pandas DataFrame
    @param cols a list of columns to encode
    @return a DataFrame with one-hot encoding
    """
    for each in cols:
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df
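Usage of the one_hot function above, with a made-up frame (note that the original columns are kept alongside the new dummies):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': ['S', 'M', 'L']})
df = one_hot(df, ['color', 'size'])
# df now also has color_blue, color_red, size_L, size_M, size_S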
EDIT:
Another way to one-hot encode, using sklearn's LabelBinarizer:
from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
label_binarizer.fit(all_your_labels_list)  # fit once on the full label set; keep this instance around to use later
def one_hot_encode(x):
    """
    One-hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample labels
    : return: Numpy array of one-hot encoded labels
    """
    return label_binarizer.transform(x)
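For example, with a made-up label set (LabelBinarizer orders its classes alphabetically):

from sklearn.preprocessing import LabelBinarizer

label_binarizer = LabelBinarizer()
label_binarizer.fit(['dog', 'cat', 'mouse'])  # classes_ becomes ['cat', 'dog', 'mouse']

print(one_hot_encode(['cat', 'mouse']))
# [[1 0 0]
#  [0 0 1]]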
Answer #6:
You can use the numpy.eye function.
import numpy as np

def one_hot_encode(x, n_classes):
    """
    One-hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample labels
    : return: Numpy array of one-hot encoded labels
    """
    return np.eye(n_classes)[x]

def main():
    labels = [0, 1, 2, 3, 4, 3, 2, 1, 0]  # renamed from 'list' to avoid shadowing the built-in
    n_classes = 5
    one_hot_labels = one_hot_encode(labels, n_classes)
    print(one_hot_labels)

if __name__ == "__main__":
    main()
Result
D:\Desktop>python test.py
[[ 1. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 1. 0. 0. 0. 0.]]
Answer #7:
pandas has the built-in function get_dummies to get the one-hot encoding of a particular column (or columns).

One-line code for one-hot encoding:

df = pd.concat([df, pd.get_dummies(df['column name'], prefix='column name')], axis=1).drop(['column name'], axis=1)
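Spelled out on a small made-up frame, with 'city' standing in for 'column name':

import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'NY'], 'price': [1, 2, 3]})
df = pd.concat([df, pd.get_dummies(df['city'], prefix='city')], axis=1).drop(['city'], axis=1)
print(list(df.columns))  # ['price', 'city_LA', 'city_NY']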
Answer #8:
Here is a solution using DictVectorizer and the pandas DataFrame.to_dict('records') method.
>>> import pandas as pd
>>> X = pd.DataFrame({'income': [100000, 110000, 90000, 30000, 14000, 50000],
...                   'country': ['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],
...                   'race': ['White', 'Black', 'Latino', 'White', 'White', 'Black']})
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer()
>>> qualitative_features = ['country','race']
>>> X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))
>>> v.vocabulary_
{'country=CAN': 0,
'country=MEX': 1,
'country=US': 2,
'race=Black': 3,
'race=Latino': 4,
'race=White': 5}
>>> X_qual.toarray()
array([[ 0., 0., 1., 0., 0., 1.],
[ 1., 0., 0., 1., 0., 0.],
[ 0., 0., 1., 0., 1., 0.],
[ 1., 0., 0., 0., 0., 1.],
[ 0., 1., 0., 0., 0., 1.],
[ 0., 0., 1., 1., 0., 0.]])
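To keep the column meanings attached to that matrix, you can pull the feature names back out of the vectorizer (get_feature_names_out assumes scikit-learn >= 1.0; older releases spell it get_feature_names):

>>> v.get_feature_names_out()
array(['country=CAN', 'country=MEX', 'country=US', 'race=Black',
       'race=Latino', 'race=White'], dtype=object)
>>> X_onehot = pd.DataFrame(X_qual.toarray(), columns=v.get_feature_names_out())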