### Question:

I’m trying to create a series of dummy variables from a categorical variable using pandas in Python. I’ve come across the `get_dummies` function, but whenever I try to call it I receive an error that the name is not defined.

Any thoughts or other ways to create the dummy variables would be appreciated.

**EDIT**: Since others seem to be coming across this, the `get_dummies` function in pandas now works perfectly fine. This means the following should work:

```
import pandas as pd
dummies = pd.get_dummies(df['Category'])
```

See http://blog.yhathq.com/posts/logistic-regression-and-python.html for further information.
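For example, on a toy DataFrame (the data here is made up) this produces one indicator column per category:

```
import pandas as pd

# a toy DataFrame; the column name 'Category' follows the question
df = pd.DataFrame({'Category': ['a', 'b', 'a', 'c']})

# recent pandas returns boolean columns, so cast to 0/1 integers for display
dummies = pd.get_dummies(df['Category']).astype(int)
print(dummies)
```

Each unique value becomes its own column, with a 1 in the rows where that value occurred.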

## Answer #1:

It’s hard to infer what you’re looking for from the question, but my best guess is as follows.

If we assume you have a DataFrame where some column is ‘Category’ and contains integers (or otherwise unique identifiers) for categories, then we can do the following.

Call the DataFrame `dfrm`, and assume that for each row, `dfrm['Category']` is some value in the set of integers from 1 to N. Then,

```
for elem in dfrm['Category'].unique():
    dfrm[str(elem)] = dfrm['Category'] == elem
```

Now there will be a new indicator column for each category that is True/False depending on whether the data in that row are in that category.

If you want to control the category names, you could make a dictionary, such as

```
cat_names = {1: 'Some_Treatment', 2: 'Full_Treatment', 3: 'Control'}
for elem in dfrm['Category'].unique():
    dfrm[cat_names[elem]] = dfrm['Category'] == elem
```

to result in having columns with specified names, rather than just a string conversion of the category values. In fact, for some types, `str()` may not produce anything useful for you.
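If you prefer 0/1 integers to True/False indicators, the same comparison can be cast with `astype(int)`. A small sketch with made-up data:

```
import pandas as pd

# hypothetical data matching the setup above: integer category labels
dfrm = pd.DataFrame({'Category': [1, 2, 3, 1, 2]})

for elem in dfrm['Category'].unique():
    # cast the boolean comparison to 0/1 integers
    dfrm[str(elem)] = (dfrm['Category'] == elem).astype(int)

print(dfrm)
```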

## Answer #2:

When I think of dummy variables I think of using them in the context of OLS regression, and I would do something like this:

```
import numpy as np
import pandas as pd
import statsmodels.api as sm

my_data = np.array([[5, 'a', 1],
                    [3, 'b', 3],
                    [1, 'b', 2],
                    [3, 'a', 1],
                    [4, 'b', 2],
                    [7, 'c', 1],
                    [7, 'c', 1]])
df = pd.DataFrame(data=my_data, columns=['y', 'dummy', 'x'])
just_dummies = pd.get_dummies(df['dummy'])

step_1 = pd.concat([df, just_dummies], axis=1)
# to run the regression we want to get rid of the strings 'a', 'b', 'c' (obviously)
# and we want to get rid of one dummy variable to avoid the dummy variable trap;
# arbitrarily chose "c": coefficients on "a" and "b" show the effect of "a" and "b"
# relative to "c"
step_1.drop(['dummy', 'c'], inplace=True, axis=1)
# everything is stored as strings after np.array, so convert to int
step_1 = step_1.astype(int)

result = sm.OLS(step_1['y'], sm.add_constant(step_1[['x', 'a', 'b']])).fit()
print(result.summary())
```
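As an aside, recent pandas versions can drop one reference level directly, so the manual `drop(['dummy', 'c'], ...)` step can be replaced with `drop_first=True` (which drops the first level alphabetically, `'a'` here). A sketch with the same made-up categories:

```
import pandas as pd

df = pd.DataFrame({'dummy': ['a', 'b', 'b', 'a', 'b', 'c', 'c']})

# drop_first=True keeps k-1 of the k indicator columns, which avoids the
# dummy variable trap without dropping a column by hand
dummies = pd.get_dummies(df['dummy'], drop_first=True).astype(int)
print(dummies.columns.tolist())
```

The coefficients are then interpreted relative to the dropped first level rather than an arbitrarily chosen one.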

## Answer #3:

Based on the official documentation:

```
dummies = pd.get_dummies(df['Category']).rename(columns=lambda x: 'Category_' + str(x))
df = pd.concat([df, dummies], axis=1)
df = df.drop(['Category'], axis=1)
```

There is also a nice post in the FastML blog.

## Answer #4:

The following code returns a DataFrame with the ‘Category’ column replaced by dummy indicator columns:

```
df_with_dummies = pd.get_dummies(df, prefix='Category_', columns=['Category'])
```

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
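One detail worth knowing (illustrated with made-up data): `get_dummies` already inserts `prefix_sep` (`'_'` by default) between the prefix and the value, so `prefix='Category'` yields columns like `Category_a`, while `prefix='Category_'` as above would yield `Category__a`:

```
import pandas as pd

df = pd.DataFrame({'Category': ['a', 'b'], 'x': [1, 2]})

# the default prefix_sep='_' is added between prefix and value
out = pd.get_dummies(df, prefix='Category', columns=['Category'])
print(out.columns.tolist())
```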

## Answer #5:

For my case, `dmatrices` in `patsy` solved my problem. This function is actually designed to generate dependent and independent variables from a given DataFrame with an R-style formula string, but it can also be used to generate dummy features from categorical features. All you need to do is drop the ‘Intercept’ column that `dmatrices` generates automatically, regardless of your original DataFrame.

```
import pandas as pd
from patsy import dmatrices

df_original = pd.DataFrame({
    'A': ['red', 'green', 'red', 'green'],
    'B': ['car', 'car', 'truck', 'truck'],
    'C': [10, 11, 12, 13],
    'D': ['alice', 'bob', 'charlie', 'alice']},
    index=[0, 1, 2, 3])

_, df_dummyfied = dmatrices('A ~ A + B + C + D', data=df_original, return_type='dataframe')
df_dummyfied = df_dummyfied.drop('Intercept', axis=1)

df_dummyfied.columns
# Index([u'A[T.red]', u'B[T.truck]', u'D[T.bob]', u'D[T.charlie]', u'C'], dtype='object')

df_dummyfied
#    A[T.red]  B[T.truck]  D[T.bob]  D[T.charlie]     C
# 0       1.0         0.0       0.0           0.0  10.0
# 1       0.0         0.0       1.0           0.0  11.0
# 2       1.0         1.0       0.0           1.0  12.0
# 3       0.0         1.0       0.0           0.0  13.0
```
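For comparison, plain pandas can produce the same treatment-coded indicators without patsy: `drop_first=True` mirrors patsy's behavior of dropping one reference level per categorical column (a sketch using the same made-up data):

```
import pandas as pd

df_original = pd.DataFrame({
    'A': ['red', 'green', 'red', 'green'],
    'B': ['car', 'car', 'truck', 'truck'],
    'C': [10, 11, 12, 13],
    'D': ['alice', 'bob', 'charlie', 'alice']})

# drop_first=True drops one reference level per column, like patsy's
# treatment coding ('green', 'car', and 'alice' are the references here)
out = pd.get_dummies(df_original, columns=['A', 'B', 'D'], drop_first=True).astype(int)
print(out.columns.tolist())
```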

## Answer #6:

You can create dummy variables to handle the categorical data:

```
# Creating dummy variables for categorical datatypes
trainDfDummies = pd.get_dummies(trainDf, columns=['Col1', 'Col2', 'Col3', 'Col4'])
```

This will drop the original columns in **trainDf** and append the dummy-variable columns at the end of the **trainDfDummies** dataframe.

**It automatically creates the column names by appending the values at the end of the original column name.**
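A small made-up example of the resulting column names (two categorical columns plus one numeric column that passes through unchanged):

```
import pandas as pd

# hypothetical trainDf with two categorical columns and one numeric column
trainDf = pd.DataFrame({'Col1': ['x', 'y'], 'Col2': ['p', 'q'], 'val': [1, 2]})

trainDfDummies = pd.get_dummies(trainDf, columns=['Col1', 'Col2'])
print(trainDfDummies.columns.tolist())
```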

## Answer #7:

So I was actually needing an answer to this question today (7/25/2013), so I wrote this earlier. I’ve tested it with some toy examples; hopefully you’ll get some mileage out of it.

```
def categorize_dict(x, y=0):
    # x requires string or numerical input
    # y is a boolean that specifies whether to return category names along
    # with the dict; default is no
    cats = list(set(x))
    m = len(x)
    outs = {}
    for i in cats:
        outs[i] = [0] * m
    for i in range(len(x)):
        outs[x[i]][i] = 1
    if y:
        return outs, cats
    return outs
```
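For comparison, the same dict of indicator lists can be produced with pandas itself (made-up input list):

```
import pandas as pd

x = ['a', 'b', 'a', 'c']

# one key per unique value, each mapped to a 0/1 indicator list
outs = pd.get_dummies(pd.Series(x)).astype(int).to_dict('list')
print(outs)
```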

## Answer #8:

I created a dummy variable for every state using this code.

```
def create_dummy_column(series, f):
    return series.apply(f)

for el in df.area_title.unique():
    col_name = el.split()[0] + "_dummy"
    f = lambda x: int(x == el)
    df[col_name] = create_dummy_column(df.area_title, f)

df.head()
```

More generally, I would just use `.apply` and pass it an anonymous function with the condition that defines your category.

(Thank you to @prpl.mnky.dshwshr for the .unique() insight)