Question :
Column names are: ID, 1, 2, 3, 4, 5, 6, 7, 8, 9.
The column values are either 0 or 1.
My dataframe looks like this:
ID 1 2 3 4 5 6 7 8 9
1002 0 1 0 1 0 0 0 0 0
1003 0 0 0 0 0 0 0 0 0
1004 1 1 0 0 0 0 0 0 0
1005 0 0 0 0 1 0 0 0 0
1006 0 0 0 0 0 1 0 0 0
1007 1 0 1 0 0 0 0 0 0
1000 0 0 0 0 0 0 0 0 0
1009 0 0 1 0 0 0 1 0 0
I want one output row per (ID, column) pair where the value is 1, i.e. the column name next to its ID.
The DataFrame I want should look like this:
ID Col2
1002 2 // has 1 at Col(2) and Col(4)
1002 4
1004 1 // has 1 at col(1) and col(2)
1004 2
1005 5 // has 1 at col(5)
1006 6 // has 1 at col(6)
1007 1 // has 1 at col(1) and col(3)
1007 3
1009 3 // has 1 at col(3) and col(7)
1009 7
Please help me with this. Thanks in advance!
Answer #1:
set_index + stack; stack will dropna by default:
df.set_index('ID', inplace=True)
df[df == 1].stack().reset_index().drop(columns=0)
Out[363]:
ID level_1
0 1002 2
1 1002 4
2 1004 1
3 1004 2
4 1005 5
5 1006 6
6 1007 1
7 1007 3
8 1009 3
9 1009 7
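For reference, a self-contained version of this approach (a minimal sketch; the frame is reconstructed from the OP's data, with the column labels assumed to be strings):

```python
import pandas as pd

# Reconstruct the OP's frame (column labels are strings here)
df = pd.DataFrame({
    'ID': [1002, 1003, 1004, 1005, 1006, 1007, 1000, 1009],
    '1': [0, 0, 1, 0, 0, 1, 0, 0],
    '2': [1, 0, 1, 0, 0, 0, 0, 0],
    '3': [0, 0, 0, 0, 0, 1, 0, 1],
    '4': [1, 0, 0, 0, 0, 0, 0, 0],
    '5': [0, 0, 0, 1, 0, 0, 0, 0],
    '6': [0, 0, 0, 0, 1, 0, 0, 0],
    '7': [0, 0, 0, 0, 0, 0, 0, 1],
    '8': [0] * 8,
    '9': [0] * 8,
})

out = df.set_index('ID')
out = (out[out == 1]        # keep only the 1s; everything else becomes NaN
       .stack()             # stack drops NaN by default
       .reset_index()
       .drop(columns=0)     # drop the all-1s value column
       .rename(columns={'level_1': 'Col2'}))
print(out)
```

Rows 1003 and 1000 (all zeros) disappear automatically, because stack drops the NaN entries their masked rows consist of.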
Answer #2:
Several great answers for the OP's post. However, get_dummies is often used
with multiple categorical features, and pandas uses a prefix separator prefix_sep
to distinguish the different values of each column.
The following function collapses a “dummified” dataframe while keeping the order of columns:
def undummify(df, prefix_sep="_"):
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            undummified = (
                df.filter(like=col)
                .idxmax(axis=1)
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df
Example
>>> df
a b c
0 A_1 B_1 C_1
1 A_2 B_2 C_2
>>> df2 = pd.get_dummies(df)
>>> df2
a_A_1 a_A_2 b_B_1 b_B_2 c_C_1 c_C_2
0 1 0 1 0 1 0
1 0 1 0 1 0 1
>>> df3 = undummify(df2)
>>> df3
a b c
0 A_1 B_1 C_1
1 A_2 B_2 C_2
Answer #3:
Pretty one-liner 🙂
new_df = df.idxmax(axis=1)
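One caveat worth knowing (a minimal sketch on hypothetical two-column data, not the OP's frame): idxmax reports only the first column holding the row maximum, so multi-hit rows lose information, and all-zero rows still return a column label.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1002, 1003],
                   '1': [0, 0],
                   '2': [1, 0]})

# idxmax reports only the FIRST column holding the row maximum,
# so an all-zero row (1003) still returns the first column label.
first_hit = df.set_index('ID').idxmax(axis=1)
print(first_hit)
```

So this one-liner fits best when every row is guaranteed to contain exactly one 1.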
Answer #4:
np.argwhere
v = np.argwhere(df.drop(columns='ID').values).T
pd.DataFrame({'ID': df.loc[v[0], 'ID'], 'Col2': df.columns[1:][v[1]]})
Col2 ID
0 2 1002
0 4 1002
2 1 1004
2 2 1004
3 5 1005
4 6 1006
5 1 1007
5 3 1007
7 3 1009
7 7 1009
argwhere gets the (i, j) indices of all non-zero elements in your DataFrame. Use the first column of indices to index into the ID column, and the second column of indices to index into df.columns. I transpose v before step 2 for cache efficiency and less typing.
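A tiny standalone demo of what argwhere returns (toy array, not the OP's data):

```python
import numpy as np

a = np.array([[0, 1, 0],
              [1, 0, 1]])

# argwhere returns one (row, col) pair per non-zero element, in row-major order
idx = np.argwhere(a)
print(idx)

# Transposing gives two flat index arrays, as in the answer above:
i, j = idx.T
```

Here i selects the rows and j selects the columns of the non-zero cells, which is exactly what the answer feeds into df.loc and df.columns.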
Answer #5:
Use:
df = (df.melt('ID', var_name='Col2')
        .query('value == 1')
        .sort_values(['ID', 'Col2'])
        .drop(columns='value'))
Alternative solution:
df = (df.set_index('ID')
        .mask(lambda x: x == 0)
        .stack()
        .reset_index()
        .rename(columns={'level_1': 'Col2'})
        .drop(columns=0))
print (df)
ID Col2
8 1002 2
24 1002 4
2 1004 1
10 1004 2
35 1005 5
44 1006 6
5 1007 1
21 1007 3
23 1009 3
55 1009 7
Explanation:
- sort_values for the first solution
- create columns from the MultiIndex by reset_index
- finally, remove the unnecessary column by drop
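The melt pipeline above can be sketched end to end on a small reconstructed frame (two IDs and two columns, for brevity):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1002, 1004],
                   '1': [0, 1],
                   '2': [1, 1]})

out = (df.melt('ID', var_name='Col2')   # wide -> long: one row per (ID, column)
         .query('value == 1')           # keep only the cells that held a 1
         .sort_values(['ID', 'Col2'])   # restore ID order
         .drop(columns='value'))        # the value column is now all 1s
print(out)
```

melt produces one row per (ID, column) cell, so filtering on value == 1 leaves exactly the pairs the OP asked for.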
Answer #6:
https://stackoverflow.com/a/55757342/2384397
rewriting here:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# dat is assumed to be a DataFrame with a "classification" column
le = LabelEncoder()
dat["labels"] = le.fit_transform(dat["classification"])
Y = pd.get_dummies(dat["labels"])

tru = []
for i in range(len(Y)):
    tru.append(np.argmax(Y.iloc[i]))
tru = le.inverse_transform(tru)

# Identity check!
(tru == dat["classification"]).value_counts()
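The same round trip can be sketched without scikit-learn, using pandas alone (a minimal sketch with made-up labels; get_dummies followed by idxmax inverts the encoding whenever each row holds exactly one 1):

```python
import pandas as pd

classification = pd.Series(['cat', 'dog', 'cat', 'bird'], name='classification')

Y = pd.get_dummies(classification)   # one column per class, exactly one 1 per row
recovered = Y.idxmax(axis=1)         # the column holding the 1 is the original label

assert (recovered == classification).all()
```

This mirrors the LabelEncoder/argmax loop above, with idxmax doing both the argmax and the index-to-label lookup in one step.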