Solving problem is about exposing yourself to as many situations as possible like Pandas – How to flatten a hierarchical index in columns and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about Pandas – How to flatten a hierarchical index in columns, which can be followed any time. Take easy to follow this discuss.
I have a data frame with a hierarchical index in axis 1 (columns) (from a groupby.agg
operation):
USAF WBAN year month day s_PC s_CL s_CD s_CNT tempf
sum sum sum sum amax amin
0 702730 26451 1993 1 1 1 0 12 13 30.92 24.98
1 702730 26451 1993 1 2 0 0 13 13 32.00 24.98
2 702730 26451 1993 1 3 1 10 2 13 23.00 6.98
3 702730 26451 1993 1 4 1 0 12 13 10.04 3.92
4 702730 26451 1993 1 5 3 0 10 13 19.94 10.94
I want to flatten it, so that it looks like this (names aren’t critical – I could rename):
USAF WBAN year month day s_PC s_CL s_CD s_CNT tempf_amax tmpf_amin
0 702730 26451 1993 1 1 1 0 12 13 30.92 24.98
1 702730 26451 1993 1 2 0 0 13 13 32.00 24.98
2 702730 26451 1993 1 3 1 10 2 13 23.00 6.98
3 702730 26451 1993 1 4 1 0 12 13 10.04 3.92
4 702730 26451 1993 1 5 3 0 10 13 19.94 10.94
How do I do this? (I’ve tried a lot, to no avail.)
Per a suggestion, here is the head in dict form
{('USAF', ''): {0: '702730',
1: '702730',
2: '702730',
3: '702730',
4: '702730'},
('WBAN', ''): {0: '26451', 1: '26451', 2: '26451', 3: '26451', 4: '26451'},
('day', ''): {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
('month', ''): {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
('s_CD', 'sum'): {0: 12.0, 1: 13.0, 2: 2.0, 3: 12.0, 4: 10.0},
('s_CL', 'sum'): {0: 0.0, 1: 0.0, 2: 10.0, 3: 0.0, 4: 0.0},
('s_CNT', 'sum'): {0: 13.0, 1: 13.0, 2: 13.0, 3: 13.0, 4: 13.0},
('s_PC', 'sum'): {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 3.0},
('tempf', 'amax'): {0: 30.920000000000002,
1: 32.0,
2: 23.0,
3: 10.039999999999999,
4: 19.939999999999998},
('tempf', 'amin'): {0: 24.98,
1: 24.98,
2: 6.9799999999999969,
3: 3.9199999999999982,
4: 10.940000000000001},
('year', ''): {0: 1993, 1: 1993, 2: 1993, 3: 1993, 4: 1993}}
Answer #1:
I think the easiest way to do this would be to set the columns to the top level:
df.columns = df.columns.get_level_values(0)
Note: if the to level has a name you can also access it by this, rather than 0.
.
If you want to combine/join
your MultiIndex into one Index (assuming you have just string entries in your columns) you could:
df.columns = [' '.join(col).strip() for col in df.columns.values]
Note: we must strip
the whitespace for when there is no second index.
In [11]: [' '.join(col).strip() for col in df.columns.values]
Out[11]:
['USAF',
'WBAN',
'day',
'month',
's_CD sum',
's_CL sum',
's_CNT sum',
's_PC sum',
'tempf amax',
'tempf amin',
'year']
Answer #2:
pd.DataFrame(df.to_records()) # multiindex become columns and new index is integers only
Answer #3:
All of the current answers on this thread must have been a bit dated. As of pandas
version 0.24.0, the .to_flat_index()
does what you need.
From panda’s own documentation:
MultiIndex.to_flat_index()
Convert a MultiIndex to an Index of Tuples containing the level values.
A simple example from its documentation:
import pandas as pd
print(pd.__version__) # '0.23.4'
index = pd.MultiIndex.from_product(
[['foo', 'bar'], ['baz', 'qux']],
names=['a', 'b'])
print(index)
# MultiIndex(levels=[['bar', 'foo'], ['baz', 'qux']],
# codes=[[1, 1, 0, 0], [0, 1, 0, 1]],
# names=['a', 'b'])
Applying to_flat_index()
:
index.to_flat_index()
# Index([('foo', 'baz'), ('foo', 'qux'), ('bar', 'baz'), ('bar', 'qux')], dtype='object')
Using it to replace existing pandas
column
An example of how you’d use it on dat
, which is a DataFrame with a MultiIndex
column:
dat = df.loc[:,['name','workshop_period','class_size']].groupby(['name','workshop_period']).describe()
print(dat.columns)
# MultiIndex(levels=[['class_size'], ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']],
# codes=[[0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6, 7]])
dat.columns = dat.columns.to_flat_index()
print(dat.columns)
# Index([('class_size', 'count'), ('class_size', 'mean'),
# ('class_size', 'std'), ('class_size', 'min'),
# ('class_size', '25%'), ('class_size', '50%'),
# ('class_size', '75%'), ('class_size', 'max')],
# dtype='object')
Answer #4:
Andy Hayden’s answer is certainly the easiest way — if you want to avoid duplicate column labels you need to tweak a bit
In [34]: df
Out[34]:
USAF WBAN day month s_CD s_CL s_CNT s_PC tempf year
sum sum sum sum amax amin
0 702730 26451 1 1 12 0 13 1 30.92 24.98 1993
1 702730 26451 2 1 13 0 13 0 32.00 24.98 1993
2 702730 26451 3 1 2 10 13 1 23.00 6.98 1993
3 702730 26451 4 1 12 0 13 1 10.04 3.92 1993
4 702730 26451 5 1 10 0 13 3 19.94 10.94 1993
In [35]: mi = df.columns
In [36]: mi
Out[36]:
MultiIndex
[(USAF, ), (WBAN, ), (day, ), (month, ), (s_CD, sum), (s_CL, sum), (s_CNT, sum), (s_PC, sum), (tempf, amax), (tempf, amin), (year, )]
In [37]: mi.tolist()
Out[37]:
[('USAF', ''),
('WBAN', ''),
('day', ''),
('month', ''),
('s_CD', 'sum'),
('s_CL', 'sum'),
('s_CNT', 'sum'),
('s_PC', 'sum'),
('tempf', 'amax'),
('tempf', 'amin'),
('year', '')]
In [38]: ind = pd.Index([e[0] + e[1] for e in mi.tolist()])
In [39]: ind
Out[39]: Index([USAF, WBAN, day, month, s_CDsum, s_CLsum, s_CNTsum, s_PCsum, tempfamax, tempfamin, year], dtype=object)
In [40]: df.columns = ind
In [46]: df
Out[46]:
USAF WBAN day month s_CDsum s_CLsum s_CNTsum s_PCsum tempfamax tempfamin
0 702730 26451 1 1 12 0 13 1 30.92 24.98
1 702730 26451 2 1 13 0 13 0 32.00 24.98
2 702730 26451 3 1 2 10 13 1 23.00 6.98
3 702730 26451 4 1 12 0 13 1 10.04 3.92
4 702730 26451 5 1 10 0 13 3 19.94 10.94
year
0 1993
1 1993
2 1993
3 1993
4 1993
Answer #5:
df.columns = ['_'.join(tup).rstrip('_') for tup in df.columns.values]
Answer #6:
And if you want to retain any of the aggregation info from the second level of the multiindex you can try this:
In [1]: new_cols = [''.join(t) for t in df.columns]
Out[1]:
['USAF',
'WBAN',
'day',
'month',
's_CDsum',
's_CLsum',
's_CNTsum',
's_PCsum',
'tempfamax',
'tempfamin',
'year']
In [2]: df.columns = new_cols
Answer #7:
The most pythonic way to do this to use map
function.
df.columns = df.columns.map(' '.join).str.strip()
Output print(df.columns)
:
Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
's_PC sum', 'tempf amax', 'tempf amin', 'year'],
dtype='object')
Update using Python 3.6+ with f string:
df.columns = [f'{f} {s}' if s != '' else f'{f}'
for f, s in df.columns]
print(df.columns)
Output:
Index(['USAF', 'WBAN', 'day', 'month', 's_CD sum', 's_CL sum', 's_CNT sum',
's_PC sum', 'tempf amax', 'tempf amin', 'year'],
dtype='object')
Answer #8:
The easiest and most intuitive solution for me was to combine the column names using get_level_values. This prevents duplicate column names when you do more than one aggregation on the same column:
level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
df.columns = level_one + level_two
If you want a separator between columns, you can do this. This will return the same thing as Seiji Armstrong’s comment on the accepted answer that only includes underscores for columns with values in both index levels:
level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
column_separator = ['_' if x != '' else '' for x in level_two]
df.columns = level_one + column_separator + level_two
I know this does the same thing as Andy Hayden’s great answer above, but I think it is a bit more intuitive this way and is easier to remember (so I don’t have to keep referring to this thread), especially for novice pandas users.
This method is also more extensible in the case where you may have 3 column levels.
level_one = df.columns.get_level_values(0).astype(str)
level_two = df.columns.get_level_values(1).astype(str)
level_three = df.columns.get_level_values(2).astype(str)
df.columns = level_one + level_two + level_three