Solving problem is about exposing yourself to as many situations as possible like pandas convert some columns into rows and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about pandas convert some columns into rows, which can be followed any time. Take easy to follow this discuss.
So my dataset has some information by location for n dates. The problem is each date is actually a different column header. For example the CSV looks like
location name Jan-2010 Feb-2010 March-2010
A "test" 12 20 30
B "foo" 18 20 25
What I would like is for it to look like
location name Date Value
A "test" Jan-2010 12
A "test" Feb-2010 20
A "test" March-2010 30
B "foo" Jan-2010 18
B "foo" Feb-2010 20
B "foo" March-2010 25
problem is I don’t know how many dates are in the column (though I know they will always start after name)
Answer #1:
UPDATE
From v0.20, melt
is a first order function, you can now use
df.melt(id_vars=["location", "name"],
var_name="Date",
value_name="Value")
location name Date Value
0 A "test" Jan-2010 12
1 B "foo" Jan-2010 18
2 A "test" Feb-2010 20
3 B "foo" Feb-2010 20
4 A "test" March-2010 30
5 B "foo" March-2010 25
OLD(ER) VERSIONS: <0.20
You can use pd.melt
to get most of the way there, and then sort:
>>> df
location name Jan-2010 Feb-2010 March-2010
0 A test 12 20 30
1 B foo 18 20 25
>>> df2 = pd.melt(df, id_vars=["location", "name"],
var_name="Date", value_name="Value")
>>> df2
location name Date Value
0 A test Jan-2010 12
1 B foo Jan-2010 18
2 A test Feb-2010 20
3 B foo Feb-2010 20
4 A test March-2010 30
5 B foo March-2010 25
>>> df2 = df2.sort(["location", "name"])
>>> df2
location name Date Value
0 A test Jan-2010 12
2 A test Feb-2010 20
4 A test March-2010 30
1 B foo Jan-2010 18
3 B foo Feb-2010 20
5 B foo March-2010 25
(Might want to throw in a .reset_index(drop=True)
, just to keep the output clean.)
Note: pd.DataFrame.sort
has been deprecated in favour of pd.DataFrame.sort_values
.
Answer #2:
Use set_index
with stack
for MultiIndex Series
, then for DataFrame
add reset_index
with rename
:
df1 = (df.set_index(["location", "name"])
.stack()
.reset_index(name='Value')
.rename(columns={'level_2':'Date'}))
print (df1)
location name Date Value
0 A test Jan-2010 12
1 A test Feb-2010 20
2 A test March-2010 30
3 B foo Jan-2010 18
4 B foo Feb-2010 20
5 B foo March-2010 25
Answer #3:
pd.wide_to_long
You can add a prefix to your year columns and then feed directly to pd.wide_to_long
. I won’t pretend this is efficient, but it may in certain situations be more convenient than pd.melt
, e.g. when your columns already have an appropriate prefix.
df.columns = np.hstack((df.columns[:2], df.columns[2:].map(lambda x: f'Value{x}')))
res = pd.wide_to_long(df, stubnames=['Value'], i='name', j='Date').reset_index()
.sort_values(['location', 'name'])
print(res)
name Date location Value
0 test Jan-2010 A 12
2 test Feb-2010 A 20
4 test March-2010 A 30
1 foo Jan-2010 B 18
3 foo Feb-2010 B 20
5 foo March-2010 B 25
Answer #4:
I guess I found a simpler solution
temp1 = pd.melt(df1, id_vars=["location"], var_name='Date', value_name='Value')
temp2 = pd.melt(df1, id_vars=["name"], var_name='Date', value_name='Value')
Concat whole temp1
with temp2
‘s column name
temp1['new_column'] = temp2['name']
You now have what you asked for.