How to add a new column to an existing DataFrame?

Posted on

Problem :

I have the following indexed DataFrame with named columns and rows not- continuous numbers:

          a         b         c         d
2  0.671399  0.101208 -0.181532  0.241273
3  0.446172 -0.243316  0.051767  1.577318
5  0.614758  0.075793 -0.451460 -0.012493

I would like to add a new column, 'e', to the existing data frame and do not want to change anything in the data frame (i.e., the new column always has the same length as the DataFrame).

0   -0.335485
1   -1.166658
2   -0.385571
dtype: float64

How can I add column e to the above example?

Solution :

This is the simple way of adding a new column: df['e'] = e

Super simple column assignment

A pandas dataframe is implemented as an ordered dict of columns.

This means that the __getitem__ [] can not only be used to get a certain column, but __setitem__ [] = can be used to assign a new column.

For example, this dataframe can have a column added to it by simply using the [] accessor

    size      name color
0    big      rose   red
1  small    violet  blue
2  small     tulip   red
3  small  harebell  blue

df['protected'] = ['no', 'no', 'no', 'yes']

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note that this works even if the index of the dataframe is off.

df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

[]= is the way to go, but watch out!

However, if you have a pd.Series and try to assign it to a dataframe where the indexes are off, you will run in to trouble. See example:

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

This is because a pd.Series by default has an index enumerated from 0 to n. And the pandas [] = method tries to be “smart”

What actually is going on.

When you use the [] = method pandas is quietly performing an outer join or outer merge using the index of the left hand dataframe and the index of the right hand series. df['column'] = series

Side note

This quickly causes cognitive dissonance, since the []= method is trying to do a lot of different things depending on the input, and the outcome cannot be predicted unless you just know how pandas works. I would therefore advice against the []= in code bases, but when exploring data in a notebook, it is fine.

Going around the problem

If you have a pd.Series and want it assigned from top to bottom, or if you are coding productive code and you are not sure of the index order, it is worth it to safeguard for this kind of issue.

You could downcast the pd.Series to a np.ndarray or a list, this will do the trick.

df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values

or

df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))

But this is not very explicit.

Some coder may come along and say “Hey, this looks redundant, I’ll just optimize this away”.

Explicit way

Setting the index of the pd.Series to be the index of the df is explicit.

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)

Or more realistically, you probably have a pd.Series already available.

protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index

3     no
2     no
1     no
0    yes

Can now be assigned

df['protected'] = protected_series

    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

Alternative way with df.reset_index()

Since the index dissonance is the problem, if you feel that the index of the dataframe should not dictate things, you can simply drop the index, this should be faster, but it is not very clean, since your function now probably does two things.

df.reset_index(drop=True)
protected_series.reset_index(drop=True)
df['protected'] = protected_series

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note on df.assign

While df.assign make it more explicit what you are doing, it actually has all the same problems as the above []=

df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

Just watch out with df.assign that your column is not called self. It will cause errors. This makes df.assign smelly, since there are these kind of artifacts in the function.

df.assign(self=pd.Series(['no', 'no', 'no', 'yes'])
TypeError: assign() got multiple values for keyword argument 'self'

You may say, “Well, I’ll just not use self then”. But who knows how this function changes in the future to support new arguments. Maybe your column name will be an argument in a new update of pandas, causing problems with upgrading.

It seems that in recent Pandas versions the way to go is to use df.assign:

df1 = df1.assign(e=np.random.randn(sLength))

It doesn’t produce SettingWithCopyWarning.

Doing this directly via NumPy will be the most efficient:

df1['e'] = np.random.randn(sLength)

Note my original (very old) suggestion was to use map (which is much slower):

df1['e'] = df1['a'].map(lambda x: np.random.random())

Easiest ways:-

data['new_col'] = list_of_values

data.loc[ : , 'new_col'] = list_of_values

This way you avoid what is called chained indexing when setting new values in a pandas object. Click here to read further.

If you want to set the whole new column to an initial base value (e.g. None), you can do this: df1['e'] = None

This actually would assign “object” type to the cell. So later you’re free to put complex data types, like list, into individual cells.

I got the dreaded SettingWithCopyWarning, and it wasn’t fixed by using the iloc syntax. My DataFrame was created by read_sql from an ODBC source. Using a suggestion by lowtech above, the following worked for me:

df.insert(len(df.columns), 'e', pd.Series(np.random.randn(sLength),  index=df.index))

This worked fine to insert the column at the end. I don’t know if it is the most efficient, but I don’t like warning messages. I think there is a better solution, but I can’t find it, and I think it depends on some aspect of the index.
Note. That this only works once and will give an error message if trying to overwrite and existing column.
Note As above and from 0.16.0 assign is the best solution. See documentation http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign
Works well for data flow type where you don’t overwrite your intermediate values.

To create an empty column

df['i'] = None

If the column you are trying to add is a series variable then just :

df["new_columns_name"]=series_variable_name #this will do it for you

This works well even if you are replacing an existing column.just type the new_columns_name same as the column you want to replace.It will just overwrite the existing column data with the new series data.

If the data frame and Series object have the same index, pandas.concat also works here:

import pandas as pd
df
#          a            b           c           d
#0  0.671399     0.101208   -0.181532    0.241273
#1  0.446172    -0.243316    0.051767    1.577318
#2  0.614758     0.075793   -0.451460   -0.012493

e = pd.Series([-0.335485, -1.166658, -0.385571])    
e
#0   -0.335485
#1   -1.166658
#2   -0.385571
#dtype: float64

# here we need to give the series object a name which converts to the new  column name 
# in the result
df = pd.concat([df, e.rename("e")], axis=1)
df

#          a            b           c           d           e
#0  0.671399     0.101208   -0.181532    0.241273   -0.335485
#1  0.446172    -0.243316    0.051767    1.577318   -1.166658
#2  0.614758     0.075793   -0.451460   -0.012493   -0.385571

In case they don’t have the same index:

e.index = df.index
df = pd.concat([df, e.rename("e")], axis=1)

Foolproof:

df.loc[:, 'NewCol'] = 'New_Val'

Example:

df = pd.DataFrame(data=np.random.randn(20, 4), columns=['A', 'B', 'C', 'D'])

df

           A         B         C         D
0  -0.761269  0.477348  1.170614  0.752714
1   1.217250 -0.930860 -0.769324 -0.408642
2  -0.619679 -1.227659 -0.259135  1.700294
3  -0.147354  0.778707  0.479145  2.284143
4  -0.529529  0.000571  0.913779  1.395894
5   2.592400  0.637253  1.441096 -0.631468
6   0.757178  0.240012 -0.553820  1.177202
7  -0.986128 -1.313843  0.788589 -0.707836
8   0.606985 -2.232903 -1.358107 -2.855494
9  -0.692013  0.671866  1.179466 -1.180351
10 -1.093707 -0.530600  0.182926 -1.296494
11 -0.143273 -0.503199 -1.328728  0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832  0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15  0.955298 -1.430019  1.434071 -0.088215
16 -0.227946  0.047462  0.373573 -0.111675
17  1.627912  0.043611  1.743403 -0.012714
18  0.693458  0.144327  0.329500 -0.655045
19  0.104425  0.037412  0.450598 -0.923387


df.drop([3, 5, 8, 10, 18], inplace=True)

df

           A         B         C         D
0  -0.761269  0.477348  1.170614  0.752714
1   1.217250 -0.930860 -0.769324 -0.408642
2  -0.619679 -1.227659 -0.259135  1.700294
4  -0.529529  0.000571  0.913779  1.395894
6   0.757178  0.240012 -0.553820  1.177202
7  -0.986128 -1.313843  0.788589 -0.707836
9  -0.692013  0.671866  1.179466 -1.180351
11 -0.143273 -0.503199 -1.328728  0.610552
12 -0.923110 -1.365890 -1.366202 -1.185999
13 -2.026832  0.273593 -0.440426 -0.627423
14 -0.054503 -0.788866 -0.228088 -0.404783
15  0.955298 -1.430019  1.434071 -0.088215
16 -0.227946  0.047462  0.373573 -0.111675
17  1.627912  0.043611  1.743403 -0.012714
19  0.104425  0.037412  0.450598 -0.923387

df.loc[:, 'NewCol'] = 0

df
           A         B         C         D  NewCol
0  -0.761269  0.477348  1.170614  0.752714       0
1   1.217250 -0.930860 -0.769324 -0.408642       0
2  -0.619679 -1.227659 -0.259135  1.700294       0
4  -0.529529  0.000571  0.913779  1.395894       0
6   0.757178  0.240012 -0.553820  1.177202       0
7  -0.986128 -1.313843  0.788589 -0.707836       0
9  -0.692013  0.671866  1.179466 -1.180351       0
11 -0.143273 -0.503199 -1.328728  0.610552       0
12 -0.923110 -1.365890 -1.366202 -1.185999       0
13 -2.026832  0.273593 -0.440426 -0.627423       0
14 -0.054503 -0.788866 -0.228088 -0.404783       0
15  0.955298 -1.430019  1.434071 -0.088215       0
16 -0.227946  0.047462  0.373573 -0.111675       0
17  1.627912  0.043611  1.743403 -0.012714       0
19  0.104425  0.037412  0.450598 -0.923387       0

One thing to note, though, is that if you do

df1['e'] = Series(np.random.randn(sLength), index=df1.index)

this will effectively be a left join on the df1.index. So if you want to have an outer join effect, my probably imperfect solution is to create a dataframe with index values covering the universe of your data, and then use the code above. For example,

data = pd.DataFrame(index=all_possible_values)
df1['e'] = Series(np.random.randn(sLength), index=df1.index)

to insert a new column at a given location (0 <= loc <= amount of columns) in a data frame, just use Dataframe.insert:

DataFrame.insert(loc, column, value)

Therefore, if you want to add the column e at the end of a data frame called df, you can use:

e = [-0.335485, -1.166658, -0.385571]    
DataFrame.insert(loc=len(df.columns), column='e', value=e)

value can be a Series, an integer (in which case all cells get filled with this one value), or an array-like structure

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html

Let me just add that, just like for hum3, .loc didn’t solve the SettingWithCopyWarning and I had to resort to df.insert(). In my case false positive was generated by “fake” chain indexing dict['a']['e'], where 'e' is the new column, and dict['a'] is a DataFrame coming from dictionary.

Also note that if you know what you are doing, you can switch of the warning using
pd.options.mode.chained_assignment = None
and than use one of the other solutions given here.

Before assigning a new column, if you have indexed data, you need to sort the index. At least in my case I had to:

data.set_index(['index_column'], inplace=True)
"if index is unsorted, assignment of a new column will fail"        
data.sort_index(inplace = True)
data.loc['index_value1', 'column_y'] = np.random.randn(data.loc['index_value1', 'column_x'].shape[0])

To add a new column, ‘e’, to the existing data frame

 df1.loc[:,'e'] = Series(np.random.randn(sLength))

I was looking for a general way of adding a column of numpy.nans to a dataframe without getting the dumb SettingWithCopyWarning.

From the following:

  • the answers here
  • this question about passing a variable as a keyword argument
  • this method for generating a numpy array of NaNs in-line

I came up with this:

col = 'column_name'
df = df.assign(**{col:numpy.full(len(df), numpy.nan)})

For the sake of completeness – yet another solution using DataFrame.eval() method:

Data:

In [44]: e
Out[44]:
0    1.225506
1   -1.033944
2   -0.498953
3   -0.373332
4    0.615030
5   -0.622436
dtype: float64

In [45]: df1
Out[45]:
          a         b         c         d
0 -0.634222 -0.103264  0.745069  0.801288
4  0.782387 -0.090279  0.757662 -0.602408
5 -0.117456  2.124496  1.057301  0.765466
7  0.767532  0.104304 -0.586850  1.051297
8 -0.103272  0.958334  1.163092  1.182315
9 -0.616254  0.296678 -0.112027  0.679112

Solution:

In [46]: df1.eval("e = @e.values", inplace=True)

In [47]: df1
Out[47]:
          a         b         c         d         e
0 -0.634222 -0.103264  0.745069  0.801288  1.225506
4  0.782387 -0.090279  0.757662 -0.602408 -1.033944
5 -0.117456  2.124496  1.057301  0.765466 -0.498953
7  0.767532  0.104304 -0.586850  1.051297 -0.373332
8 -0.103272  0.958334  1.163092  1.182315  0.615030
9 -0.616254  0.296678 -0.112027  0.679112 -0.622436

If you just need to create a new empty column then the shortest solution is:

df.loc[:, 'e'] = pd.Series()

The following is what I did… But I’m pretty new to pandas and really Python in general, so no promises.

df = pd.DataFrame([[1, 2], [3, 4], [5,6]], columns=list('AB'))

newCol = [3,5,7]
newName = 'C'

values = np.insert(df.values,df.shape[1],newCol,axis=1)
header = df.columns.values.tolist()
header.append(newName)

df = pd.DataFrame(values,columns=header)

If we want to assign a scaler value eg: 10 to all rows of a new column in a df:

df = df.assign(new_col=lambda x:10)  # x is each row passed in to the lambda func

df will now have new column ‘new_col’ with value=10 in all rows.

If you get the SettingWithCopyWarning, an easy fix is to copy the DataFrame you are trying to add a column to.

df = df.copy()
df['col_name'] = values

x=pd.DataFrame([1,2,3,4,5])

y=pd.DataFrame([5,4,3,2,1])

z=pd.concat([x,y],axis=1)

enter image description here

this is a special case of adding a new column to a pandas dataframe. Here, I am adding a new feature/column based on an existing column data of the dataframe.

so, let our dataFrame has columns ‘feature_1’, ‘feature_2’, ‘probability_score’ and we have to add a new_column ‘predicted_class’ based on data in column ‘probability_score’.

I will use map() function from python and also define a function of my own which will implement the logic on how to give a particular class_label to every row in my dataFrame.

data = pd.read_csv('data.csv')

def myFunction(x):
   //implement your logic here

   if so and so:
        return a
   return b

variable_1 = data['probability_score']
predicted_class = variable_1.map(myFunction)

data['predicted_class'] = predicted_class

// check dataFrame, new column is included based on an existing column data for each row
data.head()

import pandas as pd

# Define a dictionary containing data
data = {'a': [0,0,0.671399,0.446172,0,0.614758],
    'b': [0,0,0.101208,-0.243316,0,0.075793],
    'c': [0,0,-0.181532,0.051767,0,-0.451460],
    'd': [0,0,0.241273,1.577318,0,-0.012493]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)

# Declare a list that is to be converted into a column
col_e = [-0.335485,-1.166658,-0.385571,0,0,0]


df['e'] = col_e

# add column 'e'
df['e'] = col_e

# Observe the result
df

coding

Whenever you add a Series object as new column to an existing DF, you need to make sure that they both have the same index.
Then add it to the DF

e_series = pd.Series([-0.335485, -1.166658,-0.385571])
print(e_series)
e_series.index = d_f.index
d_f['e'] = e_series
d_f

enter image description here

Leave a Reply

Your email address will not be published. Required fields are marked *