Create pandas Dataframe by appending one row at a time

Posted on

Solving problem is about exposing yourself to as many situations as possible like Create pandas Dataframe by appending one row at a time and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about Create pandas Dataframe by appending one row at a time, which can be followed any time. Take easy to follow this discuss.

Create pandas Dataframe by appending one row at a time

I understand that pandas is designed to load fully populated DataFrame but I need to create an empty DataFrame then add rows, one by one.
What is the best way to do this ?

I successfully created an empty DataFrame with :

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row and fill a field with :

res = res.set_value(len(res), 'qty1', 10.0)

It works but seems very odd :-/ (it fails for adding string value)

How can I add a new row to my DataFrame (with different columns type) ?

Asked By: PhE

||

Answer #1:

You can use df.loc[i], where the row with index i will be what you specify it to be in the dataframe.

>>> import pandas as pd
>>> from numpy.random import randint
>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
>>>     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
>>> df
     lib qty1 qty2
0  name0    3    3
1  name1    2    4
2  name2    2    8
3  name3    2    1
4  name4    9    6
Answered By: fred

Answer #2:

In case you can get all data for the data frame upfront, there is a much faster approach than appending to a data frame:

  1. Create a list of dictionaries in which each dictionary corresponds to an input data row.
  2. Create a data frame from this list.

I had a similar task for which appending to a data frame row by row took 30 min, and creating a data frame from a list of dictionaries completed within seconds.

rows_list = []
for row in input_rows:
        dict1 = {}
        # get input row in dictionary format
        # key = col_name
        dict1.update(blah..)
        rows_list.append(dict1)
df = pd.DataFrame(rows_list)
Answered By: ShikharDua

Answer #3:

You could use pandas.concat() or DataFrame.append(). For details and examples, see Merge, join, and concatenate.

Answered By: NPE

Answer #4:

It’s been a long time, but I faced the same problem too. And found here a lot of interesting answers. So I was confused what method to use.

In the case of adding a lot of rows to dataframe I interested in speed performance. So I tried 4 most popular methods and checked their speed.

UPDATED IN 2019 using new versions of packages.
Also updated after @FooBar comment

SPEED PERFORMANCE

  1. Using .append (NPE’s answer)
  2. Using .loc (fred’s answer)
  3. Using .loc with preallocating (FooBar’s answer)
  4. Using dict and create DataFrame in the end (ShikharDua’s answer)

Results (in secs):

|------------|-------------|-------------|-------------|
|  Approach  |  1000 rows  |  5000 rows  | 10 000 rows |
|------------|-------------|-------------|-------------|
| .append    |    0.69     |    3.39     |    6.78     |
|------------|-------------|-------------|-------------|
| .loc w/o   |    0.74     |    3.90     |    8.35     |
| prealloc   |             |             |             |
|------------|-------------|-------------|-------------|
| .loc with  |    0.24     |    2.58     |    8.70     |
| prealloc   |             |             |             |
|------------|-------------|-------------|-------------|
|  dict      |    0.012    |   0.046     |   0.084     |
|------------|-------------|-------------|-------------|

Also thanks to @krassowski for useful comment – I updated the code.

So I use addition through the dictionary for myself.


Code:

import pandas as pd
import numpy as np
import time
del df1, df2, df3, df4
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
    df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)
# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
    df2.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)
# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
    df3.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)
# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
    row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
    dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)
df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)

P.S. I believe, my realization isn’t perfect, and maybe there is some optimization.

Answered By: Mikhail_Sam

Answer #5:

If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):

import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )
# now fill it up row by row
for x in np.arange(0, numberOfRows):
    #loc or iloc both work here since the index is natural numbers
    df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]:
   lib  qty1  qty2
0   -1    -1    -1
1    0     0     0
2   -1     0    -1
3    0    -1     0
4   -1     0     0

Speed comparison

In[30]: %timeit tryThis() # function wrapper for this answer
In[31]: %timeit tryOther() # function wrapper without index (see, for example, @fred)
1000 loops, best of 3: 1.23 ms per loop
100 loops, best of 3: 2.31 ms per loop

And – as from the comments – with a size of 6000, the speed difference becomes even larger:

Increasing the size of the array (12) and the number of rows (500) makes
the speed difference more striking: 313ms vs 2.29s

Answered By: FooBar

Answer #6:

NEVER grow a DataFrame!

Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?

Here are the most important reasons, taken from my post here.

  1. It is always cheaper/faster to append to a list and create a DataFrame in one go.
  2. Lists take up less memory and are a much lighter data structure to work with, append, and remove.
  3. dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object, which is bad.
  4. An index is automatically created for you, instead of you having to take care to assign the correct index to the row you are appending.

This is The Right Way™ to accumulate your data

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

These options are horrible

  1. append or concat inside a loop

    append and concat aren’t inherently bad in isolation. The
    problem starts when you iteratively call them inside a loop – this
    results in quadratic memory usage.

    # Creates empty DataFrame and appends
    df = pd.DataFrame(columns=['A', 'B', 'C'])
    for a, b, c in some_function_that_yields_data():
        df = df.append({'A': i, 'B': b, 'C': c}, ignore_index=True)
        # This is equally bad:
        # df = pd.concat(
        #       [df, pd.Series({'A': i, 'B': b, 'C': c})], 
        #       ignore_index=True)
    
  2. Empty DataFrame of NaNs

    Never create a DataFrame of NaNs as the columns are initialized with
    object (slow, un-vectorizable dtype).

    # Creates DataFrame of NaNs and overwrites values.
    df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
    for a, b, c in some_function_that_yields_data():
        df.loc[len(df)] = [a, b, c]
    

The Proof is in the Pudding

Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.

enter image description here

Benchmarking code for reference.


It’s posts like this that remind me why I’m a part of this community. People understand the importance of teaching folks getting the right answer with the right code, not the right answer with wrong code. Now you might argue that it is not an issue to use loc or append if you’re only adding a single row to your DataFrame. However, people often look to this question to add more than just one row – often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.

Answered By: cs95

Answer #7:

mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
    df.loc[len(df)] = row
Answered By: Lydia

Answer #8:

You can append a single row as a dictionary using the ignore_index option.

>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
  Animal Color
0    cow  blue
1  horse   red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
  Animal  Color
0    cow   blue
1  horse    red
2  mouse  black
Answered By: W.P. McNeill

Leave a Reply

Your email address will not be published.