I am trying to concatenate dataframes built from the following 2 csv files:
Both of these have the same number and names of columns. However, when I do this:
I get the error:
AssertionError: Number of manager items must equal union of block items # manager items: 20, # tot_items: 21
How can I fix this?
I believe that this error occurs if both of the following conditions are met:
- The data frames have different columns (i.e. (df1.columns == df2.columns) is not True for every column).
- The columns contain a repeated name.
Basically, if you concat dataframes with columns [B,C,D], it can work out how to make one series for each distinct column name. So if I try to join a third dataframe with columns [B,B,C], it does not know which column to append to and ends up with fewer distinct columns than it thinks it needs.
If your dataframes are such that df1.columns == df2.columns holds for every column, then it will work anyway. So you can join [B,B,C] to [B,B,C], but not to [C,B,B]; when the columns are identical it probably just uses the integer positions or something similar.
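To check whether your frames meet the two conditions described above, something like the following sketch may help. The helper name concat_risk_report and the toy frames are made up for illustration. Note that df1.columns == df2.columns compares element by element, so Index.equals is used here to get a single True/False.

```python
import pandas as pd

def concat_risk_report(frames):
    """Hypothetical helper: list the repeated column names of each frame
    and report whether all frames have identical columns in the same order."""
    duplicates = [
        list(df.columns[df.columns.duplicated()].unique()) for df in frames
    ]
    same_columns = all(df.columns.equals(frames[0].columns) for df in frames[1:])
    return duplicates, same_columns

# Toy frames mirroring the [B,B,C] vs [C,B,B] case from the answer.
df1 = pd.DataFrame([[1, 2, 3]], columns=["B", "B", "C"])
df2 = pd.DataFrame([[4, 5, 6]], columns=["C", "B", "B"])

duplicates, same_columns = concat_risk_report([df1, df2])
print(duplicates)    # repeated names found in each frame
print(same_columns)  # whether the column indexes match exactly
```

If a frame reports repeated names and same_columns is False, that is the combination this answer blames for the error.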
You can get around this issue with a ‘manual’ concatenation. In this case your list is
list_of_dfs = [df_a, df_b]
And instead of running
giant_concat_df = pd.concat(list_of_dfs, axis=0)
you can turn all of the dataframes into lists of dictionaries and then make a new data frame from these lists (merged with chain):
from itertools import chain
list_of_dicts = [cur_df.T.to_dict().values() for cur_df in list_of_dfs]
giant_concat_df = pd.DataFrame(list(chain(*list_of_dicts)))
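Here is a small self-contained sketch of that dictionary-based workaround. The frames df_a and df_b are toy data I made up, and I swapped .T.to_dict().values() for to_dict("records"), which also yields one dictionary per row. Be aware that a dictionary can hold each key only once, so if a frame has repeated column names, only one of the same-named columns survives this round trip.

```python
from itertools import chain

import pandas as pd

# Toy frames with overlapping but different columns (illustrative data only).
df_a = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df_b = pd.DataFrame({"B": [5, 6], "C": [7, 8]})
list_of_dfs = [df_a, df_b]

# One dictionary per row, e.g. {"A": 1, "B": 3}.
list_of_dicts = [cur_df.to_dict("records") for cur_df in list_of_dfs]

# Flatten the per-frame lists into one list of rows and rebuild a frame;
# columns missing from a row are filled with NaN.
giant_concat_df = pd.DataFrame(list(chain.from_iterable(list_of_dicts)))
print(giant_concat_df)
```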
The answers here did not solve my issue, but this answer did.
The Issue was duplicated columns in one or both DataFrames.
Here’s a duplicated column fix (as per the answer above):
df = df.loc[:,~df.columns.duplicated()]
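As a rough end-to-end sketch of that fix: drop the repeated columns from every frame first, then concat. The helper name drop_duplicate_columns and the sample frames are my own, not from the question. The mask keeps the first occurrence of each name, so check that the column being dropped is really the one you can live without.

```python
import pandas as pd

# Sample frames that each contain two columns named "id" (illustrative data).
df1 = pd.DataFrame([["a", "b", "id", 1], ["a", "b", "id", 2]],
                   columns=["A", "B", "id", "id"])
df2 = pd.DataFrame([["b", "c", "id", 1], ["b", "c", "id", 2]],
                   columns=["B", "C", "id", "id"])

def drop_duplicate_columns(df):
    """Keep only the first column for each repeated column name."""
    return df.loc[:, ~df.columns.duplicated()]

cleaned = [drop_duplicate_columns(df) for df in (df1, df2)]

# With unique column names in every frame, concat can align them by name.
result = pd.concat(cleaned, ignore_index=True)
print(result)
```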
Unfortunately, the source files are no longer available, so I can’t check my solution against your case. In my case the error occurred when:
- The data frames have two columns with the same name (I had id columns whose names differed only in case, which I then converted to lower case, so they became the same)
- The value types of the same-named columns are different
Here is an example which gives me the error in question:
df1 = pd.DataFrame(
    data=[['a', 'b', 'id', 1], ['a', 'b', 'id', 2]],
    columns=['A', 'B', 'id', 'id'])
df2 = pd.DataFrame(
    data=[['b', 'c', 'id', 1], ['b', 'c', 'id', 2]],
    columns=['B', 'C', 'id', 'id'])
pd.concat([df1, df2])

AssertionError: Number of manager items must equal union of block items # manager items: 4, # tot_items: 5
Removing / renaming one of the columns makes this code work.
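If you need to keep both columns, renaming may be the better option. Below is a minimal sketch of one way to do it, assuming a numeric suffix is acceptable. The helper make_unique is hypothetical (not a pandas function), and it does not guard against a suffixed name such as id_1 already existing in the frame.

```python
from collections import Counter

import pandas as pd

def make_unique(names):
    """Hypothetical helper: append _1, _2, ... to repeated names."""
    seen = Counter()
    unique = []
    for name in names:
        # First occurrence keeps its name; later ones get a numeric suffix.
        unique.append(name if seen[name] == 0 else f"{name}_{seen[name]}")
        seen[name] += 1
    return unique

df1 = pd.DataFrame([["a", "b", "id", 1], ["a", "b", "id", 2]],
                   columns=["A", "B", "id", "id"])
df2 = pd.DataFrame([["b", "c", "id", 1], ["b", "c", "id", 2]],
                   columns=["B", "C", "id", "id"])

for df in (df1, df2):
    df.columns = make_unique(df.columns)

result = pd.concat([df1, df2], ignore_index=True)
print(result)
```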