Question :
I’d like to take a dataset with a bunch of different unique individuals, each with multiple entries, and assign each individual a unique id for all of their entries. Here’s an example of the df:
  FirstName  LastName  id
0       Tom     Jones   1
1       Tom     Jones   1
2     David     Smith   1
3      Alex  Thompson   1
4      Alex  Thompson   1
So, basically I want all entries for Tom Jones to have id=1, all entries for David Smith to have id=2, all entries for Alex Thompson to have id=3, and so on.
So I already have one solution: a dead simple Python loop that tracks two values (one for the id, one for the index) and assigns each row an id based on whether the individual matches the previous one:
x = 1  # current id
i = 1  # row index; row 0 is assumed to start with id = 1
while i < len(df_test):
    if ((df_test.LastName[i] == df_test.LastName[i-1]) &
            (df_test.FirstName[i] == df_test.FirstName[i-1])):
        df_test.loc[i, 'id'] = x
        i = i + 1
    else:
        x = x + 1
        df_test.loc[i, 'id'] = x
        i = i + 1
The problem I’m running into is that the dataframe has about 9 million entries, so that loop would take a huge amount of time to run. Can anyone think of a more efficient way to do this? I’ve been looking at groupby and multi-indexing as potential solutions, but haven’t quite found the right one yet. Thanks!
Answer #1:
You could join the last name and first name, convert the result to a category, and then get the category codes. Of course, multiple people with the same full name would end up with the same id.
df = df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
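Since cat.codes is zero-based, the ids start at 0. If you want them to start at 1, as in the question, a minor tweak (not part of the original one-liner) is to add 1 to the codes:
df['id'] = (df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes + 1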
Answer #2:
This approach uses .groupby() and .ngroup() (new in pandas 0.20.2) to create the id column:
df['id'] = df.groupby(['LastName','FirstName']).ngroup()
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
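Note that groupby() sorts the group keys by default, which happens to reproduce the order of appearance in this example. If you want ids assigned strictly in order of first appearance and starting at 1, as the question asks, a small variation (not part of the original answer) is to pass sort=False and add 1:
df['id'] = df.groupby(['LastName', 'FirstName'], sort=False).ngroup() + 1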
I checked timings and, for the small dataset in this example, Alexander’s answer (Answer #1 above) is faster:
%timeit df.assign(id=(df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes)
1000 loops, best of 3: 848 µs per loop
%timeit df.assign(id=df.groupby(['LastName','FirstName']).ngroup())
1000 loops, best of 3: 1.22 ms per loop
However, for larger dataframes, the groupby() approach appears to be faster. To create a large, representative data set, I used faker to build a dataframe of 5000 names and then concatenated the first 2000 names onto it, giving a dataframe with 7000 names, 2000 of which were duplicates.
import pandas as pd
import faker

fakenames = faker.Faker()
first = [fakenames.first_name() for _ in range(5000)]
last = [fakenames.last_name() for _ in range(5000)]
df2 = pd.DataFrame({'FirstName': first, 'LastName': last})
df2 = pd.concat([df2, df2.iloc[:2000]])  # re-append the first 2000 rows as duplicates
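One detail worth knowing: pd.concat() keeps the original index labels, so df2 ends up with duplicate index values (0–1999 appear twice). That is harmless for this timing test, but pass ignore_index=True to pd.concat() if you need a clean RangeIndex.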
Running the timing on this larger data set gives:
%timeit df2.assign(id=(df2['LastName'] + '_' + df2['FirstName']).astype('category').cat.codes)
100 loops, best of 3: 5.22 ms per loop
%timeit df2.assign(id=df2.groupby(['LastName','FirstName']).ngroup())
100 loops, best of 3: 3.1 ms per loop
You may want to test both approaches on your data set to determine which one works best given the size of your data.
Answer #3:
This method allows the ‘id’ column name to be defined with a variable. I also find it a little easier to read than the assign or groupby methods.
# Create the example dataframe
import pandas as pd

df = pd.DataFrame(
    {'FirstName': ['Tom', 'Tom', 'David', 'Alex', 'Alex'],
     'LastName': ['Jones', 'Jones', 'Smith', 'Thompson', 'Thompson'],
    })
newIdName = 'id'  # Set the new column name here.
df[newIdName] = (df['LastName'] + '_' + df['FirstName']).astype('category').cat.codes
Output:
>>> df
  FirstName  LastName  id
0       Tom     Jones   0
1       Tom     Jones   0
2     David     Smith   1
3      Alex  Thompson   2
4      Alex  Thompson   2
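For completeness, the same variable-named column also works with the groupby() approach from Answer #2:
df[newIdName] = df.groupby(['LastName', 'FirstName']).ngroup()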