### Question:

I have a pandas dataframe with a column called `my_labels` which contains the strings `'A', 'B', 'C', 'D', 'E'`. I would like to count the number of occurrences of each of these strings, then divide each count by the sum of all the counts. I'm trying to do this in Pandas like this:

```
func = lambda x: x.size() / x.sum()
data = frame.groupby('my_labels').apply(func)
```

This code throws an error, `'DataFrame' object has no attribute 'size'`. How can I apply a function to calculate this in Pandas?

## Answer #1

`apply` takes a function to apply to *each* value, not the series, and accepts kwargs. So, the values do not have a `.size()` method.

Perhaps this would work:

```
from pandas import *

d = {"my_label": Series(['A','B','A','C','D','D','E'])}
df = DataFrame(d)

def as_perc(value, total):
    return value / float(total)

def get_count(values):
    return len(values)

grouped_count = df.groupby("my_label").my_label.agg(get_count)
data = grouped_count.apply(as_perc, total=df.my_label.count())
```

The `.agg()` method here takes a function that is applied to *all* values of the groupby object.
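As a side note, the same normalized counts can be sketched more directly with the groupby's own `size()` method (this variant is my addition, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({"my_label": ['A', 'B', 'A', 'C', 'D', 'D', 'E']})

# Count rows per label, then divide by the total count to get fractions
counts = df.groupby("my_label").size()
data = counts / counts.sum()
print(data)
```

This avoids the custom `get_count`/`as_perc` helpers entirely, since `size()` already returns one count per group.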

## Answer #2

As of Pandas version 0.22, there also exists an alternative to `apply`: `pipe`, which can be considerably faster than using `apply` (you can also check this question for more differences between the two functionalities).

For your example:

```
df = pd.DataFrame({"my_label": ['A','B','A','C','D','D','E']})

  my_label
0        A
1        B
2        A
3        C
4        D
5        D
6        E
```

The `apply` version

```
df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
```

gives

```
          my_label
my_label
A         0.285714
B         0.142857
C         0.142857
D         0.285714
E         0.142857
```

and the `pipe` version

```
df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
```

yields

```
my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857
```

So the values are identical; however, the timings differ considerably (at least for this small dataframe):

```
%timeit df.groupby('my_label').apply(lambda grp: grp.count() / df.shape[0])
100 loops, best of 3: 5.52 ms per loop
```

and

```
%timeit df.groupby('my_label').pipe(lambda grp: grp.size() / grp.size().sum())
1000 loops, best of 3: 843 µs per loop
```

Wrapping it into a function is then also straightforward:

```
def get_perc(grp_obj):
    gr_size = grp_obj.size()
    return gr_size / gr_size.sum()
```

Now you can call

```
df.groupby('my_label').pipe(get_perc)
```

yielding

```
my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857
```

However, for this particular case you do not even need a `groupby`; you can just use `value_counts` like this:

```
df['my_label'].value_counts(sort=False) / df.shape[0]
```

yielding

```
A    0.285714
C    0.142857
B    0.142857
E    0.142857
D    0.285714
Name: my_label, dtype: float64
```

For this small dataframe it is quite fast:

```
%timeit df['my_label'].value_counts(sort=False) / df.shape[0]
1000 loops, best of 3: 770 µs per loop
```

As pointed out by @anmol, the last statement can also be simplified to

```
df['my_label'].value_counts(sort=False, normalize=True)
```
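As a quick sanity check (my sketch, not part of the original answer), the `normalize=True` result matches the manual division:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"my_label": ['A', 'B', 'A', 'C', 'D', 'D', 'E']})

# Manual normalization vs. the built-in `normalize` flag
manual = df['my_label'].value_counts(sort=False) / df.shape[0]
built_in = df['my_label'].value_counts(sort=False, normalize=True)

# Both yield the same relative frequencies in the same order
print(np.allclose(manual.values, built_in.values))
```

Note that with no missing values, `normalize=True` divides by the number of non-NA entries, which here equals `df.shape[0]`.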

## Answer #3

Try:

```
import pandas as pd

g = pd.DataFrame(['A','B','A','C','D','D','E'])
# Group by the contents of column 0
gg = g.groupby(0)
# Create a DataFrame with the counts of each letter
histo = gg.apply(lambda x: x.count())
# Add a new column that is the count / total number of elements
histo[1] = histo[0].astype(float) / len(g)
print(histo)
```

Output:

```
   0         1
0
A  2  0.285714
B  1  0.142857
C  1  0.142857
D  2  0.285714
E  1  0.142857
```

## Answer #4

Regarding the issue with `size`: `size` is not a method on a dataframe, it is a property. So instead of using `size()`, plain `size` should work.
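To illustrate the difference (my sketch, using sample data in place of the question's frame):

```python
import pandas as pd

df = pd.DataFrame({'my_labels': ['A', 'B', 'A', 'C', 'D', 'D', 'E']})

# `size` is a property on a DataFrame/Series: the total number of elements
total = df.size
# `size()` is a method on a GroupBy: the number of rows in each group
per_group = df.groupby('my_labels').size()

print(total)
print(per_group)
```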

Apart from that, a method like this should work:

```
def doCalculation(df):
    groupCount = df.size
    groupSum = df['my_labels'].notnull().sum()
    return groupCount / groupSum

dataFrame.groupby('my_labels').apply(doCalculation)
```

## Answer #5

I once saw a nested-function technique for computing a weighted average on S.O.; adapting that technique can solve your issue.

```
import pandas as pd

def group_weight(overall_size):
    def inner(group):
        return len(group) / float(overall_size)
    inner.__name__ = 'weight'
    return inner

d = {"my_label": pd.Series(['A','B','A','C','D','D','E'])}
df = pd.DataFrame(d)
print(df.groupby('my_label').apply(group_weight(len(df))))

my_label
A    0.285714
B    0.142857
C    0.142857
D    0.285714
E    0.142857
dtype: float64
```

Here is how to do a weighted average within groups

```
def wavg(val_col_name, wt_col_name):
    def inner(group):
        return (group[val_col_name] * group[wt_col_name]).sum() / group[wt_col_name].sum()
    inner.__name__ = 'wgt_avg'
    return inner

d = {"P": pd.Series(['A','B','A','C','D','D','E']),
     "Q": pd.Series([1,2,3,4,5,6,7]),
     "R": pd.Series([0.1,0.2,0.3,0.4,0.5,0.6,0.7])}
df = pd.DataFrame(d)
print(df.groupby('P').apply(wavg('Q','R')))

P
A    2.500000
B    2.000000
C    4.000000
D    5.545455
E    7.000000
dtype: float64
```
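For a cross-check (my sketch, not part of the original answer), the same per-group weighted average can be computed with NumPy's `average`, which accepts a `weights` argument:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "P": ['A', 'B', 'A', 'C', 'D', 'D', 'E'],
    "Q": [1, 2, 3, 4, 5, 6, 7],
    "R": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
})

# np.average with `weights` computes sum(Q*R)/sum(R) within each group
result = df.groupby('P').apply(lambda g: np.average(g['Q'], weights=g['R']))
print(result)
```

This produces the same per-group values as the nested-function `wavg` above, without the closure bookkeeping.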