Thursday, 15 January 2015

How to apply "first" and "last" functions to columns while using group by in pandas?




I have a data frame and I am grouping it by a particular column (or, in other words, by the values of a particular column). I can do it in the following way: grouped = df.groupby(['columnname']).

I imagine the result of this operation as a table in which some cells can contain sets of values instead of single values. To get a usual table (i.e. a table in which every cell contains one single value), I need to indicate which function I want to use to transform the sets of values in the cells into single values.

For example, I can replace the sets of values by their sum, or by their minimal or maximal value. I can do it in the following way: grouped.sum() or grouped.min(), and so on.

Now I want to use different functions for different columns. I figured out that I can do it in the following way: grouped.agg({'columnname1':sum, 'columnname2':min}).

However, for some reason I cannot use first. In more detail, grouped.first() works, but grouped.agg({'columnname1':first, 'columnname2':first}) does not work. As a result I get a NameError: NameError: name 'first' is not defined. So, my question is: why does this happen, and how can I resolve this problem?
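As a minimal, self-contained sketch of the per-column aggregation described above (the frame contents are made up for illustration; only the column names follow the question):

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the question.
df = pd.DataFrame({
    "columnname": ["x", "x", "y"],
    "columnname1": [1, 2, 10],
    "columnname2": [5, 3, 7],
})

grouped = df.groupby(["columnname"])

# One aggregation function per column: sum for the first, min for the second.
# The built-ins sum and min are passed here, as in the question.
result = grouped.agg({"columnname1": sum, "columnname2": min})
print(result)
```

Recent pandas releases may print a warning when built-in functions are passed this way, but the aggregated values should be the same.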

Added

Here I found the following example:

grouped['d'].agg({'result1' : np.sum, 'result2' : np.mean})

Maybe I need to use np? But in my case Python does not recognize "np". Should I import it?
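To illustrate the two points raised here, a small sketch under assumed toy data (the frame and its column names are hypothetical): a bare `first` is just a Python name that nothing defines, while `np` becomes available once numpy is imported.

```python
import numpy as np  # this import is what makes the name "np" available
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({"columnname": ["x", "x", "y"], "d": [1.0, 2.0, 10.0]})
grouped = df.groupby("columnname")

# A bare `first` is an ordinary Python name lookup; nothing defines it,
# so Python raises NameError before pandas is even called.
try:
    grouped.agg({"d": first})  # noqa: F821
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

# With numpy imported, np.sum and np.mean are real objects and can be passed.
sums = grouped["d"].agg(np.sum)
means = grouped["d"].agg(np.mean)
print(sums)
print(means)
```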

I think the issue is that there are two different first methods which share a name but act differently: one is for groupby objects and the other is for a Series/DataFrame (to do with timeseries).

To replicate the behaviour of the groupby first method over a DataFrame using agg you could use iloc[0] (which gets the first row in each group (DataFrame/Series) by index):

grouped.agg(lambda x: x.iloc[0])

For example:

In [1]: df = pd.DataFrame([[1, 2], [3, 4]])

In [2]: g = df.groupby(0)

In [3]: g.first()
Out[3]:
   1
0
1  2
3  4

In [4]: g.agg(lambda x: x.iloc[0])
Out[4]:
   1
0
1  2
3  4

Analogously you can replicate last using iloc[-1].

Note: This will work column-wise, et al:

g.agg({1: lambda x: x.iloc[0]})
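A short sketch of the column-wise form, taking the first row of one column and the last row of another. The frame and the column names `key`, `opened` and `closed` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two value columns grouped by "key".
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "opened": [1, 2, 3, 4],
    "closed": [10, 20, 30, 40],
})
g = df.groupby("key")

# First row of each group for one column, last row for the other.
result = g.agg({
    "opened": lambda x: x.iloc[0],
    "closed": lambda x: x.iloc[-1],
})
print(result)
```

Since this toy data has no missing values, the result should match what g.first() and g.last() give for the same columns.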

In older versions of pandas you could use the irow method (e.g. x.irow(0)), see previous edits.

A couple of updated notes:

This is better done using the nth groupby method, which is much faster in pandas >= 0.13:

g.nth(0)  # first
g.nth(-1)  # last

You have to take care a little, as the default behaviour for first and last ignores NaN rows... and IIRC for DataFrame groupbys it was broken pre-0.13... there's a dropna option for nth.

You can use the strings rather than built-ins (though IIRC pandas spots it's the sum builtin and applies np.sum):
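The NaN difference can be seen in a small sketch like the following, assuming toy data where the first row of one group is missing (the frame and the names `key` and `val` are hypothetical). The index of the nth result differs between pandas versions, so only the values are printed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "a" starts with a missing value.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "val": [np.nan, 1.0, 2.0, 3.0],
})
g = df.groupby("key")

first_non_null = g["val"].first()              # skips the NaN in group "a"
first_row = g["val"].agg(lambda x: x.iloc[0])  # purely positional, keeps the NaN
nth_row = g["val"].nth(0)                      # positional as well

print(first_non_null.tolist())
print(first_row.tolist())
print(nth_row.tolist())
```

So if "first" should mean "first row, whatever it holds", iloc[0] or nth(0) is the closer match; if it should mean "first non-missing value", first() is.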

grouped['d'].agg({'result1' : "sum", 'result2' : "mean"})
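Applied to the original question, the string form might look like the sketch below (toy data is made up; the column names follow the question). As far as I know, newer pandas releases reject a renaming dict on a single column, so the second part uses keyword-style named aggregation instead, which assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the question.
df = pd.DataFrame({
    "columnname": ["x", "x", "y", "y"],
    "columnname1": [1, 2, 3, 4],
    "columnname2": [10, 20, 30, 40],
})
grouped = df.groupby("columnname")

# Strings are resolved by pandas as groupby method names, so no Python
# name called `first` needs to exist.
result = grouped.agg({"columnname1": "first", "columnname2": "last"})
print(result)

# Named aggregation: keyword = output column name, value = function name.
# Assumes pandas >= 0.25.
renamed = grouped["columnname1"].agg(result1="sum", result2="mean")
print(renamed)
```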

