Grouping, Splitting and Pivoting Data

In this notebook we'll go over ways to group, split and pivot data using the DataFrame#group_by and and DataFrame#pivot_table functions.

In [1]:
require 'daru'
Out[1]:
true

Lets first create a hierarchically indexed DataFrame.

In [2]:
# Create a multi-indexed DataFrame

tuples = [
  [:a,:one,:bar],
  [:a,:one,:baz],
  [:a,:two,:bar],
  [:a,:two,:baz],
  [:b,:one,:bar],
  [:b,:two,:bar],
  [:b,:two,:baz],
  [:b,:one,:foo],
  [:c,:one,:bar],
  [:c,:one,:baz],
  [:c,:two,:foo],
  [:c,:two,:bar]
]
multi_index = Daru::MultiIndex.from_tuples(tuples)

vector_arry1 = [11,12,13,14,11,12,13,14,11,12,13,14]
vector_arry2 = [1,2,3,4,1,2,3,4,1,2,3,4]

order_mi = Daru::MultiIndex.from_tuples([
  [:a,:one,:bar],
  [:a,:two,:baz],
  [:b,:two,:foo],
  [:b,:one,:foo]])

df_mi = Daru::DataFrame.new([
  vector_arry1, 
  vector_arry2, 
  vector_arry1, 
  vector_arry2], order: order_mi, index: multi_index)
Out[2]:
Daru::DataFrame:24162060 rows: 12 cols: 4
[:a, :one, :bar][:a, :two, :baz][:b, :two, :foo][:b, :one, :foo]
[:a, :one, :bar]111111
[:a, :one, :baz]122122
[:a, :two, :bar]133133
[:a, :two, :baz]144144
[:b, :one, :bar]111111
[:b, :two, :bar]122122
[:b, :two, :baz]133133
[:b, :one, :foo]144144
[:c, :one, :bar]111111
[:c, :one, :baz]122122
[:c, :two, :foo]133133
[:c, :two, :bar]144144

To select a row from a multi-indexed dataframe, you should pass the full tuple to the #row[] method. Partial tuples will return whatever row that match the tuple partially.

In [3]:
# Specify complete tuple to choose a single row
df_mi.row[:a, :one,:bar]
Out[3]:
Daru::Vector:22866540 size: 4
0
[:a, :one, :bar]11
[:a, :two, :baz]1
[:b, :two, :foo]11
[:b, :one, :foo]1
In [4]:
# Specify partial tuple to select index hierarchially
df_mi.row[:a]
Out[4]:
Daru::DataFrame:22500640 rows: 4 cols: 4
[:a, :one, :bar][:a, :two, :baz][:b, :two, :foo][:b, :one, :foo]
[:one, :bar]111111
[:one, :baz]122122
[:two, :bar]133133
[:two, :baz]144144

Grouping

The DataFrame#group_by method allows you to group elements in the dataframe by name. This is similar to SQL GROUP BY.

In [5]:
# See grouped rows with the 'groups' method

df = Daru::DataFrame.new({
  a: %w{foo bar foo bar   foo bar foo foo},
  b: %w{one one two three two two one three},
  c:   [1  ,2  ,3  ,1    ,3  ,6  ,3  ,8],
  d:   [11 ,22 ,33 ,44   ,55 ,66 ,77 ,88]
})

# Pass the vectors that are to be grouped in an Array to the group_by method. This 
# will return a Daru::Core::GroupBy object.
grouped = df.group_by([:a, :b])

# See the groups created using the 'groups' method.
grouped.groups
Out[5]:
{["bar", "one"]=>[1], ["bar", "three"]=>[3], ["bar", "two"]=>[5], ["foo", "one"]=>[0, 6], ["foo", "three"]=>[7], ["foo", "two"]=>[2, 4]}
In [6]:
# First group by the columns :a and :b and then calculate mean of the grouped rows.
grouped.mean
Out[6]:
Daru::DataFrame:21484420 rows: 6 cols: 2
cd
["bar", "one"]222
["bar", "three"]144
["bar", "two"]666
["foo", "one"]2.044.0
["foo", "three"]888
["foo", "two"]3.044.0

The #get_group method can be used for accessing a particualar group(s).

In [7]:
grouped.get_group(["foo", "one"])
Out[7]:
Daru::DataFrame:21046320 rows: 2 cols: 4
abcd
0fooone111
6fooone377

Pivoting Data

Similar to Excel's Pivot Table, DataFrame provides the #pivot_table functions for quickly pivoting data around a particular value(s) and getting quick insights.

Lets demonstrate using some sales data.

In [8]:
sales = Daru::DataFrame.from_csv 'data/sales-funnel.csv'
Out[8]:
Daru::DataFrame:18812540 rows: 17 cols: 8
AccountManagerNamePriceProductQuantityRepStatus
0714466Debra HenleyTrantow-Barrows30000CPU1Craig Bookerpresented
1714466Debra HenleyTrantow-Barrows10000Software1Craig Bookerpresented
2714466Debra HenleyTrantow-Barrows5000Maintenance2Craig Bookerpending
3737550Debra HenleyFritsch, Russel and Anderson35000CPU1Craig Bookerdeclined
4146832Debra HenleyKiehn-Spinka65000CPU2Daniel Hiltonwon
5218895Debra HenleyKulas Inc40000CPU2Daniel Hiltonpending
6218895Debra HenleyKulas Inc10000Software1Daniel Hiltonpresented
7412290Debra HenleyJerde-Hilpert5000Maintenance2John Smithpending
8740150Debra HenleyBarton LLC35000CPU1John Smithdeclined
9141962Fred AndersonHerman LLC65000CPU2Cedric Mosswon
10163416Fred AndersonPurdy-Kunde30000CPU1Cedric Mosspresented
11239344Fred AndersonStokes LLC5000Maintenance1Cedric Mosspending
12239344Fred AndersonStokes LLC10000Software1Cedric Mosspresented
13307599Fred AndersonKassulke, Ondricka and Metz7000Maintenance3Wendy Yulewon
14688981Fred AndersonKeeling LLC100000CPU5Wendy Yulewon
15729833Fred AndersonKoepp Ltd65000CPU2Wendy Yuledeclined
16729833Fred AndersonKoepp Ltd5000Monitor2Wendy Yulepresented

The #pivot_table method accepts an option :index, in which you can specify what vectors you want to index your DataFrame against.

In [9]:
sales.pivot_table index: ['Manager', 'Rep']
Out[9]:
Daru::DataFrame:17258300 rows: 5 cols: 3
AccountPriceQuantity
["Debra Henley", "Craig Booker"]720237.020000.01.25
["Debra Henley", "Daniel Hilton"]194874.038333.3333333333361.6666666666666667
["Debra Henley", "John Smith"]576220.020000.01.5
["Fred Anderson", "Cedric Moss"]196016.527500.01.25
["Fred Anderson", "Wendy Yule"]614061.544250.03.0

You can also specify the :values option for specifying which Vector is to be used for the values.

The :vectors options lets you specify the columns to pivot against.

The :agg option specifies the aggregation function. This can be any stats method like :mean, :median, :product, etc.

In [10]:
sales.pivot_table(index: ['Manager','Rep'], values: 'Price', vectors: ['Product'], agg: :sum)
Out[10]:
Daru::DataFrame:15752920 rows: 5 cols: 12
["Account", "CPU"]["Account", "Software"]["Account", "Maintenance"]["Price", "CPU"]["Price", "Software"]["Price", "Maintenance"]["Quantity", "CPU"]["Quantity", "Software"]["Quantity", "Maintenance"]["Account", "Monitor"]["Price", "Monitor"]["Quantity", "Monitor"]
["Debra Henley", "Craig Booker"]145201671446671446665000100005000212
["Debra Henley", "Daniel Hilton"]3657272188951050001000041
["Debra Henley", "John Smith"]74015041229035000500012
["Fred Anderson", "Cedric Moss"]30537823934423934495000100005000311
["Fred Anderson", "Wendy Yule"]141881430759916500070007372983350002