In this notebook we'll go over ways to group, split and pivot data using the DataFrame#group_by and and DataFrame#pivot_table functions.
require 'daru'
Lets first create a hierarchically indexed DataFrame.
# Create a multi-indexed DataFrame
tuples = [
[:a,:one,:bar],
[:a,:one,:baz],
[:a,:two,:bar],
[:a,:two,:baz],
[:b,:one,:bar],
[:b,:two,:bar],
[:b,:two,:baz],
[:b,:one,:foo],
[:c,:one,:bar],
[:c,:one,:baz],
[:c,:two,:foo],
[:c,:two,:bar]
]
multi_index = Daru::MultiIndex.from_tuples(tuples)
vector_arry1 = [11,12,13,14,11,12,13,14,11,12,13,14]
vector_arry2 = [1,2,3,4,1,2,3,4,1,2,3,4]
order_mi = Daru::MultiIndex.from_tuples([
[:a,:one,:bar],
[:a,:two,:baz],
[:b,:two,:foo],
[:b,:one,:foo]])
df_mi = Daru::DataFrame.new([
vector_arry1,
vector_arry2,
vector_arry1,
vector_arry2], order: order_mi, index: multi_index)
To select a row from a multi-indexed dataframe, you should pass the full tuple to the #row[]
method. Partial tuples will return whatever row that match the tuple partially.
# Specify complete tuple to choose a single row
df_mi.row[:a, :one,:bar]
# Specify partial tuple to select index hierarchially
df_mi.row[:a]
The DataFrame#group_by method allows you to group elements in the dataframe by name. This is similar to SQL GROUP BY.
# See grouped rows with the 'groups' method
df = Daru::DataFrame.new({
a: %w{foo bar foo bar foo bar foo foo},
b: %w{one one two three two two one three},
c: [1 ,2 ,3 ,1 ,3 ,6 ,3 ,8],
d: [11 ,22 ,33 ,44 ,55 ,66 ,77 ,88]
})
# Pass the vectors that are to be grouped in an Array to the group_by method. This
# will return a Daru::Core::GroupBy object.
grouped = df.group_by([:a, :b])
# See the groups created using the 'groups' method.
grouped.groups
# First group by the columns :a and :b and then calculate mean of the grouped rows.
grouped.mean
The #get_group method can be used for accessing a particualar group(s).
grouped.get_group(["foo", "one"])
Similar to Excel's Pivot Table, DataFrame provides the #pivot_table functions for quickly pivoting data around a particular value(s) and getting quick insights.
Lets demonstrate using some sales data.
sales = Daru::DataFrame.from_csv 'data/sales-funnel.csv'
The #pivot_table method accepts an option :index, in which you can specify what vectors you want to index your DataFrame against.
sales.pivot_table index: ['Manager', 'Rep']
You can also specify the :values
option for specifying which Vector is to be used for the values.
The :vectors
options lets you specify the columns to pivot against.
The :agg
option specifies the aggregation function. This can be any stats method like :mean, :median, :product, etc.
sales.pivot_table(index: ['Manager','Rep'], values: 'Price', vectors: ['Product'], agg: :sum)