This notebook covers the categorical data support that has been added to Daru.

In [1]:
require 'daru'
Out[1]:
true

Initialization

Initialize a vector of categorical data by passing type: :category.

In [2]:
dv = Daru::Vector.new [:a, 1, :a, 1, :c], type: :category
Out[2]:
Daru::Vector(5)
0 a
1 1
2 a
3 1
4 c
In [3]:
dv.frequencies
Out[3]:
Daru::Vector(3)
a 2
1 2
c 1

You can also declare categories up front with the categories option, including categories that do not occur in the data.

In [4]:
dv = Daru::Vector.new [:a, 1, :a, 1, :c], type: :category, categories: [:a, :b, :c, 1]
Out[4]:
Daru::Vector(5)
0 a
1 1
2 a
3 1
4 c

The categories option registers the categories and also specifies the order in which they appear. The frequency table now follows the order you specified.

In [5]:
dv.frequencies
Out[5]:
Daru::Vector(4)
a 2
b 0
c 1
1 2
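
As a rough illustration of the behaviour above, the following plain-Ruby sketch builds a frequency table in declared category order and reports 0 for categories that never occur. It does not use Daru, and ordered_frequencies is a hypothetical helper name, not part of Daru's API.

```ruby
# Sketch only: plain Ruby, not Daru's internals.
# Count each value, then report the counts in the declared category order,
# giving 0 to categories that never occur in the data.
def ordered_frequencies(data, categories)
  counts = data.each_with_object(Hash.new(0)) { |value, acc| acc[value] += 1 }
  categories.map { |category| [category, counts[category]] }.to_h
end

freq = ordered_frequencies([:a, 1, :a, 1, :c], [:a, :b, :c, 1])
freq.each { |category, count| puts "#{category} #{count}" }
```

For the data and categories used in In [4], this prints the same four rows as Out[5].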

Categorical data can be ordered or unordered. Specify which by passing ordered: true or ordered: false during initialization.

In [6]:
dv = Daru::Vector.new [:a, 1, :a, 1, :c], categories: [:a, :b, :c, 1], ordered: false, type: :category
Out[6]:
Daru::Vector(5)
0 a
1 1
2 a
3 1
4 c
In [7]:
dv.min
ArgumentError: Can not apply min when vector is unordered. To make the categorical data ordered, use #ordered = true
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:383:in `assert_ordered'
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:216:in `min'

As you can see, you can't do the comparison if the vector is not ordered. Let's make it ordered.

In [8]:
dv = Daru::Vector.new [:a, 1, :a, 1, :c], ordered: true, categories: [:a, :b, :c, 1], type: :category
Out[8]:
Daru::Vector(5)
0 a
1 1
2 a
3 1
4 c
In [9]:
dv.min
Out[9]:
:a
In [10]:
dv.sort!
Out[10]:
Daru::Vector(5)
0 a
2 a
4 c
1 1
3 1
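
One way to picture how an ordered categorical vector compares values is to rank each value by its position in the category list. The sketch below is plain Ruby under that assumption; category_rank, ordered_min and ordered_sort are hypothetical helpers, not Daru methods.

```ruby
# Sketch only: plain Ruby, not Daru's internals.
# Map each category to its position in the declared order.
def category_rank(categories)
  categories.each_with_index.to_h
end

# Smallest value according to category order.
def ordered_min(data, categories)
  rank = category_rank(categories)
  data.min_by { |value| rank.fetch(value) }
end

# Returns [original_index, value] pairs sorted by category rank.
# The original index breaks ties, so equal values keep their relative order.
def ordered_sort(data, categories)
  rank = category_rank(categories)
  data.each_with_index
      .sort_by { |value, index| [rank.fetch(value), index] }
      .map { |value, index| [index, value] }
end

data = [:a, 1, :a, 1, :c]
categories = [:a, :b, :c, 1]
puts ordered_min(data, categories)
ordered_sort(data, categories).each { |index, value| puts "#{index} #{value}" }
```

With the categories declared as [:a, :b, :c, 1], the value 1 ranks last, which is why it ends up at the bottom of the sorted vector in Out[10].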

#categories= and #categories

Besides setting categories during initialization, you can also set them after the vector has been initialized.

In [11]:
dv = Daru::Vector.new [:a, 1, :c, 1, :c], type: :category
dv.categories = [:a, :b, :c, 1]
Out[11]:
[:a, :b, :c, 1]

You can also check all the categories associated with the vector.

In [12]:
dv.categories
Out[12]:
[:a, :b, :c, 1]

#ordered= and #ordered?

You can also specify whether the vector is treated as ordered after it has been initialized.

Note: By default, the vector is unordered.

In [13]:
dv = Daru::Vector.new [:a, 1, :c, 1, :c], type: :category
dv.ordered?
Out[13]:
false
In [14]:
dv.ordered = true
dv.ordered?
Out[14]:
true

#summary

Here are a few measures that summarize a categorical vector.

In [15]:
dv = Daru::Vector.new [:a, :a, :a, :b, :b, :c], type: :category
dv.summary
Out[15]:
Daru::Vector(6)
size 6
categories 3
max_freq 3
max_category a
min_freq 1
min_category c
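
The measures in this summary can be derived from a simple count of each value. The following plain-Ruby sketch shows one possible derivation; category_summary is a hypothetical helper, and how Daru itself breaks ties between equally frequent categories is not shown in this notebook, so the tie handling here (first one encountered wins) is an assumption.

```ruby
# Sketch only: plain Ruby, not Daru's internals.
# Derive the summary measures from a count of each distinct value.
def category_summary(data)
  counts = data.each_with_object(Hash.new(0)) { |value, acc| acc[value] += 1 }
  # On ties, max_by and min_by return the first matching pair (an assumption
  # about tie handling, not something confirmed for Daru).
  max_category, max_freq = counts.max_by { |_, count| count }
  min_category, min_freq = counts.min_by { |_, count| count }
  {
    size: data.size,
    categories: counts.size,
    max_freq: max_freq,
    max_category: max_category,
    min_freq: min_freq,
    min_category: min_category
  }
end

category_summary([:a, :a, :a, :b, :b, :c]).each do |measure, value|
  puts "#{measure} #{value}"
end
```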

#frequencies

Gives the frequency of each category, listed in category order.

In [16]:
dv = Daru::Vector.new ['third']*3 + ['second']*2 + ['first'], type: :category, categories: ['first', 'second', 'third']
dv.frequencies
Out[16]:
Daru::Vector(3)
first 1
second 2
third 3

#min, #max and #sort!

Note: These operations only apply if the vector is ordered.

In [17]:
dv
Out[17]:
Daru::Vector(6)
0 third
1 third
2 third
3 second
4 second
5 first
In [18]:
dv.min
ArgumentError: Can not apply min when vector is unordered. To make the categorical data ordered, use #ordered = true
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:383:in `assert_ordered'
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:216:in `min'
In [19]:
dv.ordered = true
Out[19]:
true
In [20]:
dv.min
Out[20]:
"first"
In [21]:
dv.max
Out[21]:
"third"
In [22]:
dv.sort!
Out[22]:
Daru::Vector(6)
5 first
3 second
4 second
0 third
1 third
2 third

#add_category

Associates new categories with the vector.

Note: Before assigning a value that is not an existing category, register it with #add_category. For example:

In [23]:
dv
Out[23]:
Daru::Vector(6)
5 first
3 second
4 second
0 third
1 third
2 third
In [24]:
dv[0] = 'fourth'
ArgumentError: Invalid category fourth, to add a new category use #add_category
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:505:in `modify_category_at'
/home/ubuntu/.rvm/gems/ruby-2.2.3/gems/daru-0.1.3.1/lib/daru/category.rb:144:in `[]='
In [25]:
dv.add_category 'fourth'
dv[0] = 'fourth'
dv
Out[25]:
Daru::Vector(6)
5 first
3 second
4 second
0 fourth
1 third
2 third
In [26]:
dv.categories
Out[26]:
["first", "second", "third", "fourth"]
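
The registration check described above can be sketched as a small class that rejects assignments of unknown categories. This is a simplified illustration in plain Ruby: TinyCategorical is a hypothetical class, and unlike Daru's #[]= it addresses elements by position rather than by index label.

```ruby
# Sketch only: a hypothetical stand-in, not Daru::Vector.
class TinyCategorical
  attr_reader :categories

  def initialize(values)
    @values = values.dup
    @categories = values.uniq
  end

  # Register a new category so it can be assigned later.
  def add_category(category)
    @categories << category unless @categories.include?(category)
  end

  # Assign by position (Daru assigns by index label instead).
  def []=(position, value)
    unless @categories.include?(value)
      raise ArgumentError,
            "Invalid category #{value}, to add a new category use #add_category"
    end
    @values[position] = value
  end

  def to_a
    @values.dup
  end
end

vector = TinyCategorical.new(%w[first second second third third third])
begin
  vector[3] = 'fourth'
rescue ArgumentError => e
  puts e.message
end

vector.add_category('fourth')
vector[3] = 'fourth'
p vector.to_a
p vector.categories
```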

#rename_categories

You can rename a subset of the existing categories by passing a hash that maps old names to new ones.

In [27]:
dv = Daru::Vector.new [1, 2, 'third', 2, 1], type: :category
Out[27]:
Daru::Vector(5)
0 1
1 2
2 third
3 2
4 1
In [28]:
dv.rename_categories 1 => 'first', 2 => 'second'
dv
Out[28]:
Daru::Vector(5)
0 first
1 second
2 third
3 second
4 first
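
The effect of a partial rename can be sketched in plain Ruby as a lookup that falls back to the original value when the hash has no entry for it. rename_values is a hypothetical helper, not a Daru method.

```ruby
# Sketch only: plain Ruby, not Daru's internals.
# Values with an entry in the mapping are replaced; all others are kept.
def rename_values(data, mapping)
  data.map { |value| mapping.fetch(value, value) }
end

renamed = rename_values([1, 2, 'third', 2, 1], { 1 => 'first', 2 => 'second' })
puts renamed.join(' ')
```

Because 'third' has no entry in the hash, it passes through unchanged, as in Out[28].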

Indexing

#[], #[]=, #at, #set_at

Indexing works the same way as for an ordinary vector, so these methods behave as you would expect. Here are a few examples:

In [29]:
dv = Daru::Vector.new [1, 1, 2, 2, 3, 1], index: :a..:f, type: :category
Out[29]:
Daru::Vector(6)
a 1
b 1
c 2
d 2
e 3
f 1
In [30]:
dv[0..2]
Out[30]:
Daru::Vector(3)
a 1
b 1
c 2
In [31]:
dv.at(-1)
Out[31]:
1
In [32]:
dv.set_at [0, 1], 3
dv
Out[32]:
Daru::Vector(6)
a 3
b 3
c 2
d 2
e 3
f 1

Querying

#where

Daru uses Arel-like syntax for querying data.

In [33]:
dv = Daru::Vector.new ['I', 'II', 'I', 'III', 'IV', 'I', 'II'], type: :category, categories: ['I', 'II', 'III', 'IV']
dv.ordered = true
dv.frequencies
Out[33]:
Daru::Vector(4)
I 3
II 2
III 1
IV 1
In [34]:
dv.where(dv.eq('I'))
Out[34]:
Daru::Vector(3)
0 I
2 I
5 I
In [35]:
dv.where(dv.gt('II'))
Out[35]:
Daru::Vector(2)
3 III
4 IV
In [40]:
df = Daru::DataFrame.new({
  a: (1..7).to_a,
  b: ('a'..'g').to_a,
  c: ['I', 'II', 'I', 'III', 'IV', 'I', 'II']
  })
Out[40]:
Daru::DataFrame(7x3)
a b c
0 1 a I
1 2 b II
2 3 c I
3 4 d III
4 5 e IV
5 6 f I
6 7 g II
In [41]:
df.c = df.c.to_category
df
Out[41]:
Daru::DataFrame(7x3)
a b c
0 1 a I
1 2 b II
2 3 c I
3 4 d III
4 5 e IV
5 6 f I
6 7 g II
In [46]:
df.where(df.c.gt('I') & df.c.lt('IV'))
Out[46]:
Daru::DataFrame(3x3)
a b c
1 2 b II
3 4 d III
6 7 g II
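
The queries above can be read as building boolean masks from category ranks and then keeping the rows where the mask is true. The plain-Ruby sketch below illustrates that idea for the compound condition in In [46]; compare_mask, and_masks and select_rows are hypothetical helpers, not Daru's API.

```ruby
# Sketch only: plain Ruby, not Daru's internals.
# Compare every value's category rank against the threshold's rank.
def compare_mask(data, categories, operator, threshold)
  rank = categories.each_with_index.to_h
  limit = rank.fetch(threshold)
  data.map { |value| rank.fetch(value).public_send(operator, limit) }
end

# Element-wise AND of two boolean masks of equal length.
def and_masks(left, right)
  left.zip(right).map { |a, b| a && b }
end

# Keep only the rows whose mask entry is true.
def select_rows(rows, mask)
  rows.each_with_index.select { |_, index| mask[index] }.map(&:first)
end

categories = ['I', 'II', 'III', 'IV']
c = ['I', 'II', 'I', 'III', 'IV', 'I', 'II']
rows = (0...c.size).map { |i| [i, i + 1, ('a'.ord + i).chr, c[i]] }

mask = and_masks(compare_mask(c, categories, :>, 'I'),
                 compare_mask(c, categories, :<, 'IV'))
select_rows(rows, mask).each { |row| puts row.join(' ') }
```

Each row holds the index followed by the a, b and c values, so the output has the same three rows as Out[46].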

Contrast coding

Categorical data supports four contrast coding schemes:

  1. Dummy Coding (Default)
  2. Simple Coding
  3. Helmert Coding
  4. Deviation Coding
In [50]:
dv = Daru::Vector.new ['I', 'II', 'I', 'III', 'IV', 'I', 'II'], type: :category, categories: ['I', 'II', 'III', 'IV']
dv.name = 'Rank'
dv.contrast_code
Out[50]:
Daru::DataFrame(7x3)
Rank_II Rank_III Rank_IV
0 0 0 0
1 1 0 0
2 0 0 0
3 0 1 0
4 0 0 1
5 0 0 0
6 1 0 0

You can change the base category using #base_category=:

In [51]:
dv.base_category = 'IV'
dv.contrast_code
Out[51]:
Daru::DataFrame(7x3)
Rank_I Rank_II Rank_III
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 0 0
5 1 0 0
6 0 1 0
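
Dummy coding gives every non-base category its own 0/1 column and encodes the base category as a row of zeros. The plain-Ruby sketch below reproduces that layout; dummy_code is a hypothetical helper, and defaulting the base to the first category is an assumption drawn from the output of In [50].

```ruby
# Sketch only: plain Ruby, not Daru's internals.
# One 0/1 column per non-base category; the base category is all zeros.
# Defaulting the base to the first category is an assumption based on Out[50].
def dummy_code(data, categories, base: categories.first)
  columns = categories - [base]
  data.map { |value| columns.map { |category| value == category ? 1 : 0 } }
end

categories = ['I', 'II', 'III', 'IV']
ranks = ['I', 'II', 'I', 'III', 'IV', 'I', 'II']

puts 'base I:'
dummy_code(ranks, categories).each { |row| puts row.join(' ') }

puts 'base IV:'
dummy_code(ranks, categories, base: 'IV').each { |row| puts row.join(' ') }
```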

To use a different coding scheme, set it with #coding_scheme=:

In [55]:
dv.coding_scheme = :deviation
dv.contrast_code
Out[55]:
Daru::DataFrame(7x3)
Rank_I Rank_II Rank_III
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 -1 -1 -1
5 1 0 0
6 0 1 0
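
Judging from the output above, deviation coding differs from dummy coding only in the base category's row, which is -1 in every column instead of 0. A plain-Ruby sketch under that reading follows; deviation_code is a hypothetical helper, not a Daru method.

```ruby
# Sketch only: plain Ruby, not Daru's internals.
# Non-base categories get a 1 in their own column and 0 elsewhere;
# the base category is encoded as -1 in every column.
def deviation_code(data, categories, base:)
  columns = categories - [base]
  data.map do |value|
    if value == base
      Array.new(columns.size, -1)
    else
      columns.map { |category| value == category ? 1 : 0 }
    end
  end
end

categories = ['I', 'II', 'III', 'IV']
ranks = ['I', 'II', 'I', 'III', 'IV', 'I', 'II']
deviation_code(ranks, categories, base: 'IV').each { |row| puts row.join(' ') }
```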