Searching and Combining data in daru

Arel-like query syntax

Arel is a very popular ruby gem that is one of the major components of the most popular ruby frameworks, Rails. It is an ORM-helper of sorts that exposes a beautiful and intuitive syntax for creating SQL strings by chaining Ruby methods.

In daru, we have successfully adopted this syntax and the result is a very intuitive and readable syntax for obtaining any sort of data from a DataFrame or Vector.

Lets see how this syntax works with some examples:

In [1]:
require 'daru'
Out[1]:
true

To use this syntax we call any of the comparator methods defined on Daru::Vector and pass the results from these to the #where method available for both DataFrame and Vector.

To demonstrate with a quick example:

In [2]:
vector = Daru::Vector.new([2,4,5,51,5,16,2,5,3,2,1,5,2,5,2,1,56,234,6,21])
vector.where((vector.eq(5) | vector.eq(1)) & vector.mt(2))
Out[2]:
Daru::Vector:21491980 size: 5
nil
25
45
75
115
135

Note that we use union OR (|) and union AND (&) and not logical OR (||) or logical AND (&&)

Thus the results returned by the Vector#eq method are evaluated by Vector#where and the generated Vector is returned. The index is also preserved.

The where clause can also be used with DataFrame, with similar results.

In [3]:
df = Daru::DataFrame.new({
  a: [1,2,3,4,5,6]*100,
  b: ['a','b','c','d','e','f']*100,
  c: [11,22,33,44,55,66]*100
}, index: (1..600).to_a.shuffle)

df.where(df[:a].eq(2) | df[:c].eq(55))
Out[3]:
Daru::DataFrame:19261300 rows: 200 cols: 3
abc
2632b22
2485e55
3392b22
1605e55
5052b22
3115e55
1992b22
2325e55
4102b22
1255e55
3502b22
5725e55
292b22
1665e55
682b22
2705e55
2732b22
4165e55
112b22
4925e55
852b22
1985e55
3062b22
3645e55
5892b22
2205e55
4672b22
3415e55
2692b22
1905e55
1582b22
3545e55
............
5145e55

The comparator methods on Vector return an object of type Daru::Core::Query::BoolArray which lets us perform OR and AND operations on it. See this blog post for more information on BoolArray and other comparator methods.

As a convenience for readability, you can also you the #and or #or methods instead of #& and #|.

In [4]:
vector.where(vector.eq(2).or(vector.eq(5)).and(vector.mt(2)))
Out[4]:
Daru::Vector:18369620 size: 5
nil
25
45
75
115
135

One of the major advantages of using the where clause over other more robust methods like DataFrame#filter or Vector#keep_if is that it is much faster (though not destructive). These benchmarks prove my point.

Performing joins between DataFrames

Daru::DataFrame offers the #join method for performing SQL style joins between two DataFrames. Currently #join supports inner, left outer, right outer and full outer joins between DataFrames.

To demonstrate:

In [5]:
left = Daru::DataFrame.new({
  :id   => [1,2,3,4],
  :name => ['Pirate', 'Monkey', 'Ninja', 'Spaghetti']
})
right = Daru::DataFrame.new({
  :id => [1,2,3,4],
  :name => ['Rutabaga', 'Pirate', 'Darth Vader', 'Ninja']
})
nil

To perform an inner join on the :name column:

In [6]:
left.join(right, on: [:name], how: :inner)
Out[6]:
Daru::DataFrame:12529040 rows: 2 cols: 3
id_1nameid_2
01Pirate2
13Ninja4

An outer left join can be done with:

In [7]:
left.join(right, on: [:name], how: :left)
Out[7]:
Daru::DataFrame:12260160 rows: 4 cols: 3
id_1nameid_2
01Pirate2
12Monkey
23Ninja4
34Spaghetti

An outer right join can be done like so:

In [8]:
left.join(right, on: [:name], how: :right)
Out[8]:
Daru::DataFrame:11720640 rows: 4 cols: 3
id_1nameid_2
0Rutabaga1
11Pirate2
2Darth Vader3
33Ninja4

And finally, a full outer join:

In [9]:
left.join(right, on: [:name], how: :outer)
Out[9]:
Daru::DataFrame:10555680 rows: 6 cols: 3
id_1nameid_2
01Pirate2
12Monkey
23Ninja4
34Spaghetti
4Rutabaga1
5Darth Vader3