Introduction to daru (Data Analysis in RUby)

Sameer Deshmukh

Deccan Ruby Conf 2015, Pune, India.

In [1]:
require 'daru'
require 'distribution'
require 'gnuplotrb'
Out[1]:
true

Creating a Daru::Vector

Pass the data as the first argument, attach labels to it with the :index option, and name the vector with the :name option.

In [2]:
vector = Daru::Vector.new(
  [20,40,25,50,45,12], index: ['cherry', 'apple', 'barley', 'wheat', 'rice', 'sugar'], 
  name: "Prices of stuff.")
Out[2]:
Daru::Vector:15070620 size: 6
        Prices of stuff.
cherry  20
apple   40
barley  25
wheat   50
rice    45
sugar   12

Retrieve a single value

Pass the index label of the value you want to the #[] operator.

In [3]:
vector['rice']
Out[3]:
45

Retrieve multiple values

Multiple values can be retrieved at once, as another Daru::Vector, by passing several index labels separated by commas.

In [4]:
vector['rice', 'wheat', 'sugar']
Out[4]:
Daru::Vector:14387920 size: 3
       Prices of stuff.
rice   45
wheat  50
sugar  12

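Retrieve a slice with a Range

Specifying a Range of index labels retrieves a slice of the Daru::Vector. The slice follows the order of the index, not the alphabetical order of the labels, and includes both endpoints.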

In [5]:
vector['barley'..'sugar']
Out[5]:
Daru::Vector:14063700 size: 4
        Prices of stuff.
barley  25
wheat   50
rice    45
sugar   12
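
The three kinds of lookup shown so far (one label, several labels, a Range of labels) can be pictured with a small plain-Ruby sketch. LabeledList is a hypothetical class written for illustration only. It is not part of daru and makes no claim about how daru stores its index; it returns plain Hashes where daru returns a Daru::Vector.

```ruby
# Plain-Ruby sketch of label-based lookup. LabeledList is a hypothetical
# name and is not part of daru.
class LabeledList
  def initialize(values, labels)
    @values = values
    @labels = labels
    # Map each label to its position so a single lookup is a hash access.
    @positions = labels.each_with_index.to_h
  end

  # Accepts one label, several labels, or a Range of labels.
  def [](*keys)
    key = keys.first
    if keys.size == 1 && key.is_a?(Range)
      first = @positions.fetch(key.begin)
      last = @positions.fetch(key.end)
      # Slice by position, so the result follows the index order.
      @labels[first..last].zip(@values[first..last]).to_h
    elsif keys.size == 1
      @values[@positions.fetch(key)]
    else
      keys.map { |k| [k, @values[@positions.fetch(k)]] }.to_h
    end
  end
end

prices = LabeledList.new(
  [20, 40, 25, 50, 45, 12],
  %w[cherry apple barley wheat rice sugar]
)

single = prices['rice']
several = prices['rice', 'wheat', 'sugar']
slice = prices['barley'..'sugar']
```

The Range branch looks up the positions of the two endpoint labels and slices between them, which is why 'barley'..'sugar' yields barley, wheat, rice and sugar even though those labels are not in alphabetical order.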

Assign a value

Assign a value by passing its index label to the #[]= operator.

In [6]:
vector['barley'] = 1500
vector
Out[6]:
Daru::Vector:15070620 size: 6
        Prices of stuff.
cherry  20
apple   40
barley  1500
wheat   50
rice    45
sugar   12

Creating a Daru::DataFrame

The :index option specifies the row index of the DataFrame, and the :order option determines the order in which the columns are stored.

Note that this is only one way of creating a DataFrame. There are around 8 different ways you can do so, depending on your use case.

In [7]:
df = Daru::DataFrame.new({
  'col0' => [1,2,3,4,5,6],
  'col2' => ['a','b','c','d','e','f'],
  'col1' => [11,22,33,44,55,66]
  }, 
  index: ['one', 'two', 'three', 'four', 'five', 'six'], 
  order: ['col0', 'col1', 'col2']
)
Out[7]:
Daru::DataFrame:13337740 rows: 6 cols: 3
       col0  col1  col2
one    1     11    a
two    2     22    b
three  3     33    c
four   4     44    d
five   5     55    e
six    6     66    f

Accessing a Column

A DataFrame column can be accessed using the DataFrame#[] operator.

Note that it returns a Daru::Vector

In [8]:
df['col1']
Out[8]:
Daru::Vector:13292960 size: 6
       col1
one    11
two    22
three  33
four   44
five   55
six    66

Accessing multiple Columns

Multiple columns can be accessed by separating them with a comma. The result is another DataFrame.

In [9]:
df['col2', 'col0']
Out[9]:
Daru::DataFrame:12423020 rows: 6 cols: 2
       col2  col0
one    a     1
two    b     2
three  c     3
four   d     4
five   e     5
six    f     6

Accessing a Range of Columns

A slice of the DataFrame by columns can be obtained by specifying a Range in #[]

In [10]:
df['col1'..'col2']
Out[10]:
Daru::DataFrame:12007160 rows: 6 cols: 2
       col1  col2
one    11    a
two    22    b
three  33    c
four   44    d
five   55    e
six    66    f

Assigning a Column

You can assign a Daru::Vector to a column. Its values are matched to the DataFrame rows by index label, not by position, so the Vector may list its labels in a different order from the DataFrame.

In [11]:
df['col1'] = Daru::Vector.new(['this', 'is', 'some','new','data','here'], 
  index: ['one', 'three','two','six','four', 'five'])
df
Out[11]:
Daru::DataFrame:13337740 rows: 6 cols: 3
       col0  col1  col2
one    1     this  a
two    2     some  b
three  3     is    c
four   4     data  d
five   5     here  e
six    6     new   f
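
The reordering visible in col1 above can be reproduced with a few lines of plain Ruby. This is only a sketch of the idea of aligning by label, with variable names chosen for illustration; it is not daru's implementation.

```ruby
# Row labels of the DataFrame, in the DataFrame's own order.
frame_index = %w[one two three four five six]

# Incoming column: the values and the labels they belong to,
# listed in a different order from the DataFrame.
new_values = %w[this is some new data here]
new_labels = %w[one three two six four five]

# Build a label => value lookup, then read it back in frame order.
by_label = new_labels.zip(new_values).to_h
aligned = frame_index.map { |label| by_label[label] }
```

Reading the lookup back in the order of frame_index places each value on the row that carries the same label, regardless of where it sat in the incoming list.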

Accessing a Row

A single row can be accessed using the #row[] function.

In [12]:
df.row['four']
Out[12]:
Daru::Vector:11115780 size: 3
      four
col0  4
col1  data
col2  d

Accessing a Range of Rows

Specifying a Range of Row indexes in #row[] will select a DataFrame with those rows

In [13]:
df.row['three'..'five']
Out[13]:
Daru::DataFrame:9135240 rows: 3 cols: 3
       col0  col1  col2
three  3     is    c
four   4     data  d
five   5     here  e

Assigning a Row

You can also assign a row with a Daru::Vector. When a plain Array is assigned, as below, its values are matched to the columns in the order of the DataFrame.

In [14]:
df.row['five'] = [666,555,333]
Out[14]:
[666, 555, 333]

Statistics on Vector with missing data

A host of static and rolling statistics methods are provided on Daru::Vector.

Note that missing data (very common in most real-world scenarios) is handled gracefully: the nil entries are left out of the calculation.

In [15]:
vector = Daru::Vector.new([1,3,5,nil,2,53,nil])
vector.mean
Out[15]:
12.8
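
The result 12.8 is the sum of the five present values (64) divided by 5, not by 7. A plain-Ruby sketch of the same calculation, shown only to make the arithmetic explicit and not taken from daru's source:

```ruby
values = [1, 3, 5, nil, 2, 53, nil]

# Drop the missing entries first, then average what remains.
present = values.compact
mean = present.sum / present.size.to_f
```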

Statistics on DataFrame

DataFrame statistics apply the given method to every numerical column of the DataFrame. Here col0 is the only numerical column left, so it is the only one in the result.

In [16]:
df.mean
Out[16]:
Daru::Vector:8060380 size: 1
      mean
col0  113.66666666666667

Summary statistics for the numerical vectors in a DataFrame can be obtained with #describe.

In [17]:
df.describe
Out[17]:
Daru::DataFrame:7470980 rows: 5 cols: 1
       col0
count  6
mean   113.66666666666667
std    270.5924364550249
min    1
max    666
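
The five figures above can be recomputed by hand. The sketch below uses plain Ruby on the values of col0 after the row assignment; it assumes the sample standard deviation (dividing by n - 1), an assumption that happens to reproduce the std value shown in the output.

```ruby
# Values of col0 after row 'five' was replaced with [666, 555, 333].
col0 = [1, 2, 3, 4, 666, 6]

count = col0.size
mean = col0.sum / count.to_f
# Sample variance: divide the squared deviations by (n - 1).
variance = col0.sum { |x| (x - mean)**2 } / (count - 1)
std = Math.sqrt(variance)

summary = { count: count, mean: mean, std: std, min: col0.min, max: col0.max }
```

The single value 666 dominates both the mean and the standard deviation, which is worth keeping in mind when reading a summary like this one.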

Time Series Support

Daru offers a robust time series manipulation API for indexing data based on timestamps. This makes daru a viable tool for analyzing financial data (or any data that changes with time)

The DateTimeIndex

The DateTimeIndex is a special index for indexing data based on timestamps.

A date index range can be created using the DateTimeIndex.date_range function. The :freq option sets the interval between consecutive timestamps in the date index; '3D' below gives one timestamp every three days.

In [18]:
index = Daru::DateTimeIndex.date_range(:start => '2012', :periods => 1000, :freq => '3D')
Out[18]:
#<DateTimeIndex:6151760 offset=3D periods=1000 data=[2012-01-01T00:00:00+00:00...2020-03-16T00:00:00+00:00]>
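
To see where the end date in the output comes from, the same grid of timestamps can be built with Ruby's standard date library. The date_range helper below is hypothetical and written for this illustration; it is not daru's implementation and supports only a step expressed in whole days.

```ruby
require 'date'

# Hypothetical helper, not part of daru: build `periods` timestamps
# starting at `start`, spaced `step_days` days apart.
def date_range(start, periods, step_days)
  Array.new(periods) { |i| start + i * step_days }
end

index = date_range(DateTime.new(2012, 1, 1), 1000, 3)
first_stamp = index.first.iso8601
last_stamp = index.last.iso8601
```

With 1000 periods the last timestamp lies 999 steps of 3 days (2997 days) after the start, which lands on 16 March 2020.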

A Daru::Vector can be created by simply passing the newly created index object into the :index argument.

In [19]:
timeseries = Daru::Vector.new(1000.times.map {rand}, index: index)
Out[19]:
Daru::Vector:5628020 size: 1000
                           nil
2012-01-01T00:00:00+00:00  0.692831672574459
2012-01-04T00:00:00+00:00  0.6971783281963972
2012-01-07T00:00:00+00:00  0.34687766698487965
2012-01-10T00:00:00+00:00  0.5509404993547384
2012-01-13T00:00:00+00:00  0.10166975999865946
2012-01-16T00:00:00+00:00  0.34183413903843207
2012-01-19T00:00:00+00:00  0.018428168123970967
2012-01-22T00:00:00+00:00  0.7792652522504137
2012-01-25T00:00:00+00:00  0.24793667731961144
2012-01-28T00:00:00+00:00  0.7200752551979407
2012-01-31T00:00:00+00:00  0.770756064084555
2012-02-03T00:00:00+00:00  0.6475396341969668
2012-02-06T00:00:00+00:00  0.00034544180080875453
2012-02-09T00:00:00+00:00  0.9881939271758362
2012-02-12T00:00:00+00:00  0.042428559674003274
2012-02-15T00:00:00+00:00  0.6604582692043693
2012-02-18T00:00:00+00:00  0.6446959879056338
2012-02-21T00:00:00+00:00  0.11606340772777746
2012-02-24T00:00:00+00:00  0.5238981665473298
2012-02-27T00:00:00+00:00  0.25979569124671453
2012-03-01T00:00:00+00:00  0.1808967702663009
2012-03-04T00:00:00+00:00  0.04614156947957693
2012-03-07T00:00:00+00:00  0.8935716437439504
2012-03-10T00:00:00+00:00  0.7197074871013468
2012-03-13T00:00:00+00:00  0.20741375904156445
2012-03-16T00:00:00+00:00  0.501647901862296
2012-03-19T00:00:00+00:00  0.9470421480253584
2012-03-22T00:00:00+00:00  0.2954430257659184
2012-03-25T00:00:00+00:00  0.18422816661946229
2012-03-28T00:00:00+00:00  0.48737285121462925
2012-03-31T00:00:00+00:00  0.7549290269495055
2012-04-03T00:00:00+00:00  0.8216050188191338
...                        ...
2020-03-16T00:00:00+00:00  0.8324422863437039

Accessing data by partial timestamps

When a Vector or DataFrame is indexed by a DateTimeIndex, you can specify a date partially to retrieve all the data that falls within it.

For example, to access all the data belonging to the year 2012:

In [20]:
timeseries['2012']
Out[20]:
Daru::Vector:15406520 size: 122
                           nil
2012-01-01T00:00:00+00:00  0.692831672574459
2012-01-04T00:00:00+00:00  0.6971783281963972
2012-01-07T00:00:00+00:00  0.34687766698487965
2012-01-10T00:00:00+00:00  0.5509404993547384
2012-01-13T00:00:00+00:00  0.10166975999865946
2012-01-16T00:00:00+00:00  0.34183413903843207
2012-01-19T00:00:00+00:00  0.018428168123970967
2012-01-22T00:00:00+00:00  0.7792652522504137
2012-01-25T00:00:00+00:00  0.24793667731961144
2012-01-28T00:00:00+00:00  0.7200752551979407
2012-01-31T00:00:00+00:00  0.770756064084555
2012-02-03T00:00:00+00:00  0.6475396341969668
2012-02-06T00:00:00+00:00  0.00034544180080875453
2012-02-09T00:00:00+00:00  0.9881939271758362
2012-02-12T00:00:00+00:00  0.042428559674003274
2012-02-15T00:00:00+00:00  0.6604582692043693
2012-02-18T00:00:00+00:00  0.6446959879056338
2012-02-21T00:00:00+00:00  0.11606340772777746
2012-02-24T00:00:00+00:00  0.5238981665473298
2012-02-27T00:00:00+00:00  0.25979569124671453
2012-03-01T00:00:00+00:00  0.1808967702663009
2012-03-04T00:00:00+00:00  0.04614156947957693
2012-03-07T00:00:00+00:00  0.8935716437439504
2012-03-10T00:00:00+00:00  0.7197074871013468
2012-03-13T00:00:00+00:00  0.20741375904156445
2012-03-16T00:00:00+00:00  0.501647901862296
2012-03-19T00:00:00+00:00  0.9470421480253584
2012-03-22T00:00:00+00:00  0.2954430257659184
2012-03-25T00:00:00+00:00  0.18422816661946229
2012-03-28T00:00:00+00:00  0.48737285121462925
2012-03-31T00:00:00+00:00  0.7549290269495055
2012-04-03T00:00:00+00:00  0.8216050188191338
...                        ...
2012-12-29T00:00:00+00:00  0.26155523165437944

Or to access data whose time stamp is March 2012...

In [21]:
timeseries['2012-3']
Out[21]:
Daru::Vector:14832480 size: 11
                           nil
2012-03-01T00:00:00+00:00  0.1808967702663009
2012-03-04T00:00:00+00:00  0.04614156947957693
2012-03-07T00:00:00+00:00  0.8935716437439504
2012-03-10T00:00:00+00:00  0.7197074871013468
2012-03-13T00:00:00+00:00  0.20741375904156445
2012-03-16T00:00:00+00:00  0.501647901862296
2012-03-19T00:00:00+00:00  0.9470421480253584
2012-03-22T00:00:00+00:00  0.2954430257659184
2012-03-25T00:00:00+00:00  0.18422816661946229
2012-03-28T00:00:00+00:00  0.48737285121462925
2012-03-31T00:00:00+00:00  0.7549290269495055

Specifying the date precisely returns the exact data point. (You can also pass a Ruby DateTime object to obtain a single data point.)

In [22]:
timeseries['2012-3-10']
Out[22]:
0.7197074871013468
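
The sizes reported by the partial lookups above (122 entries for 2012, 11 for March 2012, one for 10 March 2012) can be checked with a plain-Ruby filter over the same 3-day grid. select_partial is a hypothetical helper written for this sketch; daru itself may resolve partial timestamps quite differently.

```ruby
require 'date'

# Same 3-day grid as the index above, built with plain Ruby dates.
stamps = Array.new(1000) { |i| Date.new(2012, 1, 1) + i * 3 }

# Hypothetical helper: keep the stamps that match every date part given.
# Parts left as nil are treated as "any value".
def select_partial(stamps, year:, month: nil, day: nil)
  stamps.select do |d|
    d.year == year &&
      (month.nil? || d.month == month) &&
      (day.nil? || d.day == day)
  end
end

in_2012 = select_partial(stamps, year: 2012)
in_march = select_partial(stamps, year: 2012, month: 3)
on_march_10 = select_partial(stamps, year: 2012, month: 3, day: 10)
```

The less of the date is specified, the wider the match: a year selects every stamp in that year, a year and month narrow it to that month, and a full date leaves at most one stamp on this grid.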

Say you have per-second data about the price of a commodity and want the prices for the minute starting at 12:42 pm on 23 March 2012:

In [23]:
index      = Daru::DateTimeIndex.date_range(
  :start => '2012-3-23 11:00', :periods => 20000, :freq => 'S')

seconds_ts = Daru::Vector.new(20000.times.map { rand(50) }, index: index)
seconds_ts['2012-3-23 12:42']
Out[23]:
Daru::Vector:28416340 size: 60
                           nil
2012-03-23T12:42:00+00:00  4
2012-03-23T12:42:01+00:00  32
2012-03-23T12:42:02+00:00  35
2012-03-23T12:42:03+00:00  35
2012-03-23T12:42:04+00:00  14
2012-03-23T12:42:05+00:00  1
2012-03-23T12:42:06+00:00  43
2012-03-23T12:42:07+00:00  39
2012-03-23T12:42:08+00:00  20
2012-03-23T12:42:09+00:00  16
2012-03-23T12:42:10+00:00  43
2012-03-23T12:42:11+00:00  0
2012-03-23T12:42:12+00:00  27
2012-03-23T12:42:13+00:00  43
2012-03-23T12:42:14+00:00  43
2012-03-23T12:42:15+00:00  18
2012-03-23T12:42:16+00:00  35
2012-03-23T12:42:17+00:00  39
2012-03-23T12:42:18+00:00  35
2012-03-23T12:42:19+00:00  23
2012-03-23T12:42:20+00:00  25
2012-03-23T12:42:21+00:00  13
2012-03-23T12:42:22+00:00  5
2012-03-23T12:42:23+00:00  43
2012-03-23T12:42:24+00:00  13
2012-03-23T12:42:25+00:00  28
2012-03-23T12:42:26+00:00  2
2012-03-23T12:42:27+00:00  42
2012-03-23T12:42:28+00:00  29
2012-03-23T12:42:29+00:00  36
2012-03-23T12:42:30+00:00  44
2012-03-23T12:42:31+00:00  36
...                        ...
2012-03-23T12:42:59+00:00  8

Visualization

Simple Visualization with interactive graphs

Plotting a simple scatter plot from a DataFrame. Nyaplot integration provides interactivity.

The DataFrame below holds the ice cream sales of a particular food chain against the maximum temperature recorded in the city. It also lists the staff strength present in each city.

In [24]:
df = Daru::DataFrame.new({
  :temperature => [30.4, 23.5, 44.5, 20.3, 34, 24, 31.45, 28.34, 37, 24],
  :sales       => [350, 150, 500, 200, 480, 250, 330, 400, 420, 560],
  :city        => ['Pune', 'Delhi']*5,
  :staff       => [15,20]*5
})
df
Out[24]:
Daru::DataFrame:4800060 rows: 10 cols: 4
   city   sales  staff  temperature
0  Pune   350    15     30.4
1  Delhi  150    20     23.5
2  Pune   500    15     44.5
3  Delhi  200    20     20.3
4  Pune   480    15     34
5  Delhi  250    20     24
6  Pune   330    15     31.45
7  Delhi  400    20     28.34
8  Pune   420    15     37
9  Delhi  560    20     24

The plot below is between Temperature in the city and the sales of ice cream.

In [25]:
df.plot(type: :scatter, x: :temperature, y: :sales) do |plot, diagram|
  plot.x_label "Temperature"
  plot.y_label "Sales"
  plot.yrange [100, 600]
  plot.xrange [15, 50]
  diagram.tooltip_contents([:city, :staff])
  # Set the color scheme for this diagram.
  diagram.color(Nyaplot::Colors.qual) 
  # Color each point according to the city it belongs to.
  diagram.fill_by(:city)
  # Choose the shape of each point according to the city it belongs to.
  diagram.shape_by(:city)
end

Use with GNU plot

Plotting a time series with its rolling mean

Initialize a random number generator that draws from a normal distribution.

In [26]:
rng = Distribution::Normal.rng
Out[26]:
#<Proc:0x0000000368b250@/home/ubuntu/.rvm/gems/ruby-2.2.1/gems/distribution-0.7.3/lib/distribution/normal/gsl.rb:8 (lambda)>
In [27]:
index  = Daru::DateTimeIndex.date_range(:start => '2012-4-2', :periods => 1000)
vector = Daru::Vector.new(1000.times.map {rng.call}, index: index)
vector = vector.cumsum
rolling_mean = vector.rolling_mean 60

GnuplotRB::Plot.new(
  [vector      , with: 'lines', title: 'Vector'], 
  [rolling_mean, with: 'lines', title: 'Rolling Mean'],
  xlabel: 'Time', ylabel: 'Value'
)
Out[27]:
[Plot: 'Vector' and 'Rolling Mean' drawn as lines. x-axis: Time (Apr 2012 to Jan 2015). y-axis: Value (-15 to 25).]
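
The two transformations used in the cell above, a cumulative sum followed by a rolling mean, can be sketched in plain Ruby on a short series. The helper names are chosen for this illustration and the code is not daru's implementation. Padding the first window - 1 positions with nil is an assumption of the sketch, made so that the result has the same length as the input.

```ruby
# Running total of the values: element i holds values[0..i].sum.
def cumulative_sum(values)
  total = 0
  values.map { |v| total += v }
end

# Mean of each window of `window` consecutive values. The first
# (window - 1) slots have no full window and are padded with nil here;
# that padding is an assumption of this sketch.
def rolling_mean(values, window)
  padding = Array.new(window - 1)
  padding + values.each_cons(window).map { |w| w.sum / window.to_f }
end

steps = [1, -1, 2, 0, 3]
walk = cumulative_sum(steps)
smoothed = rolling_mean(walk, 3)
```

Summing random steps turns noise into a wandering series, and averaging over a window smooths it again, which is why the rolling mean curve in the plot follows the vector with less jitter.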

Arel-like syntax

Web devs will feel right at home!

Fast and intuitive syntax for retrieving data with boolean indexing.

The 'where' clause

In [28]:
df = Daru::DataFrame.new({
  a: [1,2,3,4,5,6]*100,
  b: ['a','b','c','d','e','f']*100,
  c: [11,22,33,44,55,66]*100
}, index: (1..600).to_a.shuffle)
df
Out[28]:
Daru::DataFrame:5195920 rows: 600 cols: 3
     a    b    c
102  1    a    11
177  2    b    22
354  3    c    33
163  4    d    44
230  5    e    55
332  6    f    66
171  1    a    11
123  2    b    22
470  3    c    33
471  4    d    44
309  5    e    55
23   6    f    66
15   1    a    11
26   2    b    22
312  3    c    33
484  4    d    44
386  5    e    55
72   6    f    66
506  1    a    11
96   2    b    22
183  3    c    33
90   4    d    44
451  5    e    55
278  6    f    66
529  1    a    11
87   2    b    22
256  3    c    33
415  4    d    44
421  5    e    55
485  6    f    66
139  1    a    11
482  2    b    22
...  ...  ...  ...
513  6    f    66

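The where clause takes a boolean expression built from comparisons against scalar values (here #eq, combined with #or) and returns a DataFrame containing only the rows for which the expression is true.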

In [29]:
df.where(df[:a].eq(2).or(df[:c].eq(55)))
Out[29]:
Daru::DataFrame:14856680 rows: 200 cols: 3
     a    b    c
177  2    b    22
230  5    e    55
123  2    b    22
309  5    e    55
26   2    b    22
386  5    e    55
96   2    b    22
451  5    e    55
87   2    b    22
421  5    e    55
482  2    b    22
254  5    e    55
52   2    b    22
282  5    e    55
267  2    b    22
304  5    e    55
36   2    b    22
424  5    e    55
303  2    b    22
353  5    e    55
376  2    b    22
115  5    e    55
55   2    b    22
7    5    e    55
478  2    b    22
239  5    e    55
356  2    b    22
530  5    e    55
99   2    b    22
81   5    e    55
595  2    b    22
436  5    e    55
...  ...  ...  ...
532  5    e    55
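
Boolean indexing of this kind can be broken into three steps: compare each column against a scalar to get one boolean per row, combine the boolean lists element-wise, and keep the rows where the combined value is true. The plain-Ruby sketch below walks through those steps on the same column data, without the shuffled index. It illustrates the idea only and is not how daru implements #where.

```ruby
# Column data mirroring the DataFrame above (without the shuffled index).
a = [1, 2, 3, 4, 5, 6] * 100
b = %w[a b c d e f] * 100
c = [11, 22, 33, 44, 55, 66] * 100

# Step 1: each comparison yields one boolean per row.
a_is_2 = a.map { |x| x == 2 }
c_is_55 = c.map { |x| x == 55 }

# Step 2: combine the masks element-wise with a logical OR.
mask = a_is_2.zip(c_is_55).map { |left, right| left || right }

# Step 3: keep only the rows whose mask entry is true.
rows = a.zip(b, c)
selected = rows.zip(mask).select { |_row, keep| keep }.map(&:first)
```

Each of the 100 repetitions of the six-row pattern contributes one row with a equal to 2 and one row with c equal to 55, which accounts for the 200 rows in the output above.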