require 'daru'
require 'distribution'
require 'gnuplotrb'
A Daru::Vector is indexed by passing labels through the :index option and named with the :name option.
vector = Daru::Vector.new(
  [20,40,25,50,45,12],
  index: ['cherry', 'apple', 'barley', 'wheat', 'rice', 'sugar'],
  name: "Prices of stuff.")
Specify the index you want to retrieve in the #[] operator.
vector['rice']
Multiple values can be retrieved at the same time, as another Daru::Vector, by separating the indexes with commas.
vector['rice', 'wheat', 'sugar']
Specifying a range of indexes will retrieve a slice of the Daru::Vector
vector['barley'..'sugar']
Assign a value by specifying the index directly to the #[]= operator
vector['barley'] = 1500
vector
The :index option specifies the row index of the DataFrame, and the :order option determines the order in which the columns are stored.
Note that this is only one way of creating a DataFrame. There are around 8 different ways you can do so, depending on your use case.
df = Daru::DataFrame.new({
    'col0' => [1,2,3,4,5,6],
    'col2' => ['a','b','c','d','e','f'],
    'col1' => [11,22,33,44,55,66]
  },
  index: ['one', 'two', 'three', 'four', 'five', 'six'],
  order: ['col0', 'col1', 'col2']
)
A DataFrame column can be accessed using the DataFrame#[] operator.
Note that it returns a Daru::Vector
df['col1']
Multiple columns can be accessed by separating them with a comma. The result is another DataFrame.
df['col2', 'col0']
A slice of the DataFrame by columns can be obtained by specifying a Range in #[]
df['col1'..'col2']
You can assign a Daru::Vector to a column and the indexes of the Vector will be automatically matched to that of the DataFrame.
df['col1'] = Daru::Vector.new(['this', 'is', 'some','new','data','here'],
index: ['one', 'three','two','six','four', 'five'])
df
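The index matching described above can be pictured with plain Ruby hashes. This is only a sketch of the idea, not daru's implementation; `frame_index`, `incoming` and `aligned` are hypothetical names introduced for illustration.

```ruby
# Row labels of the DataFrame, in their stored order.
frame_index = ['one', 'two', 'three', 'four', 'five', 'six']

# The incoming vector, modelled as label => value pairs in the order
# in which they were supplied.
incoming = {
  'one' => 'this', 'three' => 'is', 'two' => 'some',
  'six' => 'new', 'four' => 'data', 'five' => 'here'
}

# Look each row label up in the incoming data, so the values follow the
# frame's order rather than the order in which they were supplied.
aligned = frame_index.map { |label| incoming[label] }

p aligned
```

The value labelled 'three' ends up in the third row even though it was supplied second.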
A single row can be accessed using the #row[] function.
df.row['four']
Specifying a Range of row indexes in #row[] will select a DataFrame with those rows.
df.row['three'..'five']
You can also assign to a row. Here an Array is assigned, and its values are matched to the columns in the order of the DataFrame.
df.row['five'] = [666,555,333]
A host of static and rolling statistics methods are provided on Daru::Vector.
Note that missing data (nil values, very common in real-world scenarios) is handled gracefully.
vector = Daru::Vector.new([1,3,5,nil,2,53,nil])
vector.mean
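One way to handle missing values gracefully is to leave nil entries out of the calculation. The plain-Ruby sketch below shows that approach for the data above; it assumes nil entries are simply excluded and is not taken from daru's source.

```ruby
values = [1, 3, 5, nil, 2, 53, nil]

# Drop the missing entries, then average what is left.
present = values.compact
mean = present.sum.to_f / present.size

puts mean
```

Only the five present values contribute, so the divisor is 5 rather than 7.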
DataFrame statistics apply the corresponding method to every numeric column of the DataFrame.
df.mean
Useful statistics about the vectors in a DataFrame can be observed with #describe
df.describe
Daru offers a robust time series manipulation API for indexing data based on timestamps. This makes daru a viable tool for analyzing financial data (or any data that changes with time).
The DateTimeIndex is a special index for indexing data based on timestamps.
A date index range can be created using the DateTimeIndex.date_range function. The :freq option sets the interval between consecutive timestamps in the date index.
index = Daru::DateTimeIndex.date_range(:start => '2012', :periods => 1000, :freq => '3D')
A Daru::Vector can be created by simply passing the newly created index object into the :index argument.
timeseries = Daru::Vector.new(1000.times.map {rand}, index: index)
When a Vector or DataFrame is indexed by a DateTimeIndex, you can specify a date partially to retrieve all the data that belongs to it.
For example, to access all the data belonging to the year 2012:
timeseries['2012']
Or to access the data whose timestamps fall in March 2012:
timeseries['2012-3']
Specifying the date precisely will return the exact data point. (You can also pass a Ruby DateTime object to obtain a data point precisely.)
timeseries['2012-3-10']
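The partial-date lookups above can be mimicked with Ruby's standard Date class. This is a rough sketch of the selection logic only, not how DateTimeIndex works internally. It assumes that a start of '2012' resolves to 1 January 2012, and `series`, `march_2012` and `exact` are hypothetical names.

```ruby
require 'date'

# 1000 dates spaced three days apart, starting on 1 January 2012
# (mirrors :start => '2012', :periods => 1000, :freq => '3D').
start = Date.new(2012, 1, 1)
dates = Array.new(1000) { |i| start + 3 * i }

# Pair every date with a random value, as the time series does.
series = dates.map { |date| [date, rand] }

# A partial date such as '2012-3' selects every entry in that month.
march_2012 = series.select { |date, _| date.year == 2012 && date.month == 3 }

# A full date selects a single entry, if that date is in the index.
exact = series.find { |date, _| date == Date.new(2012, 3, 10) }

puts march_2012.size
p exact
```

With a three-day step from 1 January, both 1 March and 10 March 2012 fall on the grid, so the exact lookup finds an entry.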
Say you have per-second data about the price of a commodity and want to access the prices for the minute starting at 12:42 pm on 23 March 2012:
index = Daru::DateTimeIndex.date_range(
:start => '2012-3-23 11:00', :periods => 20000, :freq => 'S')
seconds_ts = Daru::Vector.new(20000.times.map { rand(50) }, index: index)
seconds_ts['2012-3-23 12:42']
A simple scatter plot can be drawn directly from a DataFrame. Nyaplot integration provides interactivity.
The DataFrame below records the ice cream sales of a food chain against the maximum recorded temperature in the city, along with the staff strength in each city.
df = Daru::DataFrame.new({
:temperature => [30.4, 23.5, 44.5, 20.3, 34, 24, 31.45, 28.34, 37, 24],
:sales => [350, 150, 500, 200, 480, 250, 330, 400, 420, 560],
:city => ['Pune', 'Delhi']*5,
:staff => [15,20]*5
})
df
The plot below shows ice cream sales against the temperature in the city.
df.plot(type: :scatter, x: :temperature, y: :sales) do |plot, diagram|
plot.x_label "Temperature"
plot.y_label "Sales"
plot.yrange [100, 600]
plot.xrange [15, 50]
diagram.tooltip_contents([:city, :staff])
# Set the color scheme for this diagram.
diagram.color(Nyaplot::Colors.qual)
# Color each point according to the city it belongs to.
diagram.fill_by(:city)
# Shape each point according to the city it belongs to.
diagram.shape_by(:city)
end
rng = Distribution::Normal.rng
index = Daru::DateTimeIndex.date_range(:start => '2012-4-2', :periods => 1000)
vector = Daru::Vector.new(1000.times.map {rng.call}, index: index)
vector = vector.cumsum
rolling_mean = vector.rolling_mean 60
GnuplotRB::Plot.new(
[vector , with: 'lines', title: 'Vector'],
[rolling_mean, with: 'lines', title: 'Rolling Mean'],
xlabel: 'Time', ylabel: 'Value'
)
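The rolling mean plotted above averages each value with the ones before it inside a fixed window (60 in the call above). A minimal plain-Ruby sketch of a simple moving average follows. Padding the leading positions with nil is an assumption made here for illustration, and the method below is a hypothetical stand-in, not daru's implementation.

```ruby
# Simple moving average over a fixed window, in plain Ruby.
# Positions without a complete window behind them are filled with nil.
def rolling_mean(values, window)
  padding = Array.new([window - 1, values.size].min)
  means = values.each_cons(window).map { |slice| slice.sum.to_f / window }
  padding + means
end

p rolling_mean([1, 2, 3, 4, 5], 3)
```

A larger window smooths the series more but leaves more leading positions without a value.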
df = Daru::DataFrame.new({
a: [1,2,3,4,5,6]*100,
b: ['a','b','c','d','e','f']*100,
c: [11,22,33,44,55,66]*100
}, index: (1..600).to_a.shuffle)
df
The #where method compares columns against scalar values and returns a DataFrame containing the rows for which the combined condition is true.
df.where(df[:a].eq(2).or(df[:c].eq(55)))
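The query above can be read as building one boolean per row for each comparison, combining them, and keeping the rows where the result is true. The plain-Ruby sketch below illustrates that reading on a shortened version of the data. It is not daru's implementation, and `mask` and `selected_rows` are hypothetical names.

```ruby
# Two columns stored as plain arrays (one repetition of the data above).
a = [1, 2, 3, 4, 5, 6]
c = [11, 22, 33, 44, 55, 66]

# Each comparison yields one boolean per row.
a_is_2  = a.map { |value| value == 2 }
c_is_55 = c.map { |value| value == 55 }

# Combine the two masks element-wise with OR.
mask = a_is_2.zip(c_is_55).map { |left, right| left || right }

# Keep only the row positions where the combined mask is true.
selected_rows = mask.each_index.select { |i| mask[i] }

p selected_rows
```

Here the second row matches through column a and the fifth through column c.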