We aim to fit a logistic regression model to the shelter animal data from Kaggle using the Ruby gems daru and statsample-glm.
Let's first load the data.
require 'daru'
shelter_data = Daru::DataFrame.from_csv 'data/animal_shelter_train.csv'
p shelter_data.shape
shelter_data.head(3)
We need to tell Daru which vectors are categorical. We can do this with #to_category.
shelter_data.to_category 'OutcomeType', 'OutcomeSubtype', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'
nil
We create a 0-1-valued indicator for whether the animal got adopted. We will then create a logistic model to predict whether an animal got adopted or not.
shelter_data['OutcomeType_Adoption'] = (shelter_data['OutcomeType'].contrast_code)['OutcomeType_Adoption']
shelter_data.head 3
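To make the idea of a 0-1 indicator concrete, here is a minimal plain-Ruby sketch that does not use daru at all. The outcome values are made-up sample data, and the variable names `outcomes` and `adopted` are only illustrative; it shows the result we expect the indicator column to hold, not how #contrast_code works internally.

```ruby
# Plain-Ruby sketch of a 0/1 adoption indicator (not daru's contrast coding).
# The outcome values below are made-up sample data.
outcomes = ['Adoption', 'Transfer', 'Adoption', 'Euthanasia', 'Return_to_owner']

# 1 where the outcome is 'Adoption', 0 otherwise.
adopted = outcomes.map { |o| o == 'Adoption' ? 1 : 0 }

p adopted
```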
Before we create a model, let's do some preprocessing so that the model is more effective.
I am using only 600 rows for this demo because statsample-glm is a bit slow at fitting.
small = shelter_data.head 600
small.head 3
p small['Breed'].categories.size, small['Color'].categories.size
Since the number of categories in 'Breed' and 'Color' is large, we need to club some of these categories together.
Let's have a look at the distribution.
small['Breed'].frequencies.sort(ascending: false).head(10)
Let's merge the infrequently occurring categories into a single category, 'other', so that we have fewer categories to deal with.
Here we use #rename_categories, which accepts a hash mapping old categories to new ones.
other_cats = small['Breed'].categories.select { |i| small['Breed'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
small['Breed'].rename_categories other_cats_hash
small['Breed'].frequencies
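The same lumping step can be sketched in plain Ruby, without daru, to show what the old-to-new mapping looks like. The breed list and the threshold of 2 are made up for illustration (the notebook uses a threshold of 10 on the real data), and `min_count`, `rename_map`, and `lumped` are hypothetical names.

```ruby
# Plain-Ruby sketch of lumping rare categories into 'other'.
# The breed list and the threshold are made up for illustration.
breeds = ['Pit Bull Mix', 'Domestic Shorthair Mix', 'Beagle', 'Pit Bull Mix',
          'Domestic Shorthair Mix', 'Domestic Shorthair Mix', 'Pug']
min_count = 2

# Count how often each category occurs.
counts = breeds.each_with_object(Hash.new(0)) { |b, h| h[b] += 1 }

# Build a hash mapping every rare category to 'other'.
rename_map = counts.select { |_, n| n < min_count }
                   .keys
                   .map { |b| [b, 'other'] }
                   .to_h

# Apply the mapping; categories that are not in it are kept as they are.
lumped = breeds.map { |b| rename_map.fetch(b, b) }

p rename_map
p lumped
```

Only the rare categories appear in the mapping, so frequent categories pass through unchanged.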
And let's set the base category to 'other'.
small['Breed'].base_category = 'other'
We now do the same with 'Color'.
p small['Color'].categories.size
small['Color'].frequencies.sort(ascending: false).head 10
other_cats = small['Color'].categories.select { |i| small['Color'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
small['Color'].rename_categories other_cats_hash
small['Color'].frequencies
small['Color'].base_category = 'other'
small['SexuponOutcome'].frequencies
The last row tells us that there is an entry whose category is nil. Let's rename this category to 'Unknown', because 'Unknown' stores all the unknown values.
p small['SexuponOutcome'].categories
small['SexuponOutcome'].rename_categories nil => 'Unknown'
small['SexuponOutcome'].categories
train = small.head 500
test = small.tail 100
p train.size, test.size
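Now that the data is in an appropriate form, we can fit the logistic regression model with statsample-glm. As a baseline, we first compute the trivial accuracy obtained by always predicting the more common outcome in the test set.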
m = test['OutcomeType_Adoption'].mean
"Trivial accuracy = #{[m, 1-m].max}"
require 'statsample-glm'
formula = 'OutcomeType_Adoption~AnimalType+Breed+AgeuponOutcome(Weeks)+Color+SexuponOutcome'
glm_adoption = Statsample::GLM::Regression.new formula, train, :logistic
glm_adoption.df_for_regression.head 5
glm_adoption.model.coefficients :hash
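To illustrate how logistic regression coefficients turn into a predicted probability, here is a small plain-Ruby sketch. The coefficient names and values are hypothetical and are not the ones fitted above; `sigmoid`, `features`, and `linear` are illustrative names, not part of statsample-glm.

```ruby
# Plain-Ruby sketch: from coefficients to a predicted probability.
# These coefficients are hypothetical, NOT the fitted values from the model above.
coefficients = { 'intercept' => -1.0, 'AnimalType_Dog' => 0.5, 'Age_weeks' => -0.01 }

# One made-up animal: a dog that is 50 weeks old.
features = { 'AnimalType_Dog' => 1.0, 'Age_weeks' => 50.0 }

# The logistic (sigmoid) function maps any real number into (0, 1).
def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Linear predictor: intercept plus the sum of coefficient * feature value.
linear = coefficients['intercept'] +
         features.sum { |name, x| coefficients[name] * x }

prob = sigmoid(linear)
puts "Predicted adoption probability: #{prob.round(3)}"
```

A positive coefficient raises the predicted probability as its feature grows, and a negative one lowers it.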
We can also predict using the model we just created. We classify an animal as adopted when its predicted probability is at least 0.5.
predict = glm_adoption.predict test
predict.map! { |i| i < 0.5 ? 0 : 1 }
predict.head 5
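A natural follow-up is to compare the thresholded predictions with the actual outcomes and with the trivial accuracy. The sketch below does this in plain Ruby on made-up numbers; the arrays `probabilities` and `actual` are hypothetical stand-ins for the model output and the test labels, not results from this notebook.

```ruby
# Plain-Ruby sketch: accuracy of thresholded predictions vs. the trivial baseline.
# Both arrays are made-up stand-ins, not results from the model above.
probabilities = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.2]
actual        = [1,   0,   0,   1,   0,   0,   1,   0]

# Same rule as above: below 0.5 becomes 0, otherwise 1.
predicted = probabilities.map { |prob| prob < 0.5 ? 0 : 1 }

# Fraction of predictions that match the actual labels.
correct  = predicted.zip(actual).count { |yhat, y| yhat == y }
accuracy = correct.to_f / actual.size

# Trivial accuracy: always predict the more common class.
m = actual.sum.to_f / actual.size
baseline = [m, 1 - m].max

puts "Model accuracy   = #{accuracy}"
puts "Trivial accuracy = #{baseline}"
```

A model is only useful for classification if its accuracy on held-out data exceeds the trivial accuracy.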