This notebook explains the formula language of Statsample-GLM and its ability to handle categorical data in regression.

This notebook is based on a notebook created by Alexej.

Logistic regression with categorical data

We aim to fit a logistic regression model to the shelter animal data from Kaggle using the Ruby gems daru and statsample-glm.

Let's first load the data.

require 'daru'
shelter_data = Daru::DataFrame.from_csv 'data/animal_shelter_train.csv'
p shelter_data.shape
shelter_data.head(3)

[26711, 10]

We need to tell daru which vectors are categorical. We can do this via #to_category.

shelter_data.to_category 'OutcomeType', 'OutcomeSubtype', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'
nil

We create a 0-1-valued indicator for whether the animal got adopted. We will then create a logistic model to predict whether an animal got adopted or not.

shelter_data['OutcomeType_Adoption'] = (shelter_data['OutcomeType'].contrast_code)['OutcomeType_Adoption']
shelter_data.head 3
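To make the indicator concrete, here is a minimal plain-Ruby sketch of what such a 0-1 column contains. It does not use daru and is not how daru computes the column. The `adoption_indicator` helper is hypothetical and exists only for illustration.

```ruby
# Hypothetical helper (not part of daru): 1 where the outcome equals the
# target category, 0 everywhere else.
def adoption_indicator(outcomes, target = 'Adoption')
  outcomes.map { |outcome| outcome == target ? 1 : 0 }
end

# Outcome types of the first three rows of the data set.
outcomes = ['Return_to_owner', 'Euthanasia', 'Adoption']
p adoption_indicator(outcomes)  # prints [0, 0, 1]
```

Only the third animal was adopted, so only the third entry is 1.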

Before we create a model, let's do some preprocessing to make it more effective.

Some data preprocessing

I am using only the first 600 rows for this demo because Statsample-GLM is a bit slow.

small = shelter_data.head 600
small.head 3

p small['Breed'].categories.size, small['Color'].categories.size

1380
366

[1380, 366]

Since the number of categories in 'Breed' and 'Color' is large, we need to merge some of these categories.

Grouping Breeds

Let's have a look at the distribution.

small['Breed'].frequencies.sort(ascending: false).head(10)

Let's merge the infrequently occurring categories into a single category 'other' so that we have fewer categories to deal with.

Here we use #rename_categories, which accepts a hash mapping old categories to new ones.

other_cats = small['Breed'].categories.select { |i| small['Breed'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
small['Breed'].rename_categories other_cats_hash
small['Breed'].frequencies
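The merging step can also be sketched in plain Ruby, without daru, to show the idea on a small array. The `lump_rare` helper and the sample values are hypothetical. The notebook itself does the equivalent with #rename_categories and a threshold of 10.

```ruby
# Hypothetical helper (not part of daru): replace every value that occurs
# fewer than `min_count` times with a single catch-all label.
def lump_rare(values, min_count, other_label = 'other')
  counts = Hash.new(0)
  values.each { |value| counts[value] += 1 }
  values.map { |value| counts[value] < min_count ? other_label : value }
end

breeds = ['Pit Bull Mix', 'Pit Bull Mix', 'Siamese Mix', 'Pit Bull Mix', 'Dachshund Mix']
p lump_rare(breeds, 2)
# prints ["Pit Bull Mix", "Pit Bull Mix", "other", "Pit Bull Mix", "other"]
```

Fewer categories mean fewer dummy variables in the regression, which is the motivation for this step.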

And let's set the base category to 'other'.

small['Breed'].base_category = 'other'

"other"

We now do the same with 'Color'.

Grouping colors

p small['Color'].categories.size
small['Color'].frequencies.sort(ascending: false).head 10

366

other_cats = small['Color'].categories.select { |i| small['Color'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
small['Color'].rename_categories other_cats_hash
small['Color'].frequencies

small['Color'].base_category = 'other'

"other"

Looking at SexuponOutcome

small['SexuponOutcome'].frequencies

The last row tells us that there is an entry whose category is nil. Let's rename this category to 'Unknown', since 'Unknown' already holds all the unknown values.

p small['SexuponOutcome'].categories
small['SexuponOutcome'].rename_categories nil => 'Unknown'
small['SexuponOutcome'].categories

["Neutered Male", "Spayed Female", "Intact Male", "Intact Female", "Unknown", nil]

["Neutered Male", "Spayed Female", "Intact Male", "Intact Female", "Unknown"]

Split to train and test

train = small.head 500
test = small.tail 100
p train.size, test.size

500
100

[500, 100]

Model fit

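Now that the data is in an appropriate form, we can fit the logistic regression model with statsample-glm. As a baseline, we first compute the trivial accuracy: the accuracy of always predicting the more common outcome in the test set.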

m = test['OutcomeType_Adoption'].mean
"Trivial accuracy = #{[m, 1-m].max}"

"Trivial accuracy = 0.5900000000000001"
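The baseline computation can be checked on a small example. The following plain-Ruby sketch uses a hypothetical `trivial_accuracy` helper and made-up labels. It mirrors the `[m, 1-m].max` expression above on an ordinary array.

```ruby
# Hypothetical helper: accuracy obtained by always predicting the more
# common class of a 0-1 label array.
def trivial_accuracy(labels)
  positive_rate = labels.sum.to_f / labels.size
  [positive_rate, 1 - positive_rate].max
end

labels = [0, 0, 0, 1]       # made-up labels, one of four is positive
p trivial_accuracy(labels)  # prints 0.75
```

A fitted model is only useful for prediction if its accuracy on the test set exceeds this baseline.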

require 'statsample-glm'

formula = 'OutcomeType_Adoption~AnimalType+Breed+AgeuponOutcome(Weeks)+Color+SexuponOutcome'
glm_adoption = Statsample::GLM::Regression.new formula, train, :logistic
glm_adoption.df_for_regression.head 5
glm_adoption.model.coefficients :hash

{:AnimalType_Cat=>0.8376443692275163, :"Breed_Pit Bull Mix"=>0.28200753488859803, :"Breed_German Shepherd Mix"=>1.0518504638731023, :"Breed_Chihuahua Shorthair Mix"=>1.1960242033878856, :"Breed_Labrador Retriever Mix"=>0.445803000000512, :"Breed_Domestic Longhair Mix"=>1.898703165797653, :"Breed_Siamese Mix"=>1.5248210169271197, :"Breed_Domestic Medium Hair Mix"=>-0.19504965010288533, :Breed_other=>0.7895601504638325, :"Color_Blue/White"=>0.3748263925801828, :Color_Tan=>0.11356334165122918, :"Color_Black/Tan"=>-2.6507089126322114, :"Color_Blue Tabby"=>0.5234717706465536, :"Color_Brown Tabby"=>0.9046099720184905, :Color_White=>0.07739310267363662, :Color_Black=>0.859906249787038, :Color_Brown=>-0.003740755055106689, :"Color_Orange Tabby/White"=>0.2336674067343927, :"Color_Black/White"=>0.22564205490196415, :"Color_Brown Brindle/White"=>-0.6744314269278774, :"Color_Orange Tabby"=>2.063785952843677, :"Color_Chocolate/White"=>0.6417921901449108, :Color_Blue=>-2.1969040091451704, :Color_Calico=>-0.08386525532631824, :"Color_Brown/Black"=>0.35936722899161305, :Color_Tricolor=>-0.11440457799048752, :"Color_White/Black"=>-2.3593561796090383, :Color_Tortie=>-0.4325130799770577, :"Color_Tan/White"=>0.09637439333330515, :"Color_Brown Tabby/White"=>0.12304448360566177, :"Color_White/Brown"=>0.5867441296328475, :Color_other=>0.08821407092892847, :"SexuponOutcome_Spayed Female"=>0.32626712478395975, :"SexuponOutcome_Intact Male"=>-3.971505056680895, :"SexuponOutcome_Intact Female"=>-3.619095491410668, :SexuponOutcome_Unknown=>-102.73807712615843, :"AgeuponOutcome(Weeks)"=>-0.006959545305620043}
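In a logistic regression, each coefficient is a change in the log-odds of the outcome, so exponentiating it gives an odds ratio. The sketch below is plain Ruby with hypothetical helper names. It illustrates the standard interpretation only and does not depend on how statsample-glm codes base categories or the intercept.

```ruby
# Logistic (sigmoid) function: maps a value on the log-odds scale to a probability.
def logistic(log_odds)
  1.0 / (1.0 + Math.exp(-log_odds))
end

# Odds ratio implied by a logistic-regression coefficient.
def odds_ratio(coefficient)
  Math.exp(coefficient)
end

# Coefficient for AnimalType_Cat, copied from the output above.
cat_coefficient = 0.8376443692275163
puts odds_ratio(cat_coefficient).round(2)  # prints 2.31
puts logistic(0.0)                         # prints 0.5
```

Read this way, and holding the other predictors fixed, the fitted odds of adoption for a cat are roughly 2.3 times those for the reference animal type.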

We can also use the model we just created to predict on the test set. Each predicted probability is converted to a 0-1 label using a threshold of 0.5.

predict = glm_adoption.predict test
predict.map! { |i| i < 0.5 ? 0 : 1 }
predict.head 5
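To judge the predictions, they can be compared with the true labels and set against the trivial accuracy. Here is a minimal plain-Ruby sketch of the thresholding and an accuracy check. The helper names and all numbers are made up for illustration and are not results from this model.

```ruby
# Hypothetical helper: turn probabilities into 0-1 labels, as in the map! call above.
def to_labels(probabilities, threshold = 0.5)
  probabilities.map { |probability| probability < threshold ? 0 : 1 }
end

# Hypothetical helper: fraction of predictions that match the true labels.
def accuracy(predicted, actual)
  correct = predicted.zip(actual).count { |pred, act| pred == act }
  correct.to_f / actual.size
end

probabilities = [0.12, 0.48, 0.91, 0.30, 0.50]  # made-up model outputs
actual        = [0, 1, 1, 0, 1]                 # made-up true labels
predicted = to_labels(probabilities)
p predicted                    # prints [0, 0, 1, 0, 1]
p accuracy(predicted, actual)  # prints 0.8
```

The same comparison could be applied to the thresholded predictions and the 'OutcomeType_Adoption' column of the test set.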

Daru::DataFrame(3x10)
	AnimalID	Name	DateTime	OutcomeType	OutcomeSubtype	AnimalType	SexuponOutcome	Breed	Color	AgeuponOutcome(Weeks)
0	A671945	Hambone	2014-02-12 18:22:00	Return_to_owner		Dog	Neutered Male	Shetland Sheepdog Mix	Brown/White	52.0
1	A656520	Emily	2013-10-13 12:44:00	Euthanasia	Suffering	Cat	Spayed Female	Domestic Shorthair Mix	Cream Tabby	52.0
2	A686464	Pearce	2015-01-31 12:28:00	Adoption	Foster	Dog	Neutered Male	Pit Bull Mix	Blue/White	104.0

Daru::DataFrame(3x11)
	AnimalID	Name	DateTime	OutcomeType	OutcomeSubtype	AnimalType	SexuponOutcome	Breed	Color	AgeuponOutcome(Weeks)	OutcomeType_Adoption
0	A671945	Hambone	2014-02-12 18:22:00	Return_to_owner		Dog	Neutered Male	Shetland Sheepdog Mix	Brown/White	52.0	0
1	A656520	Emily	2013-10-13 12:44:00	Euthanasia	Suffering	Cat	Spayed Female	Domestic Shorthair Mix	Cream Tabby	52.0	0
2	A686464	Pearce	2015-01-31 12:28:00	Adoption	Foster	Dog	Neutered Male	Pit Bull Mix	Blue/White	104.0	1

Daru::Vector(10)
	Breed
Domestic Shorthair Mix	204
Chihuahua Shorthair Mix	47
Pit Bull Mix	38
Labrador Retriever Mix	33
Domestic Medium Hair Mix	17
Siamese Mix	11
Domestic Longhair Mix	11
German Shepherd Mix	10
Australian Cattle Dog Mix	8
Dachshund Mix	7

Daru::Vector(10)
	Color
Black/White	66
Black	52
Brown Tabby	37
Tricolor	22
Brown/White	21
Brown Tabby/White	20
Calico	19
White	19
Tan/White	18
Brown	16

Daru::Vector(5)
0	0
1	0
2	1
3	0
4	0