This notebook explains the formula language of Statsample-GLM and its ability to handle categorical data in regression.

This notebook is based on a notebook created by Alexej.

Logistic regression with categorical data

We aim to fit a logistic regression model to the shelter animal data from Kaggle using the Ruby gems daru and statsample-glm.

Let's first load the data.

In [1]:
require 'daru'
shelter_data = Daru::DataFrame.from_csv 'data/animal_shelter_train.csv'
p shelter_data.shape
shelter_data.head(3)
[26711, 10]
Out[1]:
Daru::DataFrame(3x10)
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome Breed Color AgeuponOutcome(Weeks)
0 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner Dog Neutered Male Shetland Sheepdog Mix Brown/White 52.0
1 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering Cat Spayed Female Domestic Shorthair Mix Cream Tabby 52.0
2 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster Dog Neutered Male Pit Bull Mix Blue/White 104.0

We need to tell Daru which vectors are categorical. We can do this via #to_category.

In [2]:
shelter_data.to_category 'OutcomeType', 'OutcomeSubtype', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'
nil

We create a 0-1-valued indicator for whether the animal got adopted. We will then create a logistic model to predict whether an animal got adopted or not.

In [3]:
shelter_data['OutcomeType_Adoption'] = (shelter_data['OutcomeType'].contrast_code)['OutcomeType_Adoption']
shelter_data.head 3
Out[3]:
Daru::DataFrame(3x11)
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome Breed Color AgeuponOutcome(Weeks) OutcomeType_Adoption
0 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner Dog Neutered Male Shetland Sheepdog Mix Brown/White 52.0 0
1 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering Cat Spayed Female Domestic Shorthair Mix Cream Tabby 52.0 0
2 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster Dog Neutered Male Pit Bull Mix Blue/White 104.0 1
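For readers new to indicator coding, here is a minimal plain-Ruby sketch of what such a 0-1 column looks like for one level of a categorical variable. It does not use Daru; the `indicator` helper is hypothetical, and the three sample values are the outcome types of the first rows shown above.

```ruby
# Hypothetical helper: 1 where the value equals the chosen level, else 0.
def indicator(values, level)
  values.map { |v| v == level ? 1 : 0 }
end

# Outcome types of the first three rows shown above.
outcomes = ['Return_to_owner', 'Euthanasia', 'Adoption']
adopted = indicator(outcomes, 'Adoption')
p adopted
```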

Before we create a model, let's do some preprocessing so that the model is more effective.

Some data preprocessing

I am using only the first 600 rows for this demo because Statsample-GLM is a bit slow at computing the fit.

In [4]:
small = shelter_data.head 600
small.head 3
Out[4]:
Daru::DataFrame(3x11)
AnimalID Name DateTime OutcomeType OutcomeSubtype AnimalType SexuponOutcome Breed Color AgeuponOutcome(Weeks) OutcomeType_Adoption
0 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner Dog Neutered Male Shetland Sheepdog Mix Brown/White 52.0 0
1 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering Cat Spayed Female Domestic Shorthair Mix Cream Tabby 52.0 0
2 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster Dog Neutered Male Pit Bull Mix Blue/White 104.0 1
In [5]:
p small['Breed'].categories.size, small['Color'].categories.size
1380
366
Out[5]:
[1380, 366]

Since the number of categories in 'Breed' and 'Color' is large, we need to club some of these categories together.

Grouping Breeds

Let's have a look at the distribution.

In [6]:
small['Breed'].frequencies.sort(ascending: false).head(10)
Out[6]:
Daru::Vector(10)
Breed
Domestic Shorthair Mix 204
Chihuahua Shorthair Mix 47
Pit Bull Mix 38
Labrador Retriever Mix 33
Domestic Medium Hair Mix 17
Siamese Mix 11
Domestic Longhair Mix 11
German Shepherd Mix 10
Australian Cattle Dog Mix 8
Dachshund Mix 7

Let's merge the infrequently occurring categories into a single category 'other' so that we have fewer categories to deal with.

Here we use #rename_categories, which accepts a hash mapping old categories to new ones.

In [7]:
other_cats = small['Breed'].categories.select { |i| small['Breed'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
small['Breed'].rename_categories other_cats_hash
small['Breed'].frequencies
Out[7]:
Daru::Vector(9)
Breed
Domestic Shorthair Mix 204
Pit Bull Mix 38
German Shepherd Mix 10
Chihuahua Shorthair Mix 47
Labrador Retriever Mix 33
Domestic Longhair Mix 11
Siamese Mix 11
Domestic Medium Hair Mix 17
other 229
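The same lumping idea can be sketched in plain Ruby, independent of Daru. The `lump_rare` helper and the tiny sample array are hypothetical and only illustrate the technique of replacing every level below a count threshold with one label.

```ruby
# Hypothetical helper: replace every level seen fewer than min_count
# times with a single catch-all label.
def lump_rare(values, min_count, label = 'other')
  counts = Hash.new(0)
  values.each { |v| counts[v] += 1 }
  values.map { |v| counts[v] < min_count ? label : v }
end

# Made-up sample: one breed occurs only once and gets lumped.
breeds = ['Pit Bull Mix'] * 3 + ['Siamese Mix'] * 2 + ['Dachshund Mix']
lumped = lump_rare(breeds, 2)
p lumped.uniq
```

The threshold plays the role of the `< 10` condition in the cell above; a higher threshold gives fewer, broader categories.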

And let's set the base category to 'other'.

In [8]:
small['Breed'].base_category = 'other'
Out[8]:
"other"
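As background on why a base category matters: in reference-level (dummy) coding, every level except the base gets its own 0-1 column, and the base level is represented by all columns being 0. The plain-Ruby sketch below shows this general idea; the `dummy_code` helper is hypothetical and is not meant to describe how Daru or Statsample-GLM implement it internally.

```ruby
# Hypothetical helper: one 0-1 column per level, except the base level.
def dummy_code(values, base)
  levels = values.uniq - [base]
  levels.each_with_object({}) do |level, columns|
    columns[level] = values.map { |v| v == level ? 1 : 0 }
  end
end

# Made-up sample with 'other' as the base level.
breeds = ['other', 'Pit Bull Mix', 'Siamese Mix', 'other']
columns = dummy_code(breeds, 'other')
p columns.keys
```

Coefficients of the remaining columns are then read relative to the base level.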

We now do the same with 'Color'.

Grouping Colors

In [9]:
p small['Color'].categories.size
small['Color'].frequencies.sort(ascending: false).head 10
366
Out[9]:
Daru::Vector(10)
Color
Black/White 66
Black 52
Brown Tabby 37
Tricolor 22
Brown/White 21
Brown Tabby/White 20
Calico 19
White 19
Tan/White 18
Brown 16
In [10]:
other_cats = small['Color'].categories.select { |i| small['Color'].count(i) < 10 }
other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h
small['Color'].rename_categories other_cats_hash
small['Color'].frequencies
Out[10]:
Daru::Vector(24)
Color
Brown/White 21
Blue/White 12
Tan 11
Black/Tan 14
Blue Tabby 10
Brown Tabby 37
White 19
Black 52
Brown 16
Orange Tabby/White 14
Black/White 66
Brown Brindle/White 10
Orange Tabby 15
Chocolate/White 11
Blue 10
Calico 19
Brown/Black 11
Tricolor 22
White/Black 10
Tortie 13
Tan/White 18
Brown Tabby/White 20
White/Brown 13
other 156
In [11]:
small['Color'].base_category = 'other'
Out[11]:
"other"

Looking at SexuponOutcome

In [12]:
small['SexuponOutcome'].frequencies
Out[12]:
Daru::Vector(6)
SexuponOutcome
Neutered Male 216
Spayed Female 205
Intact Male 78
Intact Female 77
Unknown 24
0

The last row tells us that there is a category 'nil'. Let's rename this category to 'Unknown', because 'Unknown' already stores all the unknown values.

In [13]:
p small['SexuponOutcome'].categories
small['SexuponOutcome'].rename_categories nil => 'Unknown'
small['SexuponOutcome'].categories
["Neutered Male", "Spayed Female", "Intact Male", "Intact Female", "Unknown", nil]
Out[13]:
["Neutered Male", "Spayed Female", "Intact Male", "Intact Female", "Unknown"]

Split into train and test

In [14]:
train = small.head 500
test = small.tail 100
p train.size, test.size
500
100
Out[14]:
[500, 100]
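Note that `head` and `tail` split the rows in file order. If the file happens to be ordered in some way, a shuffled split may be preferable. A minimal plain-Ruby sketch on row indices follows; the `split_indices` helper and the seed value are hypothetical choices for illustration.

```ruby
# Hypothetical helper: shuffle row indices reproducibly, then cut the
# shuffled list into a training part and a test part.
def split_indices(n_rows, n_train, seed)
  shuffled = (0...n_rows).to_a.shuffle(random: Random.new(seed))
  [shuffled.take(n_train), shuffled.drop(n_train)]
end

train_idx, test_idx = split_indices(600, 500, 42)
p train_idx.size, test_idx.size
```

Fixing the seed keeps the split reproducible between runs. Selecting the corresponding rows from the data frame is left out here.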

Model fit

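Now that the data is in an appropriate form, we can fit the logistic regression model with statsample-glm. As a baseline, we first compute the trivial accuracy obtained by always predicting the majority class of the test set.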

In [16]:
m = test['OutcomeType_Adoption'].mean
"Trivial accuracy = #{[m, 1-m].max}"
Out[16]:
"Trivial accuracy = 0.5900000000000001"
In [17]:
require 'statsample-glm'

formula = 'OutcomeType_Adoption~AnimalType+Breed+AgeuponOutcome(Weeks)+Color+SexuponOutcome'
glm_adoption = Statsample::GLM::Regression.new formula, train, :logistic
glm_adoption.df_for_regression.head 5
glm_adoption.model.coefficients :hash
Out[17]:
{:AnimalType_Cat=>0.8376443692275163, :"Breed_Pit Bull Mix"=>0.28200753488859803, :"Breed_German Shepherd Mix"=>1.0518504638731023, :"Breed_Chihuahua Shorthair Mix"=>1.1960242033878856, :"Breed_Labrador Retriever Mix"=>0.445803000000512, :"Breed_Domestic Longhair Mix"=>1.898703165797653, :"Breed_Siamese Mix"=>1.5248210169271197, :"Breed_Domestic Medium Hair Mix"=>-0.19504965010288533, :Breed_other=>0.7895601504638325, :"Color_Blue/White"=>0.3748263925801828, :Color_Tan=>0.11356334165122918, :"Color_Black/Tan"=>-2.6507089126322114, :"Color_Blue Tabby"=>0.5234717706465536, :"Color_Brown Tabby"=>0.9046099720184905, :Color_White=>0.07739310267363662, :Color_Black=>0.859906249787038, :Color_Brown=>-0.003740755055106689, :"Color_Orange Tabby/White"=>0.2336674067343927, :"Color_Black/White"=>0.22564205490196415, :"Color_Brown Brindle/White"=>-0.6744314269278774, :"Color_Orange Tabby"=>2.063785952843677, :"Color_Chocolate/White"=>0.6417921901449108, :Color_Blue=>-2.1969040091451704, :Color_Calico=>-0.08386525532631824, :"Color_Brown/Black"=>0.35936722899161305, :Color_Tricolor=>-0.11440457799048752, :"Color_White/Black"=>-2.3593561796090383, :Color_Tortie=>-0.4325130799770577, :"Color_Tan/White"=>0.09637439333330515, :"Color_Brown Tabby/White"=>0.12304448360566177, :"Color_White/Brown"=>0.5867441296328475, :Color_other=>0.08821407092892847, :"SexuponOutcome_Spayed Female"=>0.32626712478395975, :"SexuponOutcome_Intact Male"=>-3.971505056680895, :"SexuponOutcome_Intact Female"=>-3.619095491410668, :SexuponOutcome_Unknown=>-102.73807712615843, :"AgeuponOutcome(Weeks)"=>-0.006959545305620043}
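To make the coefficients more concrete: in a logistic model, the predicted probability is the logistic function applied to the linear predictor. The sketch below is plain Ruby. The two coefficients are rounded from the output above, while the intercept of 0.0 is a placeholder assumption (the hash above does not list one), so the resulting number is illustrative only.

```ruby
# Logistic (sigmoid) function: maps any real number into (0, 1).
def logistic(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Coefficients rounded from the output above.
# The intercept is an assumed placeholder, not a fitted value.
intercept = 0.0
coef_cat  = 0.8376   # AnimalType_Cat
coef_age  = -0.00696 # AgeuponOutcome(Weeks)

# Linear predictor for a 52-week-old cat, all other dummy columns set to 0.
z = intercept + coef_cat * 1 + coef_age * 52
probability = logistic(z)
puts probability.round(3)
```

A negative coefficient, such as the one for age, lowers the log-odds of adoption as the variable increases, with all other variables held fixed.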

We can also predict using the model we just created. Predicted probabilities below 0.5 are mapped to 0 (not adopted), and all others to 1 (adopted).

In [18]:
predict = glm_adoption.predict test
predict.map! { |i| i < 0.5 ? 0 : 1 }
predict.head 5
Out[18]:
Daru::Vector(5)
0 0
1 0
2 1
3 0
4 0
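A natural next step is to compare the predictions with the true labels. The plain-Ruby sketch below computes accuracy as the share of matching positions; the `accuracy` helper is hypothetical, and the two arrays are made-up labels for illustration, not results from the model above.

```ruby
# Hypothetical helper: share of positions where prediction and label agree.
def accuracy(predicted, actual)
  raise ArgumentError, 'length mismatch' unless predicted.size == actual.size
  hits = predicted.zip(actual).count { |pred, act| pred == act }
  hits.to_f / actual.size
end

# Made-up labels for illustration only.
predicted = [0, 0, 1, 0, 1]
actual    = [0, 1, 1, 0, 0]
puts accuracy(predicted, actual)
```

With the notebook's objects, the inputs would presumably be `predict.to_a` and `test['OutcomeType_Adoption'].to_a`, and the result could be compared with the trivial accuracy of 0.59 computed earlier.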