This notebook describes how to work with categorical data in daru. Its applicability is currently limited because regression is not yet supported with categorical data.

In [1]:
require 'daru'
Out[1]:
true

This is animal shelter data taken from a Kaggle competition.

It covers animals that were given up by their owners to a shelter. Let's gain some insight into this data.

In [2]:
shelter_data = Daru::DataFrame.from_csv '../data/animal_shelter_train.csv'
shelter_data.head(3)
Out[2]:
Daru::DataFrame(3x10)
  | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks)
0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0
1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0
2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0
In [3]:
shelter_data.shape
Out[3]:
[26711, 10]

We are not interested in DateTime, AnimalID and OutcomeSubtype, so we will delete them.

Since OutcomeType, AnimalType, SexuponOutcome, Breed and Color are qualitative variables, we'll convert them to the category type.

In [4]:
shelter_data.delete_vectors 'DateTime', 'AnimalID', 'OutcomeSubtype'
shelter_data.to_category 'OutcomeType', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'
shelter_data.first 5
Out[4]:
Daru::DataFrame(5x7)
  | Name | OutcomeType | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks)
0 | Hambone | Return_to_owner | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0
1 | Emily | Euthanasia | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0
2 | Pearce | Adoption | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0
3 | | Transfer | Cat | Intact Male | Domestic Shorthair Mix | Blue Cream | 3.0
4 | | Transfer | Dog | Neutered Male | Lhasa Apso/Miniature Poodle | Tan | 104.0
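
To make the idea of a category type more concrete, here is a minimal plain-Ruby sketch of how a qualitative column can be stored as a list of distinct categories plus one integer code per row. This only illustrates the concept and is not daru's internal implementation; `encode_categories` is a hypothetical helper introduced for this example.

```ruby
# Plain-Ruby illustration of categorical encoding (not daru's internals).
# `encode_categories` is a hypothetical helper, not part of daru.
def encode_categories(values)
  categories = values.uniq                     # distinct labels, in order of first appearance
  index_of   = categories.each_with_index.to_h # label => integer code
  codes      = values.map { |v| index_of[v] }  # one integer code per row
  { categories: categories, codes: codes }
end

animal_types = %w[Dog Cat Dog Cat Dog] # AnimalType of the five rows shown above
encoded = encode_categories(animal_types)
p encoded[:categories]
p encoded[:codes]
```

Storing a small set of categories once and referring to them by code is what makes counting and grouping by category cheap.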

We'll bin AgeuponOutcome(Weeks) into age categories to get a quick summary of the ages (as we will see later).

In [5]:
shelter_data['AgeuponOutcome'] = shelter_data['AgeuponOutcome(Weeks)'].cut(
  [0, 1, 4, 52, 260, 1500],
  labels: [:less_than_week, :less_than_month, :less_than_year, :one_to_five_years, :more_than__five_years]
)
shelter_data.delete_vector 'AgeuponOutcome(Weeks)'
nil
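
The binning step can be sketched in plain Ruby to show what turning a numeric column into labelled intervals involves. This is a rough illustration under stated assumptions, not daru's `cut` implementation: `bin_label` is a hypothetical helper, and it assumes each bin includes its left edge and excludes its right edge, which may differ from how daru treats values that fall exactly on a boundary.

```ruby
# Hypothetical helper: maps a number to the label of the bin containing it.
# Assumption: bins are half-open, [left, right). daru's boundary rule may differ.
def bin_label(value, edges, labels)
  # each_cons(2) yields consecutive edge pairs: [0, 1], [1, 4], [4, 52], ...
  idx = edges.each_cons(2).find_index { |(left, right)| value >= left && value < right }
  idx && labels[idx] # nil when the value falls outside every bin
end

edges  = [0, 1, 4, 52, 260, 1500]
labels = %i[less_than_week less_than_month less_than_year
            one_to_five_years more_than__five_years]

binned = [3.0, 52.0, 104.0].map { |weeks| bin_label(weeks, edges, labels) }
p binned
```

Six edges define five bins, which is why the call above passes exactly five labels.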

Let's look at the categories we have formed.

In [6]:
shelter_data['AgeuponOutcome'].frequencies.sort ascending: false
Out[6]:
Daru::Vector(5)
one_to_five_years 10605
less_than_year 9965
more_than__five_years 4216
less_than_month 1505
less_than_week 420

Say we are interested in the percentage of each animal type in the shelter.

In [7]:
shelter_data['AnimalType'].frequencies :percentage
Out[7]:
Daru::Vector(2)
AnimalType
Dog 58.38044251431994
Cat 41.61955748568006

This tells us that our dataset is about 58% dogs and 42% cats. Let's explore further.

Let's look at the possible outcomes along with their frequencies.

In [8]:
shelter_data['OutcomeType'].frequencies
Out[8]:
Daru::Vector(5)
OutcomeType
Return_to_owner 4786
Euthanasia 1553
Adoption 10769
Transfer 9406
Died 197

So a large number of these animals (about 40%) are adopted, which is great.
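
To put the raw counts in perspective, they can be converted to percentage shares. The sketch below does this in plain Ruby using the counts printed in Out[8]; it does not call daru, and the variable names are chosen only for this example.

```ruby
# Outcome counts copied from Out[8].
outcome_counts = {
  'Return_to_owner' => 4786,
  'Euthanasia'      => 1553,
  'Adoption'        => 10769,
  'Transfer'        => 9406,
  'Died'            => 197
}

total  = outcome_counts.values.sum
# Share of each outcome as a percentage, rounded to one decimal place.
shares = outcome_counts.transform_values { |count| (100.0 * count / total).round(1) }

# Print the outcomes from most to least common.
shares.sort_by { |_, pct| -pct }.each { |outcome, pct| puts "#{outcome}: #{pct}%" }
```

The total matches the 26711 rows reported by `shelter_data.shape`, so every animal has exactly one recorded outcome.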

Let's get some insight into the animals that died.

In [9]:
died = shelter_data.where shelter_data['OutcomeType'].eq('Died')
died['AnimalType'].frequencies :percentage
Out[9]:
Daru::Vector(2)
AnimalType
Dog 25.380710659898476
Cat 74.61928934010153

Hmm... Cats seem more prone to die than dogs: they make up about 75% of the animals that died but only about 42% of the dataset.
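
One way to quantify this comparison is to divide each animal type's share among the animals that died by its share in the whole dataset. A ratio above 1 means the type is overrepresented among deaths. The sketch below uses the percentages from Out[7] and Out[9], rounded to two decimals; the variable names are chosen only for this example.

```ruby
share_in_dataset = { 'Dog' => 58.38, 'Cat' => 41.62 } # from Out[7], rounded
share_among_died = { 'Dog' => 25.38, 'Cat' => 74.62 } # from Out[9], rounded

# Ratio > 1: overrepresented among deaths; ratio < 1: underrepresented.
representation = share_among_died.map do |animal, pct|
  [animal, (pct / share_in_dataset[animal]).round(2)]
end.to_h

p representation
```

This is a descriptive comparison only; it does not account for age, health or other factors that may differ between cats and dogs.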

Let's get some insight into the ages of the cats and dogs that died.

In [10]:
died.where(died['AnimalType'].eq 'Dog')['AgeuponOutcome'].frequencies :percentage
Out[10]:
Daru::Vector(5)
less_than_week 12.0
less_than_month 4.0
less_than_year 24.0
one_to_five_years 40.0
more_than__five_years 20.0
In [11]:
died.where(died['AnimalType'].eq 'Cat')['AgeuponOutcome'].frequencies :percentage
Out[11]:
Daru::Vector(5)
less_than_week 11.564625850340136
less_than_month 12.244897959183673
less_than_year 57.14285714285714
one_to_five_years 12.244897959183673
more_than__five_years 6.802721088435375

Younger cats are also more prone to die: about 81% of the cats that died were less than a year old, compared with 40% of the dogs.
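
The share of deaths under one year of age can be checked by adding up the first three age categories for each animal type. The sketch below uses the percentages from Out[10] and Out[11], with the cat figures rounded to two decimals; the variable names are chosen only for this example.

```ruby
# Percentage of deaths per age category, from Out[10] (dogs) and Out[11] (cats, rounded).
dog_deaths = { less_than_week: 12.0, less_than_month: 4.0, less_than_year: 24.0,
               one_to_five_years: 40.0, more_than__five_years: 20.0 }
cat_deaths = { less_than_week: 11.56, less_than_month: 12.24, less_than_year: 57.14,
               one_to_five_years: 12.24, more_than__five_years: 6.80 }

under_one_year = %i[less_than_week less_than_month less_than_year]

dog_young = dog_deaths.values_at(*under_one_year).sum.round(2)
cat_young = cat_deaths.values_at(*under_one_year).sum.round(2)

puts "Dogs that died under one year: #{dog_young}%"
puts "Cats that died under one year: #{cat_young}%"
```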

Let's turn our attention to the animals that got adopted.

In [12]:
adopted = shelter_data.where shelter_data['OutcomeType'].eq('Adoption')
adopted['AnimalType'].frequencies :percentage
Out[12]:
Daru::Vector(2)
AnimalType
Dog 60.33057851239669
Cat 39.66942148760331

Hmm... Dogs account for about 60% of adoptions, slightly more than their 58% share of the dataset. Maybe that partly explains why so many cats die.

Let's now look at the animals that were returned to their owners.

In [13]:
owner = shelter_data.where shelter_data['OutcomeType'].eq('Return_to_owner')
owner['AnimalType'].frequencies :percentage
Out[13]:
Daru::Vector(2)
AnimalType
Dog 89.55286251567071
Cat 10.447137484329295

Astonishingly, about 90% of the animals returned to their owners are dogs, while only about 10% are cats.
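
Note that this is the share of dogs among returned animals, not the share of dogs that get returned. The second figure can be recovered from numbers already shown in this notebook. The sketch below is plain Ruby with variable names chosen only for this example, and it assumes that every row has an AnimalType of either Dog or Cat, as Out[7] suggests.

```ruby
total_animals      = 26_711              # from Out[3]
returned_total     = 4_786               # Return_to_owner count from Out[8]
dog_share_all      = 0.5838044251431994  # Out[7], as a fraction
dog_share_returned = 0.8955286251567071  # Out[13], as a fraction

# Recover absolute counts from the shares (assumes only Dog and Cat occur).
dogs_total    = (dog_share_all * total_animals).round
dogs_returned = (dog_share_returned * returned_total).round
cats_total    = total_animals - dogs_total
cats_returned = returned_total - dogs_returned

# Percentage of each animal type that was returned to its owner.
dog_return_rate = (100.0 * dogs_returned / dogs_total).round(1)
cat_return_rate = (100.0 * cats_returned / cats_total).round(1)

puts "Dogs returned to owner: #{dog_return_rate}%"
puts "Cats returned to owner: #{cat_return_rate}%"
```

Under these assumptions, roughly one in four dogs in the dataset goes back to its owner, compared with fewer than one in twenty cats.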