From Kaggle
In order to achieve this we have created a simulated data set with 200 variables and 20,000 cases. An ‘equation’ based on this data was created in order to generate a Target to be predicted. Given all 20,000 cases, the problem is very easy to solve – but you are only given the Target value for 250 cases – the task is to build a model that gives the best predictions on the remaining 19,750 cases.
import gzip
import requests
import zipfile
url = "https://dl.dropbox.com/s/lnly9gw8pb1xhir/overfitting.zip"
results = requests.get(url)
import StringIO
z = zipfile.ZipFile(StringIO.StringIO(results.content))  # treat the downloaded bytes as an in-memory zip archive
z.extractall()
z.namelist()
d = z.open('overfitting.csv')
d.readline()  # peek at the header row
import numpy as np
M = np.fromstring(d.read(), sep="," )  # this flattens everything into one long 1-D array
len(d.read())  # the file object is now exhausted, so this returns 0
np.fromstring?
data = np.loadtxt("overfitting.csv", delimiter=",", skiprows=1)  # loadtxt keeps the 2-D structure; skiprows=1 skips the header
print """
There are also 5 other fields,
case_id - 1 to 20,000, a unique identifier for each row
train - 1/0, this is a flag for the first 250 rows which are the training dataset
Target_Practice - we have provided all 20,000 Targets for this model, so you can develop your method completely off line.
Target_Leaderboard - only 250 Targets are provided. You submit your predictions for the remaining 19,750 to the Kaggle leaderboard.
Target_Evaluate - again only 250 Targets are provided. Those competitors who beat the 'benchmark' on the Leaderboard will be asked to make one further submission for the Evaluation model.
"""
data.shape
ix_training = data[:,1] == 1  # rows flagged as training data
ix_testing = data[:,1] == 0   # rows flagged as test data
training_data = data[ ix_training, 5: ]   # the 200 explanatory variables
testing_data = data[ ix_testing, 5: ]
training_labels = data[ ix_training, 2]   # Target_Practice (all 20,000 values are provided)
testing_labels = data[ ix_testing, 2]
print "training:", training_data.shape, training_labels.shape
print "testing: ", testing_data.shape, testing_labels.shape
The competition organizer mentions that the X variables are drawn from a Uniform distribution. Let's investigate this:
figsize( 12, 4 )
hist( training_data.flatten() )
print training_data.shape[0]*training_data.shape[1]
Looks pretty uniform.
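A quick numerical check too (a sketch: for a U(0,1) sample we expect a minimum near 0, a maximum near 1, a mean near 0.5 and a standard deviation near 1/sqrt(12) ≈ 0.289):
flat = training_data.flatten()
print "min: ", flat.min(), " max:", flat.max()
print "mean:", flat.mean(), " (expect ~0.5)"
print "std: ", flat.std(), " (expect ~0.289)"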
import pymc as mc
to_include = mc.Bernoulli( "to_include", 0.5, size= 200 )  # inclusion indicator for each of the 200 variables
coef = mc.Uniform( "coefs", 0, 1, size = 200 )             # a coefficient for each variable
@mc.deterministic
def Z( coef = coef, to_include = to_include, data = training_data ):
    # linear combination of only the included variables, centred so the mean is zero
    ym = np.dot( to_include*data, coef )
    return ym - ym.mean()
@mc.deterministic
def T( z = Z ):
    # squash the sign of z into a probability: z < 0 -> 0.045, z > 0 -> 0.945
    return 0.45*(np.sign(z) + 1.1)
obs = mc.Bernoulli( "obs", T, value = training_labels, observed = True)  # observed 0/1 targets for the 250 training rows
model = mc.Model( [to_include, coef, Z, T, obs] )
map_ = mc.MAP( model )
map_.fit()  # use the MAP estimate as a starting point for the MCMC
mcmc = mc.MCMC( model )
mcmc.sample(100000, 90000, 1)  # 100,000 samples, discarding the first 90,000 as burn-in
(np.round(T.value) == training_labels ).mean()  # in-sample accuracy at the final state of the chain
t_trace = mcmc.trace("T")[:]
(np.round( t_trace[-500:-400,:]).mean(axis=0) == training_labels ).mean()
t_mean = np.round( t_trace ).mean(axis=0)  # posterior mean of the rounded predictions for each training case
imshow(t_trace[-10000:,:], aspect="auto")  # each row is one posterior sample of T over the 250 training cases
colorbar()
figsize( 23, 8)
coef_trace = mcmc.trace("coefs")[:]
imshow(coef_trace[-10000:,:], aspect="auto", cmap=pyplot.cm.RdBu, interpolation="none")  # posterior samples of the 200 coefficients
include_trace = mcmc.trace("to_include")[:]
figsize( 23, 8)
imshow(include_trace[-10000:,:], aspect="auto", interpolation="none")  # posterior samples of the 200 inclusion flags
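One way to turn these posterior samples into predictions for the 19,750 held-out rows is to score them with the posterior means of the inclusion flags and coefficients, using the same sign rule as T above. This is a sketch of my own rather than part of the original analysis; since Target_Practice is supplied for every row, we can check the out-of-sample accuracy directly:
include_mean = include_trace.mean(axis=0)  # posterior probability that each variable is included
coef_mean = coef_trace.mean(axis=0)        # posterior mean of each coefficient
z_test = np.dot( testing_data*include_mean, coef_mean )
z_test = z_test - z_test.mean()
test_predictions = (z_test > 0).astype(int)  # same sign rule as T above
print "out-of-sample accuracy:", (test_predictions == testing_labels).mean()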