This is a fun example. Suppose we wish to estimate how many sign-ups there are on Github.com. Officially, Github does not release an up-to-date count, and at the last official announcement (January 2013) the count was 3 million. What if we wish to measure it today? We could extrapolate from previous announcements, but this uses little data, we could potentially be off by hundreds of thousands, and we would essentially just be curve fitting with complicated models.
Instead, we are going to use user id numbers from real-time feeds on Github. The script github_events.py will pull the most recent 300 events from the Github Public Timeline feed (we'll be accessing data using their API). From this, we pull out the user ids associated with each event. We run the script below and display some output:
%run github_events.py
print("Some User ids from the latest events (push, star, fork etc.) on Github.")
print(ids[:10])
print("")
print("Number of unique ids found: %d" % ids.shape[0])
print("Largest user id: %d" % ids.max())
figsize(12.5,3)
plt.hist( ids, bins = 45, alpha = 0.9)
plt.title("Histogram of %d Github User ids"%ids.shape[0] );
plt.xlabel("User id")
plt.ylabel("Frequency");
There are some users with multiple events, but we are only interested in unique user ids, which is why we have fewer than 300 ids. Above I printed the largest user id. Why is this important? If Github assigns user ids serially, which is a fair assumption, then we know that there are at least that many users. Remember, we are only looking at fewer than 300 individuals out of a much larger population, so it is unlikely that we would have sampled the most recent sign-up.
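The contents of github_events.py are not shown here, so the following is only a hypothetical sketch of how unique user ids could be collected. It assumes GitHub's public events endpoint (`https://api.github.com/events`) and that each event carries the user's id under `actor.id`; the names `fetch_events`, `unique_user_ids`, and `sample_events` are made up for illustration. The fetch function needs network access and is not called below; the extraction step runs on a hand-made payload.

```python
import json
from urllib.request import urlopen

# Assumption: GitHub's public events endpoint. The real script may use something else.
EVENTS_URL = "https://api.github.com/events?per_page=100&page=%d"


def fetch_events(pages=3):
    """Download up to `pages` pages of public events (needs network access)."""
    events = []
    for page in range(1, pages + 1):
        with urlopen(EVENTS_URL % page) as response:
            events.extend(json.loads(response.read().decode("utf-8")))
    return events


def unique_user_ids(events):
    """Return the sorted unique actor ids found in a list of event dicts."""
    return sorted({event["actor"]["id"] for event in events})


# Hand-made payload in the assumed shape, so the example runs offline.
sample_events = [
    {"type": "PushEvent", "actor": {"id": 3}},
    {"type": "WatchEvent", "actor": {"id": 10}},
    {"type": "ForkEvent", "actor": {"id": 3}},  # same user, second event
]

ids_demo = unique_user_ids(sample_events)
print(ids_demo, max(ids_demo))
```

Three events yield only two unique ids here, which mirrors why 300 events give fewer than 300 ids.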
At best, we can only estimate the total number of sign-ups. Let's get more familiar with this problem. Consider a fictional website whose number of users we wish to estimate:
Suppose we sampled only two individuals in a similar manner to above: the ids are 3 and 10 respectively. Would it be likely that the website has millions of users? Not very. It is far more likely that the website has fewer than 100 users.
On the other hand, if the ids were 3 and 34 989, we might be more willing to guess that there could be thousands, or even millions, of user sign-ups. With so little data, we cannot be very confident in any estimate.
If we sample thousands of users, and the maximum user id is still 34 989, then it seems likely that the total number of sign-ups is near 35 000. Hence our inference should be more confident.
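The intuition that a larger sample pins down the total more tightly can be checked with a small simulation. This is a sketch on a made-up site with exactly 35 000 users (the constant `N_USERS` and the helper `sample_max` are hypothetical), where every id is equally likely to be drawn.

```python
import random


def sample_max(n_users, n_sampled, rng):
    """Largest id among n_sampled ids drawn uniformly (with replacement) from 1..n_users."""
    return max(rng.randint(1, n_users) for _ in range(n_sampled))


rng = random.Random(0)  # fixed seed so reruns give the same draws
N_USERS = 35000         # hypothetical true number of sign-ups

for n_sampled in (2, 300, 5000):
    m = sample_max(N_USERS, n_sampled, rng)
    print("sampled %5d ids: max id = %5d, shortfall = %d" % (n_sampled, m, N_USERS - m))
```

The sample maximum can never exceed the true count, and its shortfall shrinks as more ids are sampled.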
We make the following assumption:
Assumption: Every user is equally likely to perform an event. Clearly, looking at the above histogram, this assumption is violated. Participation on Github is skewed towards early adopters, likely because these individuals a) have more invested in Github and b) saw its value earlier, and are therefore more interested in it. The distribution is also skewed towards new sign-ups, who likely signed up just to push a project.
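One rough way to put a number on "violated" is a Pearson chi-square statistic comparing binned id counts with the equal counts a uniform sample would give. The sketch below uses synthetic ids only: `MAX_ID`, the 40/40/20 mixture, and the helper `chi_square_uniform` are hypothetical choices for illustration, not measurements from Github.

```python
import random


def chi_square_uniform(ids, max_id, n_bins=10):
    """Pearson chi-square statistic for ids binned into equal-width bins on 1..max_id."""
    counts = [0] * n_bins
    for user_id in ids:
        # Map an id in 1..max_id to a bin index in 0..n_bins-1.
        bin_index = min((user_id - 1) * n_bins // max_id, n_bins - 1)
        counts[bin_index] += 1
    expected = len(ids) / float(n_bins)
    return sum((c - expected) ** 2 / expected for c in counts)


rng = random.Random(0)
MAX_ID = 4000000  # hypothetical largest id

uniform_ids = [rng.randint(1, MAX_ID) for _ in range(300)]
# Hypothetical skew: 40% early adopters, 40% recent sign-ups, 20% everyone else.
skewed_ids = (
    [rng.randint(1, MAX_ID // 10) for _ in range(120)]
    + [rng.randint(9 * MAX_ID // 10 + 1, MAX_ID) for _ in range(120)]
    + [rng.randint(1, MAX_ID) for _ in range(60)]
)

stat_uniform = chi_square_uniform(uniform_ids, MAX_ID)
stat_skewed = chi_square_uniform(skewed_ids, MAX_ID)
print(stat_uniform, stat_skewed)
```

With 10 bins there are 9 degrees of freedom, for which a statistic above roughly 27.88 would occur less than 0.1% of the time under uniformity. The skewed sample lands far beyond that threshold.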
Creating a Bayesian model of this is easy. Based on the above assumption, all user_ids sampled are from a DiscreteUniform model, with lower bound 1 and an unknown upper bound. We don't have a strong belief about what the upper bound might be, but we do know it will be at least as large as ids.max().
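For the toy website with ids 3 and 10, this model is small enough to evaluate exactly on a grid. Under a DiscreteUniform(1, N) model, the likelihood of k sampled ids is N^(-k) whenever N is at least the largest id, and zero otherwise. The sketch below assumes a flat prior on N up to a hypothetical cap `n_max`; the function names are made up for illustration, and this is not the MCMC approach used for the real data.

```python
import math


def posterior_upper_bound(ids, n_max):
    """Posterior over N in max(ids)..n_max for ids ~ DiscreteUniform(1, N), flat prior."""
    k, m = len(ids), max(ids)
    candidates = range(m, n_max + 1)
    # Likelihood is N**(-k) for N >= m; work in logs and rescale for stability.
    log_like = [-k * math.log(n) for n in candidates]
    peak = max(log_like)
    weights = [math.exp(value - peak) for value in log_like]
    total = sum(weights)
    return list(candidates), [w / total for w in weights]


def prob_below(ids, threshold, n_max=10 ** 6):
    """Posterior probability that the total number of users is below `threshold`."""
    values, probs = posterior_upper_bound(ids, n_max)
    return sum(p for n, p in zip(values, probs) if n < threshold)


print(prob_below([3, 10], 100))
print(prob_below([3, 34989], 100))
```

With ids 3 and 10 the posterior puts roughly 90% of its mass below 100 users, while with ids 3 and 34 989 that probability is exactly zero, since the total can never be smaller than the largest observed id.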
Working with such large numbers can cause numerical problems, hence we will scale everything by a million. Thus, instead of a DiscreteUniform, we will use a Uniform:
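import pymc as pm

FACTOR = 1000000.  # work in millions of users
# Prior: the true count lies between the largest observed id and one million above it.
upper_bound = pm.Uniform("n_sign_ups", ids.max() / FACTOR, ids.max() / FACTOR + 1)
# Likelihood: each scaled id is uniform between 0 and the unknown upper bound.
obs = pm.Uniform("obs", 0, upper_bound, value=ids / FACTOR, observed=True)
# Code to be explained in Chp. 3.
mcmc = pm.MCMC([upper_bound, obs])
mcmc.sample(100000, 45000)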
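The text does not say which numerical problem the scaling avoids. One plausible candidate, offered here only as an illustration, is floating-point underflow: the uniform likelihood of k observations is upper_bound^(-k), which is far too small for a double when the bound is in the millions. The values of `K` and `RAW` below are hypothetical round numbers.

```python
import math


def likelihood(upper_bound, k):
    """Uniform(0, upper_bound) likelihood of k observations that all lie below upper_bound."""
    return math.exp(-k * math.log(upper_bound))


K = 300              # roughly the number of sampled ids
RAW = 4.0e6          # hypothetical upper bound in raw id units
SCALED = RAW / 1e6   # the same bound expressed in millions

print(likelihood(RAW, K))     # underflows to 0.0 in double precision
print(likelihood(SCALED, K))  # tiny but still representable
print(-K * math.log(RAW))     # the log-likelihood is finite at either scale
```

Working with log-likelihoods sidesteps the underflow as well, so the rescaling may serve other purposes in the sampler too.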
from scipy.stats.mstats import mquantiles
samples = mcmc.trace("n_sign_ups")[:]
plt.hist(samples, bins=100,
         label="Uniform prior",
         density=True, alpha=0.8,
         histtype="stepfilled", color="#7A68A6");
quantiles_mean = np.append( mquantiles( samples, [0.05, 0.5, 0.95]), samples.mean() )
print("Quantiles: %s" % (quantiles_mean[:3],))
print("Mean: %s" % quantiles_mean[-1])
plt.vlines( quantiles_mean, 0, 33,
linewidth=2, linestyles = ["--", "--", "--", "-"],
)
plt.title("Posterior distribution of total number of Github users" )
plt.xlabel("number of users (in millions)")
plt.legend()
plt.xlim( ids.max()/FACTOR - 0.01, ids.max()/FACTOR + 0.12 );
Above we have plotted the posterior distribution. Note that there is no posterior probability assigned to the number of users being less than ids.max(). That is good, as it would be an impossible situation.
The three dashed vertical bars, from left to right, are the 5%, 50% and 95% quantile lines. That is, 5% of the probability is before the first line, 50% before the second and 95% before the third. The 50% quantile is also known as the median and is a better measure of centrality than the mean for heavily skewed distributions like this one. The solid line is the posterior distribution's mean.
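To see why the median sits to the left of the mean for a distribution of this shape, the summary can be reproduced on synthetic draws. With a uniform likelihood and a flat prior above the largest id (ignoring the upper limit of the prior), the posterior density is proportional to u^(-K) above the largest id, which can be sampled by inverting its CDF. The values of `M` and `K` below are hypothetical, not the actual data, and `quantile` is a simple nearest-rank helper written for this sketch.

```python
import random


def quantile(sorted_values, q):
    """Empirical q-quantile by nearest rank on an already sorted list."""
    index = min(int(q * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


rng = random.Random(0)
M, K = 4.35, 280  # hypothetical largest id (in millions) and number of unique ids

# Inverse-CDF draws from a density proportional to u**(-K) on [M, infinity).
samples = sorted(M * (1.0 - rng.random()) ** (-1.0 / (K - 1)) for _ in range(50000))

q05, q50, q95 = (quantile(samples, q) for q in (0.05, 0.5, 0.95))
mean = sum(samples) / len(samples)
print("5%%: %.4f  50%%: %.4f  95%%: %.4f  mean: %.4f" % (q05, q50, q95, mean))
```

The long right tail pulls the mean above the median, which is the ordering of the solid and middle dashed lines described above.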
So what can we say? Using the data above, there is a 95% chance that there are fewer than 4.4 million users, and the number is probably around 4.36 million. I wondered how accurate this figure was. At the time of this writing, it seems a bit high, considering that only five months prior the number was at 3 million:
Last night @github crossed the 3M user mark #turntup
— Rick Bradley (@rickbradley) January 15, 2013
I thought perhaps the user_id parameter was being assigned liberally to users, bots, changed names, etc., so I contacted Github Support about it:
@cmrn_dp User IDs are assigned to new users/organizations, whether they’re controlled by humans, groups, or bots.
— GitHub Support (@GitHubHelp) May 6, 2013
So we may be overestimating by including organizations, which perhaps should not be counted as users. TODO: estimate the number of organizations. Any takers?