Sampling Crash Volumes, Rates and Rarity for Socorro Samples

1 Introduction

The Socorro crash report accumulation pipeline does not process all the crash reports. Though every report is stored on disk, only 10% are processed and saved in HBase as JSON objects. Each crash report has a crash signature (Crash Report Signature or CRS for short). The relationship between crash reports and CRSs is many to one.

Consumers of the crash reports (engineers working on bug fixes and product managers, to name a few) had concerns regarding the use of samples. For example, some asked whether the 10% sampling rate is sufficient to accurately estimate the frequency of the CRSs and, if not for all of them, how accurate the estimates are for the top N most frequently observed crash report signatures. With Firefox usage running into the hundreds of millions, we can expect new CRSs to come in every day. Some are very rare (they occur for a small user base) and others are more frequent. How many days can we expect to wait until we see 50% of all the CRSs that come in (for a given version)?

To answer these questions, the #breakpad team processed every crash report for the week of 03/22-29/2011, after the Firefox 4 release. This served as a full enumeration of the crash report data. The full enumeration contained 2.2 million crash reports belonging to 84,760 CRSs.

Primarily, the crash-stats dashboard lists the top 100 most frequent crashes by OS. Some questions:

• How accurate are the sample estimates? Does the top 100 from a sample equal the top 100 from the full enumeration (population) and are the proportion estimates accurate?
• Given estimates, can we provide something about their accuracy?
• How many distinct crash types are there? Throttling takes a random sample of incoming crash reports. If we observe 'N' CRSs in a 10% sample, can we estimate how many there are in the population, i.e. how many we haven't seen? Estimating the number of unique CRSs is entirely different from estimating the proportions of the CRSs.

2 Estimating CRS Proportions and Ranks

Figure 1 plots the estimated proportions of the top 100 most frequently observed Crash Report Signatures based on a 10% sample. 44.5% of the ~200K Crash Reports in the 10% sample belong to these 100 CRS. This percentage estimate is accurate for the population too. The black line is the estimate. The red vertical bars are 95% confidence bands.
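The text does not say how the 95% confidence bands were computed. As a minimal sketch, assuming the usual normal-approximation (Wald) interval for a binomial proportion and ignoring any finite-population correction, the bands for a single signature could be obtained as follows. The function name and the counts are hypothetical and chosen only for illustration.

```python
import math

def proportion_ci(count, n, z=1.96):
    """Return (estimate, lower, upper) for a proportion using the Wald interval.

    count: reports in the sample carrying the signature (hypothetical input)
    n:     total reports in the sample
    z:     normal quantile; 1.96 corresponds to a 95% interval
    """
    p_hat = count / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    # Clip to [0, 1] because the Wald interval can overshoot for rare events.
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Illustrative numbers only: a signature seen 400 times in 200,000 sampled reports.
est, lo, hi = proportion_ci(400, 200_000)
print(f"{100 * est:.3f}% [{100 * lo:.3f}%, {100 * hi:.3f}%]")
```

The Wald interval is known to behave poorly for very rare signatures, so an alternative interval may be preferable in the tail.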

As expected, rarer events are difficult to estimate accurately: we need a large sample to get an accurate count of their occurrence. The Relative Error of the estimate is (Actual Percent - Estimate)/Actual Percent. The Relative Error is expected to decrease as the Actual Percent (Proportion*100) increases. Figure 2 plots the Relative Error % vs. the Actual %. The four panels cover increasing ranges of Estimated Percent from left to right (look at the scale of the horizontal axes). Notice that the Relative Error decreases with a larger percent. The actual breaks for the panels are listed after Figure 2. To preserve resolution on the vertical scale, the most frequent CRS (~6%) has not been displayed.
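The Relative Error defined above, and the reason it should shrink for more frequent signatures, can be sketched as follows. The second function is an assumption not stated in the text: it treats the sample as simple binomial sampling, under which the relative standard error of an estimated proportion p from n reports is sqrt((1 - p) / (n p)). The percentages in the loop are illustrative values, not results.

```python
import math

def relative_error(actual_percent, estimated_percent):
    """(Actual Percent - Estimate) / Actual Percent, as defined in the text."""
    return (actual_percent - estimated_percent) / actual_percent

def relative_standard_error(p, n):
    """Relative standard error of a binomial proportion estimate (assumed model)."""
    return math.sqrt((1.0 - p) / (n * p))

n = 200_000  # roughly the size of the 10% sample
for percent in (0.14, 0.27, 0.41, 2.07):  # illustrative Actual Percent values
    rse = relative_standard_error(percent / 100.0, n)
    print(f"actual {percent:5.2f}%  relative standard error {100 * rse:5.2f}%")
```

Under this model the relative standard error falls roughly as 1/sqrt(p), which matches the downward trend across the panels.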

Increasing the sampling rate from 10% to 17.5% decreases the maximum Relative Difference to 6, and the trend across panels stays the same.

Panel   Minimum Actual Percent  Maximum Actual Percent  Counts
1       0.1425045              0.1860138                  25
2       0.1861036              0.2683228                  25
3       0.2681434              0.3975951                  25
4       0.4132495              2.0706431                  25


With the exception of a few changes in rank (each displacement is less than 8) among the CRSs ranked 80-100, the order of the top 100 CRSs matches that of the full enumeration. The 95% CI bands do not overlap, though we haven't done any multiple comparison tests.
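One way to measure rank displacement between a sample and the full enumeration is sketched below. It is not the code used for the analysis. The function names and the toy counts are hypothetical, and ties are broken alphabetically by signature, which is an arbitrary choice.

```python
from collections import Counter

def top_ranks(counts, top_n):
    """Map the top_n most frequent signatures to their rank (1 = most frequent)."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {sig: rank for rank, (sig, _) in enumerate(ordered[:top_n], start=1)}

def rank_displacements(sample_counts, population_counts, top_n=100):
    """For each signature in the population top_n, return |sample rank - population rank|.

    A signature that never appears in the sample maps to None.
    """
    population_ranks = top_ranks(population_counts, top_n)
    sample_ranks = top_ranks(sample_counts, len(sample_counts))
    return {
        sig: abs(sample_ranks[sig] - rank) if sig in sample_ranks else None
        for sig, rank in population_ranks.items()
    }

# Toy data: sig_b and sig_c are close in the population and swap places in the sample.
population = Counter({"sig_a": 5000, "sig_b": 3000, "sig_c": 2900, "sig_d": 100})
sample = Counter({"sig_a": 510, "sig_b": 285, "sig_c": 300, "sig_d": 9})
displacements = rank_displacements(sample, population, top_n=3)
print(displacements)
```

Swaps like the one in the toy data are expected between signatures whose population proportions are nearly equal, which is the situation near the bottom of the top 100.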

3 Estimating the Number of Unseen Crash Reports

In the full enumeration there are 84,760 unique signatures. How do we estimate the number of signatures of the population based on what is observed in a sample? This is not the same as estimating the total from a random sample, since we are not randomly sampling from Crash Report Signatures but from Crash Reports.

This is not a new problem. Species Richness (see http://en.wikipedia.org/wiki/Species_richness ) is a measure of the number of unique species in a region. It is easy enough to record how many tulips, roses and daisies exist in a region. But how do we estimate how many species of plants there are, including those we haven't encountered? Species Richness is a measure of the biological diversity in a given region. In our case, the species is the Crash Report Signature. Richness ignores relative abundance: if region A has 4 species of flowers distributed in the ratio 80:10:1:9 and another (call it B) has 4 species in the ratio 25:25:25:25, then both have the same species richness (4). Richness is still a useful quantity to look at: it is the number of crash signatures found.

If $$S_n$$ is the number of crash signatures found in a sample of size $$n$$ from a population of size $$N$$, its expectation is denoted by $$E[S_n]$$ and its variance by $$V[S_n]$$. $$E[S_n]$$ is used for estimating the sample sizes required to accurately estimate $$S$$, the number of signatures in the population. An approximate formula for $$E[S_n]$$ (see 1) is $$E[S_n] \approx S-\sum_{i=1}^{S}(1-N_i/N)^n$$ (where $$N_i$$ is the number of crash reports in the population with CRS $$i$$). This ties into the concept of rarefaction, a technique to compare species richness from different sample sizes. Generally speaking, the higher the sampling rate, the more signatures are found. Figure 3 is a graph of $$E[S_n]$$ expressed as a percentage of $$S$$ vs. $$n$$. Keep in mind that $$E[S_n]$$ lies below the true value $$S$$, i.e. $$S_n$$ is biased downward as an estimate of $$S$$. The reader can see that even with 90% of the population sampled, $$E[S_n]$$ is 76% of the true value. Another caveat is that the observed $$N_i$$ come from the 1 week full enumeration, which itself is a temporal snapshot of several weeks of Crash Report data.
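The approximate formula can be evaluated directly from the per-signature counts. The sketch below sums over all S signatures. Each term (1 - N_i/N)^n is the probability that signature i is missed in n draws with replacement, which is why the formula is only an approximation for a sample taken without replacement. The function name and the toy counts (the 80:10:1:9 region from the flower example) are illustrative.

```python
def expected_richness(population_counts, n):
    """Approximate E[S_n] = S - sum_i (1 - N_i / N) ** n.

    population_counts: the N_i, one count per signature in the population
    n:                 sample size (number of crash reports drawn)
    """
    total = sum(population_counts)    # N
    richness = len(population_counts) # S
    expected_missed = sum((1.0 - c / total) ** n for c in population_counts)
    return richness - expected_missed

counts = [80, 10, 1, 9]  # toy population of 100 reports over 4 signatures
for n in (10, 50, 90):
    share = 100 * expected_richness(counts, n) / len(counts)
    print(f"n={n:3d}: expected share of signatures seen = {share:.1f}%")
```

Rare signatures dominate the shortfall: the signature with a single report is still likely to be missed at large n, which mirrors the long tail of singleton CRSs in the crash data.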

It might be of more use to consider the diversity: Region A has 80% of the first species whereas B has the species uniformly distributed. The species diversity (see 2) takes the relative abundance of the species into account. Its definition is $$(1- \sum_{i=1}^{S} \pi_i^2)$$ where $$S$$ is the total number of species and $$\pi_i$$ is the proportion of species $$i$$ in the population. It is the probability that two randomly observed reports belong to different signatures. Diversity is thus a measure of the heterogeneity of the signatures. It tells us the prevalence of crash signatures based on their probability of occurrence: rarer ones contribute less to the index.

Diversity can be used for ranking the heterogeneity of signatures of two different versions of Firefox. In this example, we'll use it to measure the hourly/daily change of signature diversity. A variation on this (possibly more useful for our case) is to consider the top hundred most frequently observed signatures, normalize their proportions and compare the diversity of the top 100. My fear is that the long tail provides little information to the diversity.

Figure 4 plots the diversity vs. hours since release; notice the dip and then the resurgence. The dip probably means more crashes from a smaller set of signatures. The uptick is probably because of more users and a wider variety of behaviors, therefore causing more signatures. Figure 5 is the density vs. day (day 0 to day 7); notice how the distribution becomes more pronounced (variation reduces) and shifts to the right.

Lastly, we'll look at the diversity index as it changes over cumulative hours, i.e. diversity up to the first hour, diversity up to the second hour and so on. By the end of the second day, there is little change in the diversity index, indicating either no new signatures (not the case) or that the new ones occur very rarely.
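The diversity index and its cumulative-by-hour variant can be sketched as below. This is an illustration, not the code behind Figures 4 and 5. The function names, the per-hour input layout (one list of signatures per hour) and the toy data are assumptions.

```python
from collections import Counter

def diversity(counts):
    """1 - sum(pi_i ** 2), where pi_i is the share of reports with signature i."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one report")
    return 1.0 - sum((c / total) ** 2 for c in counts)

def cumulative_diversity(hourly_signatures):
    """Diversity of all reports seen up to and including each hour."""
    running = Counter()
    trend = []
    for signatures in hourly_signatures:
        running.update(signatures)  # add this hour's reports to the running totals
        trend.append(diversity(running.values()))
    return trend

# The two regions from the flower example: same richness, different diversity.
region_a = diversity([80, 10, 1, 9])
region_b = diversity([25, 25, 25, 25])
print(region_a, region_b)

# Toy hourly feed: each inner list holds the signatures of that hour's reports.
hours = [["s1", "s1", "s2"], ["s1", "s3"], ["s1"]]
print(cumulative_diversity(hours))
```

Region B scores higher than region A even though both contain four species, which is the distinction between richness and diversity drawn in the text.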

4 Future

We would like to use crash report data to compare the stability of versions. Could this be done using the number of signatures that account for 50% of the crash reports? One approach we can toy with is:

From a 10% sample, compute the number of signatures that account for 50% of the reports. This number is a version of richness. For these signatures, also compute the diversity index (having normalized the proportions). This approach ignores the very long tail.
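A minimal sketch of this proposal follows, assuming the sample is available as a list of per-signature report counts. The function names and the counts are hypothetical.

```python
def signatures_covering(counts, fraction=0.5):
    """Return the largest counts, in descending order, needed to reach
    `fraction` of all reports."""
    ordered = sorted(counts, reverse=True)
    target = fraction * sum(ordered)
    covered, head = 0, []
    for c in ordered:
        if covered >= target:
            break
        head.append(c)
        covered += c
    return head

def normalized_diversity(head):
    """Diversity index over the selected signatures only, after renormalizing
    their proportions so that they sum to 1."""
    total = sum(head)
    if total == 0:
        raise ValueError("need at least one report")
    return 1.0 - sum((c / total) ** 2 for c in head)

# Hypothetical per-signature counts from a sample of 1,000 reports.
sample_counts = [400, 250, 150, 100, 50, 30, 20]
head = signatures_covering(sample_counts)
print(len(head), normalized_diversity(head))
```

Here `len(head)` plays the role of the truncated richness and `normalized_diversity(head)` the diversity of the same head of the distribution, so two versions could be compared on both numbers.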

On another note, how many days does it take to pick up the bulk of the crash report signatures? Upon release, do the counts of unique CRSs come in slowly? Or, because of the skew in the CRS distribution, can we expect to observe the bulk very early on? Admittedly, the question is vaguely worded; in the subsequent sections we'll try to be more specific.
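One way to make the question concrete is to count the days until the cumulative set of distinct signatures reaches a given fraction of all signatures seen in the observation window. The sketch below assumes one collection of signatures per day. The function name and toy data are hypothetical, and the denominator is the number of signatures observed in the window, not the unknown population total.

```python
def days_until_fraction_seen(daily_signatures, fraction=0.5):
    """Return the 1-based day on which the distinct signatures seen so far first
    reach `fraction` of all distinct signatures in the window (None if empty)."""
    all_signatures = set().union(*daily_signatures)
    if not all_signatures:
        return None
    target = fraction * len(all_signatures)
    seen = set()
    for day, signatures in enumerate(daily_signatures, start=1):
        seen.update(signatures)
        if len(seen) >= target:
            return day
    return None

# Toy window of three days with six distinct signatures in total.
days = [["a", "b", "c"], ["a", "d"], ["e", "f"]]
print(days_until_fraction_seen(days, 0.5))
```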

References

Footnotes:

2 vegan: Community Ecology Package, Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O'Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, Helene Wagner, http://cran.r-project.org/web/packages/vegan/index.html

3 A Primer of Ecology with R, M. Henry H. Stevens, Springer, 2nd Printing

1 Explicit Calculation of the Rarefaction Diversity Measurement and the Determination of Sufficient Sample Size, Kenneth L. Heck, Jr., Gerald van Belle, Daniel Simberloff, Ecology, Vol. 56, No. 6, (Autumn, 1975), pp. 1459-1461

2 The Nonconcept of Species Diversity: A Critique and Alternative Parameters, Stuart H. Hurlbert, Ecology, Vol. 52, No. 4, 577-587

Date: 2012-02-21 15:12:55 PST
