When is there enough data for generalization?

Are there any general rules that one can use to infer what can be learned/generalized from a particular data set? Suppose the data set was taken from a sample of people. Can these rules be stated as functions of the sample or of the total population? I understand the above may be vague, so here is a concrete scenario: Users participate in a search task, where the data are their queries, clicked results, and the HTML content (text only) of those results. Each of these is tagged with its user and timestamp. A user may generate a few pages, for a simple fact-finding task, or hundreds of pages, for a longer-term search task such as a class report. Edit: In addition to generalizing about a population given a sample, I'm interested in generalizing about an individual's overall search behavior given a time slice. Theory and paper references are a plus!

asked Aug 4, 2014 at 19:10

$\begingroup$ There's one way to get "generalization" that is good enough for any scenario: sample the entire population. In all other cases, your best option is to select a confidence level and take a sample large enough to give a reasonably tight confidence interval. $\endgroup$

Commented Aug 4, 2014 at 20:06

3 Answers

$\begingroup$

It is my understanding that random sampling is a mandatory condition for making any generalization statements. IMHO, other parameters, such as sample size, affect only the probability level (confidence) of the generalization. Furthermore, clarifying @ffriend's comment, I believe that you have to calculate the needed sample size based on desired values of the confidence interval, effect size, statistical power and number of predictors (this is based on Cohen's work - see the References section at the following link). For multiple regression, you can use the following calculator: http://www.danielsoper.com/statcalc3/calc.aspx?id=1.
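If you'd rather do that calculation in code than in the online calculator, here is a minimal R sketch using the pwr package; the effect size f2 = 0.15 (Cohen's "medium") and the five predictors are illustrative assumptions, not values from the question:

```r
# install.packages("pwr")  # if not already installed
library(pwr)

# A priori sample size for multiple regression, using Cohen's f^2 effect size.
# u = number of predictors (assumed 5); f2 = 0.15 is an assumed "medium" effect.
result <- pwr.f2.test(u = 5, f2 = 0.15, sig.level = 0.05, power = 0.80)

# pwr.f2.test solves for v, the error degrees of freedom; total n = v + u + 1.
n <- ceiling(result$v) + 5 + 1
n  # required sample size under these assumptions
```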

More information on how to select, calculate and interpret effect sizes can be found in the following nice and comprehensive paper, which is freely available: http://jpepsy.oxfordjournals.org/content/34/9/917.full.

If you're using R (and even, if you don't), you may find the following Web page on confidence intervals and R interesting and useful: http://osc.centerforopenscience.org/static/CIs_in_r.html.
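As a small taste of what that page covers, here is a minimal base R sketch of a 95% confidence interval for a mean; the data are simulated purely for illustration:

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)  # simulated measurements

# t.test() reports the 95% confidence interval for the mean by default.
t.test(x)$conf.int

# Or compute it by hand: mean +/- t critical value * standard error.
m  <- mean(x)
se <- sd(x) / sqrt(length(x))
m + c(-1, 1) * qt(0.975, df = length(x) - 1) * se
```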

Finally, the following comprehensive guide to survey sampling can be helpful, even if you're not using survey research designs. In my opinion, it contains a wealth of useful information on sampling methods, sampling size determination (including calculator) and much more: http://home.ubalt.edu/ntsbarsh/stat-data/Surveys.htm.
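For a quick back-of-the-envelope version of the sample size calculators discussed there: to estimate a population proportion within margin of error $e$ at confidence level $1 - \alpha$, the classic formula is $n = z^2 p(1-p) / e^2$, with $p = 0.5$ as the conservative choice. A base R sketch, where the 5% margin and 95% confidence are assumed for illustration:

```r
# Required sample size to estimate a proportion within a given margin of error.
sample_size_prop <- function(margin = 0.05, conf = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)   # critical value, e.g. 1.96 for 95%
  ceiling(z^2 * p * (1 - p) / margin^2)
}

sample_size_prop()  # 385 respondents for +/-5% at 95% confidence
```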

answered Aug 5, 2014 at 8:09 Aleksandr Blekh

$\begingroup$

There are two rules for generalizability:

  1. The sample must be representative. In expectation, at least, the distribution of features in your sample must match the distribution of features in the population. When you are fitting a model with a response variable, this includes features that you do not observe but that affect the response variables in your model. Since it is, in many cases, impossible to know what you do not observe, random sampling is used. The idea behind randomization is that a random sample will, up to sampling error, accurately reflect the distribution of all features in the population, observed and otherwise. This is why randomization is the "gold standard," but if sample control is available by some other technique, or it is defensible to argue that there are no omitted features, then randomization isn't always necessary.
  2. Your sample must be large enough that the effect of sampling error on the feature distribution is relatively small. This is, again, to ensure representativeness. But deciding whom to sample is different from deciding how many people to sample; the simulation sketch after this list illustrates how sampling error shrinks as the sample grows.
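A minimal R simulation of the second rule, assuming a binary feature held by 30% of a hypothetical population (all numbers are illustrative): the sample proportion concentrates around the population proportion as n grows.

```r
set.seed(1)
p_true <- 0.30  # assumed population share of some binary feature

# For each sample size, draw many random samples and measure the spread
# of the sample proportion around the true population value.
for (n in c(25, 100, 400, 1600)) {
  props <- replicate(2000, mean(rbinom(n, size = 1, prob = p_true)))
  cat(sprintf("n = %4d: sd of sample proportion = %.3f\n", n, sd(props)))
}
# The spread falls roughly as 1/sqrt(n): quadrupling n halves the error.
```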

Since it sounds like you're fitting a model, there's the additional consideration that certain important combinations of features could be relatively rare in the population. This is not an issue for generalizability, but it bears heavily on your considerations for sample size. For instance, I'm working on a project now with (non-big) data that was originally collected to understand the experiences of minorities in college. As such, it was critically important to ensure that statistical power was high specifically in the minority subpopulation. For this reason, blacks and Latinos were deliberately oversampled. However, the proportion by which they were oversampled was also recorded and used to compute survey weights, which can re-weight the sample so that it reflects the estimated population proportions whenever a representative sample is required (a sketch follows below).
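Here is a minimal base R sketch of that re-weighting idea; the group names, sampling fractions, and response values are made up for illustration. Each unit's weight is the inverse of its sampling probability:

```r
set.seed(3)

# Toy sample: group B was deliberately oversampled.
d <- data.frame(
  group    = c(rep("A", 60), rep("B", 40)),
  response = c(rnorm(60, mean = 1), rnorm(40, mean = 2))
)

# Assumed sampling fractions: 1% of subpopulation A, 4% of subpopulation B.
frac <- c(A = 0.01, B = 0.04)
d$weight <- 1 / frac[d$group]  # inverse-probability survey weights

# The unweighted mean is pulled toward the oversampled group;
# the weighted mean estimates the population mean.
mean(d$response)
weighted.mean(d$response, d$weight)
```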

An additional consideration arises if your model is hierarchical. A canonical example of a hierarchical model concerns children's behavior in schools: children are "grouped" by school and share school-level traits. Therefore a representative sample of schools is required, and within each school a representative sample of children. This leads to stratified sampling, sketched below. This and some other sampling designs are reviewed in surprising depth on Wikipedia.
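A minimal base R sketch of that stratified design, with an invented roster of schools and children: each school is a stratum, and a fixed number of children is drawn at random from every stratum so that each school is represented.

```r
set.seed(7)

# Hypothetical roster: 20 schools (the strata), with 50-200 children each.
roster <- do.call(rbind, lapply(1:20, function(s) {
  data.frame(school = s, child = seq_len(sample(50:200, 1)))
}))

# Stratified sampling: a random sample of 10 children within every school.
picked <- do.call(rbind, lapply(split(roster, roster$school), function(kids) {
  kids[sample(nrow(kids), 10), ]
}))

table(picked$school)  # 10 children from each of the 20 schools
```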