Statistical Definitions

SOME DEFINITIONS...

Updated 2010-03-02

An experiment that can result in different outcomes, even though it is repeated in the same manner every time, is called a random experiment.

The set of all possible outcomes of a random experiment is called the sample space of the experiment.

An event is a subset of the sample space of a random experiment.

A sample space is discrete if it consists of a finite (or countably infinite) set of outcomes.

Probability is a measure of certainty on a scale of 0 to 1. The probability of an impossible event is 0; the probability of an inevitable event is 1. If A and B are events, then P(A+B) = P(A) + P(B) - P(A.B), where A+B denotes set union and A.B denotes set intersection. Any one of the following three definitions of probability can be used to assign a probability to an event that is neither impossible nor inevitable.

The relative frequency definition of probability (more commonly called the classical definition) applies when the sample space consists of elementary outcomes which, through physical symmetry, are recognized as being equally likely. The probability of an event E is the number of elementary outcomes in E divided by the number of elementary outcomes in the sample space.
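As an illustration of counting equally likely outcomes, here is a minimal Python sketch. The two-dice experiment and the names `sample_space`, `prob`, `A` and `B` are assumptions chosen for the example, not part of the original notes. It also checks the addition rule P(A+B) = P(A) + P(B) - P(A.B).

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 equally likely ordered outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

A = {o for o in sample_space if sum(o) == 7}  # the total is 7
B = {o for o in sample_space if o[0] == 1}    # the first die shows 1

p_union = prob(A | B)                               # P(A+B) by direct counting
p_addition_rule = prob(A) + prob(B) - prob(A & B)   # P(A) + P(B) - P(A.B)
print(prob(A), prob(B), p_union, p_addition_rule)   # 1/6 1/6 11/36 11/36
```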

The limiting frequency definition of probability applies when you can envisage a sequence of independent trials. Consider the number of trials that result in an event E, divided by the total number of trials. The probability of E is the hypothetical limit to which any such series of trials will tend.
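The limiting frequency idea can be sketched by simulation. The example below assumes a fair die and takes E to be "a six is rolled"; the function name `relative_frequency` and the seed are illustrative choices. A simulation can only suggest the limit, it cannot prove it.

```python
import random

def relative_frequency(n_trials, seed=0):
    """Fraction of n_trials independent fair-die rolls that show a six."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) == 6)
    return hits / n_trials

# The relative frequency should settle near 1/6 as the number of trials grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(n))
```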

The subjective definition of probability defines your personal probability of an event E as the maximum amount of money you are willing to bet in order to win $1 if E occurs.

A random variable is a function that assigns a real number to each outcome in the sample space of a random experiment.

A discrete random variable is a random variable with a finite (or countably infinite) range.

A continuous random variable is a random variable with an interval (either finite or infinite) of real numbers for its range.

The probability density function for a random variable X is a non-negative function f(x) which gives the relative probability of each point x in the sample space of X. It integrates to 1 over the whole sample space. The integral over any subset of the sample space gives the probability that X will fall in that subset.

The probability density function for a discrete-valued random variable is sometimes called a probability mass function.

A parameter is a scalar or vector that indexes a family of probability distributions.

The expected value or mean or average of a random variable is computed as a sum (or integral) over all possible values of the random variable, each weighted by the probability of getting that value. It can be interpreted as the centre of mass of the probability distribution.

The variance of a random variable is the expected squared deviation from the mean.
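For a discrete random variable, the mean and variance defined above are probability-weighted sums. A small sketch, assuming X is the score on one fair die (the names `pmf` and `expectation` are illustrative):

```python
from fractions import Fraction

# Probability mass function of a fair die: each value 1..6 has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """Expected value of g(X): sum of g(x) weighted by P(X = x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)
variance = expectation(pmf, lambda x: (x - mean) ** 2)  # expected squared deviation
print(mean, variance)  # 7/2 35/12
```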

The covariance of two random variables is the expected product of their deviations from their respective means.

The correlation coefficient is a dimensionless measure of association between two random variables. Pearson's correlation coefficient is computed as their covariance divided by the product of their standard deviations. It ranges from +1 (perfect linear relationship with positive slope) to -1 (perfect linear relationship with negative slope), with 0 indicating no linear relationship. Two variables can be strongly related in a non-linear way and still have a correlation of 0.
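The following sketch computes covariance and Pearson's correlation for paired values that are treated as equally likely. The data are made up for illustration: one set lies exactly on a line, the other follows Y = X² on a range symmetric about zero.

```python
import math

def covariance(xs, ys):
    """Expected product of deviations from the means, with equal weight on each pair."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson's correlation: covariance divided by the product of standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [-2, -1, 0, 1, 2]
r_linear = correlation(xs, [2 * x + 1 for x in xs])  # exact line with positive slope
r_quadratic = correlation(xs, [x * x for x in xs])   # exact but non-linear dependence
print(r_linear, r_quadratic)
```

In the second case Y is completely determined by X, yet the correlation is zero because the relationship has no linear component.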

A time series is a sequence of observations ordered in time or space.

Autocorrelation is correlation between observations in a time series that are separated by a fixed lag, most often consecutive observations (lag 1). A sequence of independent observations has zero autocorrelation.
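One common sample estimate of autocorrelation divides the sum of lagged cross-products of deviations by the sum of squared deviations. The sketch below uses that estimator; other conventions exist, and the function name and the two toy series are illustrative.

```python
def sample_autocorrelation(series, lag=1):
    """Sample autocorrelation at the given lag (one common estimator among several)."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return numerator / denominator

alternating = [1, -1] * 5   # neighbours always have opposite signs
trend = list(range(10))     # neighbours are always close together

print(sample_autocorrelation(alternating))  # strongly negative
print(sample_autocorrelation(trend))        # strongly positive
```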

When approximating a discrete distribution (defined on integer values) by a continuous distribution, the sum of probabilities up to and including the probability at a point x is usually approximated by the area under the continuous distribution up to x + 0.5; the 0.5 is called the continuity correction.
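To see the effect of the continuity correction, the sketch below compares an exact binomial probability with its normal approximation, with and without the extra 0.5. The values n = 20, p = 0.4 and x = 6 are arbitrary choices for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, x = 20, 0.4, 6

# Exact P(X <= x) for X ~ Bin(n, p), summing the probability mass function.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

# Normal distribution with the same mean and variance as Bin(n, p).
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
with_correction = approx.cdf(x + 0.5)  # area up to x + 0.5
without_correction = approx.cdf(x)     # area up to x only

print(exact, with_correction, without_correction)
```

In this case the corrected value is much closer to the exact probability than the uncorrected one.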

A population consists of the totality of the observations with which we are concerned.

A sample is a subset of observations selected from the population.

A statistic is any function of the observations in a sample. It must not depend on any unknown parameters.

The distribution of a statistic is called a sampling distribution. It describes how the statistic will vary from one sample to another.
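A sampling distribution can be approximated by drawing many samples and computing the statistic for each. The sketch below does this for the sample mean, assuming a normal population with mean 50 and standard deviation 10; these values, the sample size and the function name are all illustrative.

```python
import random
from statistics import fmean, pstdev

def simulate_sample_means(n, n_samples=5000, mu=50.0, sigma=10.0, seed=1):
    """Means of n_samples independent samples of size n from N(mu, sigma^2)."""
    rng = random.Random(seed)
    return [fmean(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(n_samples)]

means = simulate_sample_means(n=25)

# The sample mean varies from sample to sample around mu,
# with standard deviation close to sigma / sqrt(n) = 10 / 5 = 2.
print(fmean(means), pstdev(means))
```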

A statistic will be of use in a given application only if its sampling distribution depends on the parameter of interest. If it also depends on a parameter that is not of interest, that parameter is called a nuisance parameter.

A pivotal quantity is a function of a statistic and the parameter of interest that follows a standard distribution. That distribution must not depend on any unknown parameters.

A confidence interval is a random interval which includes the true value of the parameter of interest with probability 1-α. When we have computed a confidence interval from data, it is fixed by the data and no longer random, so we say that we are 100(1-α)% confident that it includes the true value of the parameter.
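The "random interval" reading can be checked by simulation: over many repeated samples, about the stated proportion of the computed intervals should contain the true parameter. The sketch assumes a normal population with known standard deviation and uses the usual z-interval for the mean; the function names and numerical settings are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def z_interval(sample, sigma, alpha=0.05):
    """Two-sided interval for a normal mean with known sigma, confidence level 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / sqrt(len(sample))
    centre = fmean(sample)
    return centre - half_width, centre + half_width

def coverage(mu=10.0, sigma=2.0, n=16, reps=5000, alpha=0.05, seed=2):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma, alpha)
        hits += low <= mu <= high
    return hits / reps

print(coverage())  # should be close to 0.95
```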

In parametric statistical inference, an hypothesis is a statement about the parameters of a probability distribution.

A null hypothesis states that there is no difference between the hypothesized value of a parameter and its true value.

The alternative hypothesis is an hypothesis that applies when the null hypothesis is false.

A simple hypothesis specifies a single value for a parameter; a composite hypothesis specifies more than one value, or a range of values.

A test statistic can be derived from a pivotal quantity by replacing the unknown parameter by its hypothesized value.

The distribution of a test statistic when the null hypothesis is true is called the reference distribution for the test.

Rejecting the null hypothesis when it is true is defined as a type I error. Failing to reject the null hypothesis when it is false is defined as a type II error.

In an accept-reject test of hypothesis, the conditional probability of committing a type I error, given that the null hypothesis is true, is called the level of significance of the test.

In an accept-reject test of hypothesis, the conditional probability of rejecting the null hypothesis, given that the alternative hypothesis is true, is called the power of the test. The type II error rate is computed as (1-power).
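As a worked example of power, consider the upper-tailed z-test for a normal mean with known standard deviation. The sketch below is one way to compute its power at a specific alternative; the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the upper-tailed z-test of H0: mu = mu0 when the true mean is mu1."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)  # reject when z exceeds this value
    shift = (mu1 - mu0) * sqrt(n) / sigma     # mean of z under the alternative
    return 1 - std_normal.cdf(critical - shift)

power = z_test_power(mu0=10.0, mu1=11.0, sigma=2.0, n=16)
type_ii_error_rate = 1 - power
print(power, type_ii_error_rate)

# When the true mean equals mu0, the rejection probability is the significance level.
print(z_test_power(mu0=10.0, mu1=10.0, sigma=2.0, n=16))
```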

There are three definitions of P-value. Satisfy yourself that all three mean exactly the same thing.

(1) P-value is the smallest level of significance that will lead to rejection of the null hypothesis with the given data.

(2) P-value is the largest level of significance that will lead to acceptance of the null hypothesis with the given data.

(3) P-value is the probability of getting a value of the test statistic as extreme as, or more extreme than, the value observed, if the null hypothesis were true. The alternative hypothesis determines the direction of "extreme".
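The agreement between the definitions can be checked numerically for a z-test. The sketch below computes the P-value by definition (3) and compares "P-value at most the level" with the two-sided accept-reject decision at several levels; the data summary and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_statistic(xbar, mu0, sigma, n):
    """Test statistic for a normal mean with known sigma."""
    return (xbar - mu0) / (sigma / sqrt(n))

def p_value(z, alternative="two-sided"):
    """Probability of a value at least as extreme as z if the null hypothesis is true."""
    if alternative == "greater":
        return 1 - STD_NORMAL.cdf(z)
    if alternative == "less":
        return STD_NORMAL.cdf(z)
    if alternative == "two-sided":
        return 2 * (1 - STD_NORMAL.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")

def rejects(z, alpha):
    """Two-sided accept-reject decision at significance level alpha."""
    return abs(z) >= STD_NORMAL.inv_cdf(1 - alpha / 2)

z = z_statistic(xbar=11.0, mu0=10.0, sigma=2.0, n=16)  # hypothetical data summary
p = p_value(z)
for alpha in (0.10, 0.05, 0.01):
    # The test rejects exactly when the P-value is at most alpha.
    print(alpha, rejects(z, alpha), p <= alpha)
```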

A statistical method is said to be robust if it does what it is supposed to do even when the assumptions on which it is based are not satisfied. (For example, the z-test for a normal mean when the variance is known is robust against non-normality, but not against dependent data or an incorrectly specified variance.)

In the simple linear regression model, the conditional mean of the dependent variable (also called the Y-variable or response variable or predicted variable) is a linear function of a single independent variable (also called the X-variable or explanatory variable or predictor variable or covariate).

The term regression comes from breeding experiments. If inheritance were perfect, plotting a characteristic of an offspring against the same characteristic in the parent would give points along the diagonal. In reality, offspring tend to "regress" towards the population mean, so that offspring of superior parents tend to be less superior than their parents and offspring of inferior parents tend to be less inferior than their parents, hence the points will lie along a line with slope less than 1. This was called the "regression line" and fitted by least squares. Now, any model fitting with least squares is called "regression".
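A least-squares fit of the simple linear regression line has a closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the line passes through the point of means. A minimal sketch, with made-up data and an illustrative function name:

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]  # hypothetical responses scattered about y = 1 + 2x
intercept, slope = least_squares_line(xs, ys)
print(intercept, slope)
```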

In the multiple linear regression model, the conditional mean of the dependent variable is a linear function of more than one independent variable.

A categorical independent variable is called a factor. The categories are called the levels of the factor.

Replications are experimental observations made under the same conditions, that is, under the same combination of factor levels.

An experimental design is said to be balanced if each combination of factor levels is replicated the same number of times.

The case where the variance of a random variable is the same in each subpopulation, or at any given level of the covariates, is called homoscedastic. The contrary case is called heteroscedastic.

In analysis of variance, or ANOVA, the sum of squared deviations of the dependent variable about its mean is broken down into a sum of terms, each term a sum of squared deviations representing the variation attributable to an explanatory variable, and the residual, or unexplained, variation.

The terms residual sum of squares or error sum of squares apply to the sum of squared deviations after a model has been fitted, whether or not the model is correct, even if there are possible explanatory variables that have not been put into the model.

A residual sum of squares based on replication is sometimes called a pure error sum of squares because there is no explanation for the observed variability other than random error.

The degrees of freedom of a sum of squared deviations is the number of squares in the sum, minus the number of fitted parameters in the expected values about which the deviations are computed.

A mean square is a sum of squares divided by its degrees of freedom.
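The ANOVA breakdown, degrees of freedom and mean squares can be illustrated with a one-factor layout. The sketch below assumes a balanced design with three levels and three replications each; the data and names are invented for the example.

```python
def one_way_anova(groups):
    """Sums of squares, degrees of freedom and mean squares for a single factor."""
    observations = [y for ys in groups.values() for y in ys]
    n, k = len(observations), len(groups)
    grand_mean = sum(observations) / n
    group_means = {g: sum(ys) / len(ys) for g, ys in groups.items()}

    ss_total = sum((y - grand_mean) ** 2 for y in observations)
    ss_between = sum(len(ys) * (group_means[g] - grand_mean) ** 2
                     for g, ys in groups.items())
    ss_within = sum((y - group_means[g]) ** 2
                    for g, ys in groups.items() for y in ys)

    # Within: n squares minus k fitted group means. Total: n squares minus one grand mean.
    df_between, df_within, df_total = k - 1, n - k, n - 1
    return {
        "ss_total": ss_total, "ss_between": ss_between, "ss_within": ss_within,
        "df_total": df_total, "df_between": df_between, "df_within": df_within,
        "ms_between": ss_between / df_between, "ms_within": ss_within / df_within,
    }

# Hypothetical balanced data: three factor levels, three replications each.
data = {"A": [4.0, 5.0, 6.0], "B": [7.0, 8.0, 9.0], "C": [1.0, 2.0, 3.0]}
table = one_way_anova(data)
print(table)
```

Because the replications share the same factor level, the within-level sum of squares here is a pure error sum of squares.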

RELATIONS BETWEEN SOME DISTRIBUTIONS...

The Bin(n, p) distribution can be approximated by the Pois(np) distribution when n is large and p is small.
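The quality of the Poisson approximation can be inspected by comparing the two probability mass functions directly. The sketch below uses n = 1000 and p = 0.003 as an arbitrary "large n, small p" case.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, m):
    """P(X = k) for X ~ Pois(m)."""
    return exp(-m) * m**k / factorial(k)

n, p = 1000, 0.003
m = n * p  # matching Poisson mean

# Largest pointwise difference between the two mass functions over k = 0..20.
max_difference = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, m)) for k in range(21))
print(max_difference)
```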

The Bin(n, p) distribution can be approximated by the N(np, np(1-p)) distribution when p < 0.5 and np > 5, or when q < 0.5 and nq > 5, where q = 1-p. The continuity correction is recommended.

The Pois(m) distribution can be approximated by the N(m, m) distribution when m > 5. The continuity correction is recommended.

A process in time (or space) where events happen one at a time, at random, independently of each other, at a constant average rate λ, is called a Poisson process. The number of events in a fixed time interval of length t follows a Poisson distribution with mean λt. The time between events, or from an arbitrary time to the next event, follows an exponential distribution with mean 1/λ.
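The link between exponential waiting times and Poisson counts can be checked by simulation. The sketch below builds a process from independent exponential gaps and counts the events falling in a fixed interval; the rate of 2 events per unit time, the interval length of 5 and the names used are illustrative assumptions.

```python
import random
from statistics import fmean, pvariance

def count_events(rate, t, rng):
    """Number of events in [0, t] when gaps between events are exponential with mean 1/rate."""
    clock, count = 0.0, 0
    while True:
        clock += rng.expovariate(rate)  # waiting time until the next event
        if clock > t:
            return count
        count += 1

rng = random.Random(3)
counts = [count_events(rate=2.0, t=5.0, rng=rng) for _ in range(4000)]

# For a Poisson count, the mean and the variance should both be near rate * t = 10.
print(fmean(counts), pvariance(counts))
```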

The relations between the Normal, Chi-square, t and F distributions can be illustrated with the following identities, which you should verify in the tables:

z_p = t_{\infty, p}

(z_{1-p/2})^2 = (t_{\infty, 1-p/2})^2 = \chi^2_{1, 1-p} = F_{1, \infty, 1-p} = 1 / F_{\infty, 1, p}

(t_{d, 1-p/2})^2 = F_{1, d, 1-p} = 1 / F_{d, 1, p}

\chi^2_{d, 1-p} / d = F_{d, \infty, 1-p} = 1 / F_{\infty, d, p}
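The identities above follow from a few standard facts about these distributions, summarized in the LaTeX sketch below. It assumes that the last subscript denotes a cumulative (lower-tail) probability, so that P(X ≤ x_p) = p; the notes do not state the convention, and with upper-tail subscripts the roles of p and 1-p would swap.

```latex
% Assumption: x_p denotes the point with cumulative probability p.
\begin{align*}
Z \sim N(0,1) &\implies Z^2 \sim \chi^2_1, \\
T \sim t_d &\implies T^2 \sim F_{1,d}, \\
W \sim F_{a,b} &\implies 1/W \sim F_{b,a}, \\
t_d \to N(0,1) \quad &\text{as } d \to \infty, \\
F_{d,m} \to \chi^2_d / d \quad &\text{as } m \to \infty.
\end{align*}
% Squaring a symmetric variable folds both tails together, which is why
% the 1 - p/2 point of t_d matches the 1 - p point of F_{1,d}:
\[
P\left(T^2 \le c^2\right) = P(-c \le T \le c) = 1 - p
\quad \text{when } c = t_{d,\,1-p/2}.
\]
% Taking reciprocals reverses the order, so the 1 - p point of F_{a,b}
% is the reciprocal of the p point of F_{b,a}.
```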

Statistics 3N03