Long page descriptions

Chapter 1   Introduction: About Data

1.1   Context

1.1.1   Importance of context

The focus of statistics is to answer questions that are expressed in the language of some application area. Statistical methods for analysis of data are a core part of statistics, but the context of the data is most important.

1.1.2   Statistics

Statistical analysis is a process that involves identifying the questions of interest, collecting the data, analysing them and producing a report. In real-life problems, the data collection and analysis steps may be repeated more than once.

1.1.3   The statistical process

In many applications, the cycle of data collection and analysis is a central part of the quest for improvement to systems and processes.

1.2   Standard data structures

1.2.1   Variables and individuals

Most data sets contain one or more measurements from each of a collection of 'individuals' (also called 'cases' or 'units').

1.2.2   Types of variable

Variables are classified into numerical and categorical variables. A finer classification is also sketched.

1.2.3   Categorical variables and groups

A categorical variable can be used to split the 'individuals' into groups. Equivalently, grouped data can be represented in a data matrix with a categorical variable.
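The equivalence between the two representations can be sketched in a few lines of Python. The data values and the names `rows`, `split_by` and `to_matrix` are hypothetical and chosen only for illustration; they do not come from CAST.

```python
from collections import defaultdict

# Hypothetical data matrix: one row per individual, holding a categorical
# variable ("group") and a numerical variable ("value").
rows = [
    {"group": "A", "value": 4.1},
    {"group": "B", "value": 5.3},
    {"group": "A", "value": 3.8},
    {"group": "B", "value": 6.0},
]

def split_by(rows, key, value):
    """Use the categorical variable `key` to split the individuals into groups."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return dict(groups)

def to_matrix(groups, key, value):
    """Turn grouped data back into a data matrix with a categorical column."""
    return [{key: g, value: v} for g, vals in groups.items() for v in vals]

groups = split_by(rows, "group", "value")
matrix = to_matrix(groups, "group", "value")
```

Here `groups` holds one list of values per category, and `matrix` contains the same rows as the original data matrix, although possibly in a different order.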

1.2.4   Meaningful variables

Sometimes a ratio or difference of two variables in a data matrix is easier to interpret than the original variables.
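As a minimal sketch, a derived variable can be added as a new column of the data matrix. The column names `before` and `after` and the numbers below are invented for illustration only.

```python
# Hypothetical data matrix with two measurements on each individual.
rows = [
    {"before": 82.0, "after": 79.5},
    {"before": 70.0, "after": 71.0},
]

for row in rows:
    # The difference shows the change for each individual directly.
    row["change"] = row["after"] - row["before"]
    # The ratio expresses the same change relative to the starting value.
    row["ratio"] = row["after"] / row["before"]
```

Whether the difference or the ratio is easier to interpret depends on the application; both are shown here only to illustrate the idea.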

1.2.5   Time-ordered data

In some data matrices, the rows are time-ordered.

1.2.6   Multi-level data (advanced)

Sometimes information is available at both group and individual level; such data are called multi-level data. They are most naturally stored in two data matrices.
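One possible layout for the two data matrices is sketched below, assuming that each individual-level row carries a key naming its group. The school and pupil data, and the helper `with_group_info`, are hypothetical.

```python
# Group-level data matrix: one entry per group (here, hypothetical schools).
schools = {
    "S1": {"type": "urban"},
    "S2": {"type": "rural"},
}

# Individual-level data matrix: one row per individual (here, pupils),
# with a "school" column that links each row to its group.
pupils = [
    {"school": "S1", "score": 61},
    {"school": "S1", "score": 74},
    {"school": "S2", "score": 68},
]

def with_group_info(individuals, groups, key):
    """Attach the group-level variables to each individual-level row."""
    return [{**ind, **groups[ind[key]]} for ind in individuals]

combined = with_group_info(pupils, schools, "school")
```

Keeping the two matrices separate avoids repeating the group-level values on every row; they are merged only when an analysis needs both levels together.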

1.2.7   Structure of the e-book

Statistical analysis is specific to the structure of the data (i.e. the types of variable in the data matrix). CAST starts with descriptive methods to explore data; it then moves on to inferential methods that take account of randomness in the data.

1.3   Variation

1.3.1   Signal and noise

In many situations, information (signal) can be obscured by random variation (noise).

1.3.2   Natural variability

When data are collected from 'individuals', they often vary considerably.

1.3.3   Variability caused by experiments

Intentional changes to experimental conditions may also cause systematic differences in variables. Natural variability makes it harder to interpret experimental results.

1.3.4   Variability in survey data

The natural variability of individuals also makes it harder to interpret information from surveys.

1.3.5   Explained and unexplained variation

Some variation in a variable can be explained in terms of other recorded variables. Other variation is a result of natural variability in the individuals.
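One common way to quantify this split, shown here as an illustration and not necessarily the approach CAST takes, is to break the total sum of squares about the overall mean into a part explained by a grouping variable and a part left unexplained within groups. The data below are invented.

```python
from statistics import fmean

# Hypothetical measurements, grouped by a recorded categorical variable.
groups = {"A": [4.0, 6.0], "B": [9.0, 11.0]}

values = [v for vals in groups.values() for v in vals]
overall = fmean(values)

# Total variation: squared deviations of every value from the overall mean.
total_ss = sum((v - overall) ** 2 for v in values)

# Explained variation: differences between group means and the overall mean.
explained_ss = sum(
    len(vals) * (fmean(vals) - overall) ** 2 for vals in groups.values()
)

# Unexplained variation: deviations of values from their own group mean.
unexplained_ss = sum(
    (v - fmean(vals)) ** 2 for vals in groups.values() for v in vals
)
```

For these numbers the total of 29 splits into 25 explained by the grouping and 4 left as variability between individuals in the same group.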

1.3.6   Predicting future variation

Variation in a data set can help us to predict the values that might occur if further data of the same kind are collected in the future.