
Chapter 1   Introduction: About Data

1.1   Context

1.1.1   Importance of context

What is the purpose of Statistics?

When non-statisticians think of statistics, the first thing that usually comes to mind is data. Large amounts of economic, social and administrative data are routinely collected and published. Most researchers also collect data. Statistical analysis uses data, but the data are not the goal.

Data are the basic commodity of statistics. Without data, there is no information on which to reach conclusions or base decisions.

Data contain information.
The purpose of statistics is to extract information from data.

Large data sets must be summarised before patterns and relationships can be seen. In smaller data sets, the problem is usually that there is not enough information to get a clear answer to questions of importance. Statistical methods are needed to describe precision and to ensure that the highest precision is obtained.

Context

In general, users of statistics are interested in neither data nor statistical methods; they are only interested in questions in their own subject area.

The aim of statistics is to supply useful information to people whose main area of expertise is not statistics. Statistical methods are only useful if they can extract information from data to help answer discipline-specific questions. The underlying context is therefore the most important aspect of any statistical analysis.

If you are not primarily a statistician, you will appreciate statistical methods when they are needed in your career!

1.1.2   Statistics

Simple series of steps

The simplest application of statistics addresses a single question in the context of some practical subject area.

Statistics has a role to play in all stages of this process.

Feedback

In practice, the initial question is usually not so well defined, and a single pass through the process is not enough.

An example of feedback arises when a small fraction of the data is initially collected and analysed. The information obtained from this pilot study is used to refine the data collection process.

Another example of feedback occurs when the initial analysis reveals unusual or unexpected features in the data. Such features may suggest further questions and therefore further data collection.

1.1.3   The statistical process

Continuous quality improvement

Statistical analysis is an important part of long-term monitoring and improvement of the performance of many types of system. This process is often called continuous quality improvement.

The statistical part of the process again involves a feedback cycle of data collection and analysis, aimed at improving aspects of the system.

The Plan-Do-Check-Act cycle is most often used in commerce and industry, but can also be used to 'improve' many biological and other systems.

1.2   Standard data structures

1.2.1   Variables and individuals

Data structure

Context is critically important, but the statistical methods that can be used on data depend mostly on the internal structure of the data.

Employees

Gender    Age
Male       62
Male       46
Male       51
Male       29
Male       44
Male       30
Female     52
Female     55
Female     28
Female     63
Female     41
Female     33

Telephone survey

Location of district    Response rate
Rural                   62
Rural                   46
Rural                   51
Rural                   29
Rural                   44
Rural                   30
Urban                   52
Urban                   55
Urban                   28
Urban                   63
Urban                   41
Urban                   33

Camp site use on 12 days

Weather    Number of tents
Dry        62
Dry        46
Dry        51
Dry        29
Dry        44
Dry        30
Wet        52
Wet        55
Wet        28
Wet        63
Wet        41
Wet        33

These three data sets have the same basic structure, so the same statistical methods can be applied to all of them.

Variables and 'individuals'

Most data sets have a fairly simple structure. One or more measurements ('variables') are recorded from each of a collection of 'individuals' (also called 'cases' or 'units'). The data can be presented in a data matrix.
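
As a minimal sketch, the employee data above can be held as a data matrix in Python, with one row per individual and one entry per variable. The names `data_matrix` and `column` are illustrative assumptions, not part of the e-book.

```python
# Data matrix: each row is an 'individual', each key is a 'variable'.
# Values are the employee data shown earlier in this section.
genders = ["Male"] * 6 + ["Female"] * 6
ages = [62, 46, 51, 29, 44, 30, 52, 55, 28, 63, 41, 33]

data_matrix = [{"Gender": g, "Age": a} for g, a in zip(genders, ages)]


def column(matrix, variable):
    """Return all values of one variable (one column of the data matrix)."""
    return [row[variable] for row in matrix]


print(len(data_matrix))             # number of individuals
print(column(data_matrix, "Age"))   # values of the numerical variable
```

The telephone survey and camp site data would fit the same layout with only the variable names changed, which is what "the same basic structure" means here.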

1.2.2   Types of variable

Most variables in a data set are either numerical or categorical.

Numerical variables

These have values that are numbers and can be further classified into:

Discrete numerical variable
A variable whose values are whole numbers (counts) is called discrete.
Continuous numerical variable
A variable that may take any value within some range is called continuous.

Statistical methods that can be used for continuous variables are not always appropriate for discrete variables.

Categorical variables

The values of a categorical variable are selected from a small group of categories. A further classification is:

Ordinal categorical variable
This arises when the categories can be meaningfully ordered.
Nominal categorical variable
If it does not matter which way its categories are ordered, the variable is called nominal.

Most statistical methods for categorical data can be applied to both ordinal and nominal variables.

Labels

In some data sets, each individual has a unique 'name' that can be used to identify it. We call this a label variable.

Warning!

When you see a column of numbers in your data matrix, do not assume that it is a numerical variable.

Numbers are sometimes used as codes for categorical or label variables.
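
A short sketch of why this matters, using hypothetical region codes (the codes and labels are invented for illustration): averaging the codes gives a number with no meaning, whereas counting the categories is sensible.

```python
from collections import Counter
from statistics import mean

# Hypothetical coding scheme: the numbers are only labels for categories.
REGION_LABELS = {1: "North", 2: "South", 3: "East"}
region_codes = [1, 3, 3, 2, 1, 3]

# Numerically valid but meaningless: there is no 'region 2.17'.
meaningless_average = mean(region_codes)

# Treating the codes as categories: count how often each one occurs.
region_counts = Counter(REGION_LABELS[code] for code in region_codes)

print(meaningless_average)
print(region_counts)
```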

1.2.3   Categorical variables and groups

Categorical variables and groups

A categorical variable can be used to split the individuals in a data set into groups. We might treat individuals with values "A", "B", etc. as belonging to different groups.

Conversely, if data were separately collected from different groups of individuals, the resulting data sets could be combined with a categorical variable distinguishing between the groups. Its values might be defined as "A", "B", etc. to identify the group membership of any individual.

A categorical variable and groups are often two ways of representing the same data.

Data presented in a separate list for each group are called unstacked whereas if the data are presented as a single list alongside a categorical variable, they are called stacked.
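
The two layouts can be converted into each other without losing information. The sketch below assumes a simple representation (a dictionary of lists for unstacked data, a list of pairs for stacked data); the function names and the measurements are hypothetical.

```python
from collections import defaultdict


def stack(unstacked):
    """Combine per-group lists into a single list of (group, value) pairs."""
    return [(group, value)
            for group, values in unstacked.items()
            for value in values]


def unstack(stacked):
    """Split (group, value) pairs into one list of values per group."""
    groups = defaultdict(list)
    for group, value in stacked:
        groups[group].append(value)
    return dict(groups)


# Hypothetical measurements for two groups "A" and "B".
unstacked = {"A": [3.1, 2.8], "B": [4.0, 3.7, 3.9]}
stacked = stack(unstacked)

print(stacked)
print(unstack(stacked) == unstacked)  # the round trip recovers the original
```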

1.2.4   Meaningful variables

Defining new variables

When given a data set to analyse, think about whether its variables are the most useful ones to analyse. Sometimes a simple transformation provides a variable whose values are more meaningful or highlight a different aspect of the data.

For example, although the raw GDP of countries indicates their relative importance, their GDP per capita gives a better comparison of the relative wealth of individuals in these countries:

 

GDP per capita   =   (Country's total GDP) / Population

As another example, although it is interesting to compare the calorie intake of countries, it is also appropriate to look at their percentage change over different periods:

% Increase   =   100  ×  (1998 calories − 1993 calories) / (1993 calories)
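
Both derived variables are simple to compute. The following is a minimal sketch; the function names and the input figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def gdp_per_capita(total_gdp, population):
    """Total GDP divided by population."""
    return total_gdp / population


def percent_increase(earlier, later):
    """Percentage change from the earlier value to the later value."""
    return 100 * (later - earlier) / earlier


# Hypothetical figures, not real country data.
print(gdp_per_capita(2_000_000_000, 4_000_000))  # 500.0
print(percent_increase(2500, 2750))              # 10.0
```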

1.2.5   Time-ordered data

Time series

Many basic statistical methods assume that the 'individuals' in a data matrix are unordered — any rearrangement of the rows would give the same information.

However, sometimes the rows of the data matrix are ordered, usually by time. Data of this kind are called time series.

The ordering of values can be described by an extra 'time' or 'ordering' variable in the data matrix.
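
A small sketch of this idea, using hypothetical values: once each row carries its own 'time' value, the rows can be rearranged and the original order can still be recovered.

```python
import random

# Hypothetical values, listed in time order.
values = [12, 15, 11, 18, 21]

# Add an explicit ordering variable so each row records its own position.
rows = [{"time": t, "value": v} for t, v in enumerate(values, start=1)]

# Rearrange the rows (fixed seed so the sketch is repeatable).
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

# Sorting on the 'time' variable restores the original order.
restored = sorted(shuffled, key=lambda row: row["time"])
print(restored == rows)
```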

1.2.6   Multi-level data (advanced)

Clusters of 'individuals'

In some data sets, the basic 'individuals' are arranged in fairly small groups or clusters. For example, surveys are often conducted by sampling households, then recording information from every member of each household (a 'household survey').

A categorical variable could distinguish between the groups, but its values (e.g. the household names) would be of little direct interest.

Data at different levels

Some measurements are usually recorded at group level rather than individual level. These values could be stored in a separate group-level data matrix.

Information can be exchanged between the two data matrices in order to analyse both sets of data together.

Group level → individual level
The values of group-level variables could be copied into new variables in the individual-level data matrix.
Individual level → group level
Summaries from each group of individuals, such as household size or maximum age, could be added to the group-level data matrix.

Information can be obtained from multi-level data by examining both the group-level and individual-level data matrices.
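
The exchange of information between the two levels might look like the sketch below. Both data matrices, and the variable names `region`, `size` and `max_age`, are hypothetical and invented for illustration.

```python
# Group-level data matrix: one entry per household.
households = {
    "H1": {"region": "Urban"},
    "H2": {"region": "Rural"},
}

# Individual-level data matrix: one row per household member.
people = [
    {"household": "H1", "age": 34},
    {"household": "H1", "age": 8},
    {"household": "H2", "age": 61},
]

# Group level -> individual level: copy the household's region to each member.
for person in people:
    person["region"] = households[person["household"]]["region"]

# Individual level -> group level: summarise the members of each household.
for name, household in households.items():
    ages = [p["age"] for p in people if p["household"] == name]
    household["size"] = len(ages)
    household["max_age"] = max(ages)

print(people)
print(households)
```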

Properly analysing multi-level data and interpreting the results of the analysis require a lot of careful thought!

1.2.7   Structure of the e-book

Exploratory data analysis

The initial chapters of this e-book describe graphical and numerical ways to explore and summarise data. Appropriate methods depend on the structure of the data set — the number and types of its variables.

Data collection

Statisticians should be involved before any data are collected. Applying statistical principles to the data collection process ensures that the resulting data can be meaningfully analysed. Chapters 7 and 8 explore the idea of random sampling and describe some principles that should be followed in data collection.

Inference

To fully understand the information that is contained in most data sets, we must take account of randomness — if we collected the data again, the values would often be different. The relevant statistical methods are collectively called inference. Again, the details of the statistical analysis depend mostly on the structure of the data set — the number and types of its variables.

1.3   Variation

1.3.1   Signal and noise

Signal and noise

Electronics and telecommunications engineers distinguish between the signal that is being communicated between two locations and the random noise that is added by the communications channel. The noise degrades the signal and, in the worst cases, can make the signal difficult to detect.

The word "CAST" is hard to read in the following noisy image.

Audio equipment often quotes its signal-to-noise ratio as a measure of quality. Noise again deteriorates the signal — the 'perfect' music that you want to hear.

Signal   =   information you want
Noise   =   'random' modification to the signal

The concepts of signal and noise also apply to data sets.

1.3.2   Natural variability

Reasons for variability

Most statistical data sets contain measurements from a collection of 'individuals'. These individuals are not identical so measurements made from them also vary from individual to individual.

Even when the 'individuals' are very similar, recorded measurements from them often vary due to:

1.3.3   Variability caused by experiments

Intentional differences

In experiments, different 'individuals' are given different experimental treatments with the intention of comparing these treatments. We hope to find whether changing the experimental treatment causes differences in the measurements.

Signal and noise in experimental data

In experimental data sets, the variability caused by different experimental conditions is the signal in the data since the intention of the experiment is to determine the effect of these differences.

However the signal in the data is usually obscured by the natural variability in the data — the noise in the data set.

In the experimental data below, it is difficult to assess whether the fertiliser has increased plant yield because of the natural variability between plants.

Yield    Group
6.57     No fertiliser
6.53     No fertiliser
4.71     No fertiliser
5.32     No fertiliser
6.15     No fertiliser
5.08     No fertiliser
6.17     No fertiliser
4.93     No fertiliser
3.16     No fertiliser
4.57     No fertiliser
6.13     Fertiliser
7.49     Fertiliser
6.00     Fertiliser
4.84     Fertiliser
9.01     Fertiliser
6.11     Fertiliser
5.55     Fertiliser
5.48     Fertiliser
6.84     Fertiliser
6.18     Fertiliser

Statistics tries to estimate the signal from data that also contain noise.
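
One simple way to estimate the signal in the fertiliser data is to compare the mean yields of the two groups. The sketch below assumes that the first ten yields in the table belong to the "No fertiliser" group and the last ten to the "Fertiliser" group, as the order of the table suggests. The difference in means is only a rough estimate, not a formal analysis.

```python
from statistics import mean

# Yields from the table above, split by group (pairing assumed from row order).
no_fertiliser = [6.57, 6.53, 4.71, 5.32, 6.15, 5.08, 6.17, 4.93, 3.16, 4.57]
fertiliser = [6.13, 7.49, 6.00, 4.84, 9.01, 6.11, 5.55, 5.48, 6.84, 6.18]

mean_without = mean(no_fertiliser)
mean_with = mean(fertiliser)

# Estimated signal: the difference between the group means.
estimated_effect = mean_with - mean_without

# Noise: the two groups overlap, so single plants cannot settle the question.
groups_overlap = min(fertiliser) < max(no_fertiliser)

print(round(mean_without, 3), round(mean_with, 3), round(estimated_effect, 3))
print(groups_overlap)
```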

1.3.4   Variability in survey data

Surveys

Some types of data are obtained from experiments. In experiments, we actively change some characteristics of each individual — choosing an experimental treatment.

Other types of data are obtained by selecting a sample of individuals in some way and simply recording information about them — a survey.

Summarising data

Survey data sets are often large and are summarised by numerical values called summary statistics.

Natural variability between individuals means that summary statistics must be considered as random quantities — similarly collected data would result in different values.

An important role of statistics is to understand and describe the randomness of such summary statistics.
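
The randomness of a summary statistic can be illustrated by simulation. The sketch below draws several samples from a hypothetical population (its size, mean and spread are invented for illustration) and shows that the sample mean differs from sample to sample.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Collect five samples of 25 individuals in the same way,
# and summarise each sample by its mean.
sample_means = [mean(rng.sample(population, 25)) for _ in range(5)]

print([round(m, 2) for m in sample_means])
```

Each run of the data collection gives a different summary statistic, even though the population has not changed.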

1.3.5   Explained and unexplained variation

Types of variation

The ideas of signal and noise correspond to explained and unexplained variation in a variable, X.

Explained variation
This is the extent to which other variables in the data set explain differences between the values of X. It is the signal.
Unexplained variation
Some differences between the values of X cannot be explained in terms of the changing values of other variables in the data set. The unexplained variation is noise.

In some data sets, none of the variation in X can be explained in terms of other variables that have been recorded. In other data sets, some of the variation in X can be explained in terms of other variables whose values are available, but part of its variation remains unexplained.

A statistical analysis often separates and describes these two components of the variation. Both provide useful information.
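
One common way to make this separation concrete, which the text above does not itself specify, is to split the total sum of squares of X into a between-group part and a within-group part. The sketch below applies this to the fertiliser yields from section 1.3.3, assuming the first ten yields are the "No fertiliser" group.

```python
from statistics import mean

# Yields from section 1.3.3 (group membership assumed from row order).
groups = {
    "No fertiliser": [6.57, 6.53, 4.71, 5.32, 6.15, 5.08, 6.17, 4.93, 3.16, 4.57],
    "Fertiliser": [6.13, 7.49, 6.00, 4.84, 9.01, 6.11, 5.55, 5.48, 6.84, 6.18],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)
group_means = {name: mean(values) for name, values in groups.items()}

# Total variation of the yields about the overall mean.
total_ss = sum((x - grand_mean) ** 2 for x in all_values)

# Explained variation: differences between the group means.
explained_ss = sum(len(values) * (group_means[name] - grand_mean) ** 2
                   for name, values in groups.items())

# Unexplained variation: differences between plants within the same group.
unexplained_ss = sum((x - group_means[name]) ** 2
                     for name, values in groups.items()
                     for x in values)

print(round(explained_ss, 3), round(unexplained_ss, 3), round(total_ss, 3))
```

The explained and unexplained sums of squares add up to the total, so the two components account for all of the variation in the yields.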

1.3.6   Predicting future variation

Variation and prediction

In earlier pages of this section, we treated unexplained variation in data as 'noise' — a nuisance that cannot be avoided and that only serves to complicate the analysis of data.

This is not totally true. Variation can be interesting in its own right, especially when we are interested in predicting the future. If we can assume that there will be no systematic change to the process that generated the observed values in our data,

... the proportion of times that an event happened in the past indicates the chance that it will happen in the future.
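
As a minimal sketch of this idea, the chance of an event can be estimated by the proportion of past occasions on which it happened. The record of past outcomes below is hypothetical, and the estimate is only reasonable under the assumption stated above that the process does not change.

```python
def estimated_chance(past_outcomes):
    """Proportion of past occasions on which the event happened."""
    return sum(past_outcomes) / len(past_outcomes)


# Hypothetical record: True marks the occasions on which the event happened.
past = [True, False, False, True, False, False, False, True, False, False]

print(estimated_chance(past))  # 0.3
```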