Signal and noise for a variable

When considering variation in one variable, the ideas of signal and noise are better explained in terms of explained and unexplained variation.

Explained variation
This is the extent to which differences between the values of the variable of interest can be explained by other variables in the data set. It is the signal.
Unexplained variation
Some differences between the values cannot be explained in terms of the changing values of other variables in the data set. The unexplained variation is often noise.

The ideas of explained and unexplained variation are appropriate to most data sets, whether they arise from experiments or surveys.

Data sets where all variation is unexplained

In some data sets, none of the variation in the variable of interest can be explained in terms of other variables that have been recorded.

Hotel no-shows and late cancellations

In hotels, a proportion of people who reserve rooms never show up (called 'no-shows') or cancel at the last moment. Large hotels therefore overbook under the assumption that a proportion of these 'no-shows' and cancellations will free up space. The table below shows numbers of no-shows and late cancellations in a 500-room hotel during a 30-day period when the hotel had over 90 percent occupancy.

No-shows & late cancellations
25
22
19
24
24
31
25
20
24
18
27
27
20
22
33
28
17
24
23
20
25
23
23
17
20
32
25
28
18
27

There is considerable variation in the numbers, but no other information is available to help explain this variation. For a statistical analysis, we therefore have no choice but to treat all variation in these data as 'unexplained'.
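As a minimal sketch of what treating all variation as 'unexplained' means in practice, the Python below summarises the 30 counts by their mean and measures the spread about that mean. The variable names are illustrative, and the choice of a sum of squared deviations as the measure of variation is an assumption rather than something stated in the text.

```python
import statistics

# No-shows and late cancellations on each of the 30 days (from the table above).
no_shows = [25, 22, 19, 24, 24, 31, 25, 20, 24, 18,
            27, 27, 20, 22, 33, 28, 17, 24, 23, 20,
            25, 23, 23, 17, 20, 32, 25, 28, 18, 27]

# With no explanatory variables, the only available summary of the
# 'typical' value is the overall mean.
mean_count = statistics.mean(no_shows)

# Total variation, measured as the sum of squared deviations from the mean
# (an assumed measure of variation).
total_ss = sum((x - mean_count) ** 2 for x in no_shows)

# Nothing is explained by other variables, so all of it is unexplained.
explained_ss = 0.0
unexplained_ss = total_ss

print(f"mean = {mean_count:.1f}")
print(f"sample standard deviation = {statistics.stdev(no_shows):.2f}")
print(f"unexplained share of variation = {unexplained_ss / total_ss:.0%}")
```

If the day of the week or the number of reservations had been recorded, part of this sum of squares might move from the unexplained to the explained component.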

Data sets with explained and unexplained variation

In other data sets, some of the variation in the variable of interest can be explained in terms of other variables whose values are available, but part of the variation remains unexplained.

A statistical analysis often separates and describes these two components of the variation. Both provide useful information.

Experiment: Surface finish from lathe

A mechanical engineer is investigating the surface finish of metal parts produced on a lathe and its relationship to the speed (in RPM) of the lathe. Twenty parts were produced at different lathe speeds and using two different types of cutting tool (code numbers 302 and 416).

Surface finish     RPM     Type of cutting tool
45.44              225     302
42.03              200     302
50.10              250     302
48.75              245     302
47.92              235     302
47.79              237     302
52.26              265     302
50.52              259     302
45.58              221     302
44.78              218     302
33.50              224     416
31.23              212     416
37.52              248     416
37.13              260     416
34.70              243     416
33.92              238     416
32.13              224     416
35.47              251     416
33.49              232     416
32.29              216     416

A large part of the variability in surface finish can be explained by differences in RPM and in the cutting tool that was used. However, some variability in surface finish remains that cannot be explained by these variables. This is the unexplained variation.
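One way to make this split concrete is sketched below. It assumes a linear least-squares model with an intercept, a common RPM slope for both tools and an indicator for tool 416. The text does not name a model, so this choice, the helper `solve_linear_system` and all variable names are illustrative assumptions.

```python
# Lathe data from the table above.
finish = [45.44, 42.03, 50.10, 48.75, 47.92, 47.79, 52.26, 50.52, 45.58, 44.78,
          33.50, 31.23, 37.52, 37.13, 34.70, 33.92, 32.13, 35.47, 33.49, 32.29]
rpm = [225, 200, 250, 245, 235, 237, 265, 259, 221, 218,
       224, 212, 248, 260, 243, 238, 224, 251, 232, 216]
tool = [302] * 10 + [416] * 10


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


n = len(finish)
mean_rpm = sum(rpm) / n
mean_finish = sum(finish) / n

# Design matrix columns: intercept, RPM centred on its mean, indicator for tool 416.
rows = [[1.0, r - mean_rpm, 1.0 if t == 416 else 0.0] for r, t in zip(rpm, tool)]

# Normal equations (X'X) b = X'y for the least-squares coefficients.
k = 3
xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * y for row, y in zip(rows, finish)) for i in range(k)]
coef = solve_linear_system(xtx, xty)
rpm_slope, tool_416_effect = coef[1], coef[2]

fitted = [sum(c * v for c, v in zip(coef, row)) for row in rows]

# Split the total sum of squares into explained and unexplained parts.
total_ss = sum((y - mean_finish) ** 2 for y in finish)
explained_ss = sum((f - mean_finish) ** 2 for f in fitted)
unexplained_ss = sum((y - f) ** 2 for y, f in zip(finish, fitted))
r_squared = explained_ss / total_ss

print(f"RPM slope = {rpm_slope:.4f}, tool 416 effect = {tool_416_effect:.2f}")
print(f"explained = {explained_ss:.2f}, unexplained = {unexplained_ss:.2f}")
print(f"proportion explained = {r_squared:.3f}")
```

Centring RPM on its mean does not change the fitted values; it only keeps the normal equations well conditioned for the hand-written solver.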

Non-experimental data: Alcoholism and strength

Data were obtained from 50 alcoholic men who were selected from a larger group of alcoholics to be as similar as possible in age and social characteristics. The researchers estimated the total lifetime alcohol consumption (kg per kg body weight) of each individual and measured the strength of a muscle (kg) in that individual's non-dominant arm.

      Alcohol  Strength          Alcohol  Strength          Alcohol  Strength
 1      3.5     22.3       18     14.0     25.1       35     28.3     19.3
 2      4.0     20.9       19     14.8     15.5       36     28.6     15.2
 3      5.2     20.9       20     15.7     20.9       37     29.2     13.1
 4      5.2     28.2       21     17.4     20.9       38     29.8     23.3
 5      7.4     29.5       22     17.5     25.1       39     30.8     15.2
 6      9.4     28.2       23     17.7     19.1       40     32.3     21.2
 7      9.7     23.9       24     18.2     12.2       41     32.5     14.0
 8      9.8     22.1       25     18.3     22.2       42     32.9     24.1
 9     10.8     25.1       26     18.9     21.1       43     34.5     15.2
10     11.1     24.0       27     19.1     17.9       44     34.5     16.2
11     11.7     20.9       28     19.1     28.2       45     36.2     10.0
12     12.5     20.9       29     19.7     22.2       46     39.5     10.8
13     12.6     26.2       30     20.0     21.1       47     39.7     10.0
14     13.2     15.5       31     22.6     26.3       48     39.7     15.5
15     13.5     28.4       32     22.8     18.8       49     40.3     16.2
16     13.7     20.9       33     27.7     18.2       50     40.8     18.2
17     14.0     21.8       34     28.3     16.2

(Note that the data matrix has 50 rows and 2 columns — it has been split here to fit better on the screen. Note also that the rows of the data matrix — individuals — have been sorted into order of lifetime alcohol consumption. Any ordering of the individuals is equally valid.)

Some of the variation in strength can be explained by the different amounts of alcohol consumed by these men: there is a tendency for those with low alcohol consumption to be stronger. However, there is a lot of variation in strength that cannot be explained by differences in alcohol consumption.
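The same split can be sketched for these data. The Python below assumes a straight-line least-squares relationship between strength and lifetime alcohol consumption. That model is an assumption chosen for illustration, not one given in the text, and the variable names are hypothetical.

```python
# Lifetime alcohol consumption and muscle strength for the 50 men (table above).
alcohol = [3.5, 4.0, 5.2, 5.2, 7.4, 9.4, 9.7, 9.8, 10.8, 11.1,
           11.7, 12.5, 12.6, 13.2, 13.5, 13.7, 14.0, 14.0, 14.8, 15.7,
           17.4, 17.5, 17.7, 18.2, 18.3, 18.9, 19.1, 19.1, 19.7, 20.0,
           22.6, 22.8, 27.7, 28.3, 28.3, 28.6, 29.2, 29.8, 30.8, 32.3,
           32.5, 32.9, 34.5, 34.5, 36.2, 39.5, 39.7, 39.7, 40.3, 40.8]
strength = [22.3, 20.9, 20.9, 28.2, 29.5, 28.2, 23.9, 22.1, 25.1, 24.0,
            20.9, 20.9, 26.2, 15.5, 28.4, 20.9, 21.8, 25.1, 15.5, 20.9,
            20.9, 25.1, 19.1, 12.2, 22.2, 21.1, 17.9, 28.2, 22.2, 21.1,
            26.3, 18.8, 18.2, 16.2, 19.3, 15.2, 13.1, 23.3, 15.2, 21.2,
            14.0, 24.1, 15.2, 16.2, 10.0, 10.8, 10.0, 15.5, 16.2, 18.2]

n = len(alcohol)
mean_x = sum(alcohol) / n
mean_y = sum(strength) / n

# Least-squares slope and intercept for strength = intercept + slope * alcohol.
sxx = sum((x - mean_x) ** 2 for x in alcohol)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(alcohol, strength))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in alcohol]

# Explained variation: how far the fitted line moves from the overall mean.
# Unexplained variation: how far the observations lie from the fitted line.
total_ss = sum((y - mean_y) ** 2 for y in strength)
explained_ss = sum((f - mean_y) ** 2 for f in fitted)
unexplained_ss = sum((y - f) ** 2 for y, f in zip(strength, fitted))
r_squared = explained_ss / total_ss

print(f"slope = {slope:.3f} kg of strength per unit of alcohol consumption")
print(f"proportion of variation explained = {r_squared:.2f}")
print(f"proportion left unexplained = {unexplained_ss / total_ss:.2f}")
```

A negative slope corresponds to the tendency described above, and the unexplained proportion indicates how much of the variation in strength this one variable leaves unaccounted for.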