Signal and noise for a variable

When considering variation in a single variable, the ideas of signal and noise are better described in terms of explained and unexplained variation.

Explained variation
This is the extent to which other variables in the data set account for differences between the values of the variable of interest. It is the signal.
Unexplained variation
Some differences between the values cannot be explained in terms of the changing values of other variables in the data set. The unexplained variation is often noise.

The ideas of explained and unexplained variation are appropriate to most data sets, whether they arise from experiments or surveys.
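One common way to make this split concrete is the sum-of-squares decomposition, sketched here under the assumption of a least-squares linear model that includes an intercept (the text itself does not commit to a particular model). The symbols are introduced only for illustration: y_i are the values of the variable of interest, \bar{y} is their mean, and \hat{y}_i are the values predicted from the other variables.

```latex
\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{\text{total variation}}
  = \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{\text{explained}}
  + \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{unexplained}}
```

When no other variables are available, the only possible prediction is \hat{y}_i = \bar{y}, so the explained term is zero and all of the variation is unexplained.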

Data sets where all variation is unexplained

In some data sets, none of the variation in the variable of interest can be explained in terms of other variables that have been recorded.

Strength measurements

In an ergonomic study involving a group of 41 male students from the University of Hong Kong, each student was asked to exert maximum upward force on a horizontal bar that was close to floor level, with his feet 400 mm away from the bar. The force, averaged over a 5-second period, is called the 'maximum voluntary isometric strength' (MVIS) and is shown below in kilograms.

MVIS (kg)
33
16
35
33
47
40
18
54
18
44
21
29
12
12
26
10
12
20
31
12
19
36
23
22
20
15
15
16
20
13
25
26
41
14
20
22
19
18
23
26
19

There is considerable variation in these strength measurements. Although some of this variability will undoubtedly be associated with other physical characteristics of the students, no other measurements were taken from the students. For a statistical analysis of these data, we have no choice but to treat all variation in these data as 'unexplained'.
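As a minimal sketch of what treating all variation as unexplained can look like in practice, the following summarises the 41 MVIS values using only their mean and spread. The variable names are illustrative assumptions; the data are copied from the table above.

```python
from statistics import mean, stdev

# MVIS values (kg) for the 41 students, copied from the table above.
mvis = [33, 16, 35, 33, 47, 40, 18, 54, 18, 44,
        21, 29, 12, 12, 26, 10, 12, 20, 31, 12,
        19, 36, 23, 22, 20, 15, 15, 16, 20, 13,
        25, 26, 41, 14, 20, 22, 19, 18, 23, 26, 19]

n = len(mvis)
centre = mean(mvis)

# With no other variables recorded, the only available "fitted value" is
# the overall mean, so every deviation from it is unexplained variation.
ss_unexplained = sum((y - centre) ** 2 for y in mvis)

print(f"n = {n}, mean = {centre:.2f} kg, sd = {stdev(mvis):.2f} kg")
print(f"unexplained sum of squares = {ss_unexplained:.1f}")
```

The standard deviation printed here is simply another way of expressing the same unexplained sum of squares, scaled by n - 1.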

Data sets with explained and unexplained variation

In other data sets, some of the variation in the variable of interest can be explained in terms of other variables whose values are available, but part of the variation remains unexplained.

A statistical analysis often separates and describes these two components of the variation. Both provide useful information.

Experiment: Surface finish from lathe

A mechanical engineer is investigating the surface finish of metal parts produced on a lathe and its relationship to the speed (in RPM) of the lathe. Twenty parts were produced at different lathe speeds and using two different types of cutting tool (code numbers 302 and 416).

Surface finish    RPM    Type of cutting tool
45.44             225    302
42.03             200    302
50.10             250    302
48.75             245    302
47.92             235    302
47.79             237    302
52.26             265    302
50.52             259    302
45.58             221    302
44.78             218    302
33.50             224    416
31.23             212    416
37.52             248    416
37.13             260    416
34.70             243    416
33.92             238    416
32.13             224    416
35.47             251    416
33.49             232    416
32.29             216    416

A large part of the variability in surface finish can be explained by differences in RPM and in the cutting tool that was used. However, some variability in surface finish remains that cannot be explained by these variables; this is the unexplained variation.
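The split between explained and unexplained variation in the lathe data can be sketched as follows. This assumes a parallel-lines model (a separate intercept for each cutting tool and a common slope on RPM) fitted by least squares; the model choice and all names are assumptions made for illustration and are not specified in the text.

```python
# Lathe data copied from the table above: (surface finish, RPM, tool type).
data = [
    (45.44, 225, 302), (42.03, 200, 302), (50.10, 250, 302), (48.75, 245, 302),
    (47.92, 235, 302), (47.79, 237, 302), (52.26, 265, 302), (50.52, 259, 302),
    (45.58, 221, 302), (44.78, 218, 302), (33.50, 224, 416), (31.23, 212, 416),
    (37.52, 248, 416), (37.13, 260, 416), (34.70, 243, 416), (33.92, 238, 416),
    (32.13, 224, 416), (35.47, 251, 416), (33.49, 232, 416), (32.29, 216, 416),
]

def average(values):
    return sum(values) / len(values)

tools = sorted({tool for _, _, tool in data})

# Mean RPM and mean finish within each tool type.
group_means = {
    tool: (average([r for _, r, t in data if t == tool]),
           average([f for f, _, t in data if t == tool]))
    for tool in tools
}

# Least-squares common slope: pooled within-tool cross-products of RPM and
# finish divided by pooled within-tool squared deviations of RPM.
sxy = sum((r - group_means[t][0]) * (f - group_means[t][1]) for f, r, t in data)
sxx = sum((r - group_means[t][0]) ** 2 for _, r, t in data)
slope = sxy / sxx

# One intercept per tool, so each fitted line passes through its group means.
intercepts = {t: group_means[t][1] - slope * group_means[t][0] for t in tools}

finish = [f for f, _, _ in data]
fitted = [intercepts[t] + slope * r for _, r, t in data]
overall = average(finish)

ss_total = sum((y - overall) ** 2 for y in finish)
ss_explained = sum((yh - overall) ** 2 for yh in fitted)
ss_unexplained = sum((y - yh) ** 2 for y, yh in zip(finish, fitted))
r_squared = ss_explained / ss_total

print(f"slope on RPM = {slope:.4f}")
print(f"explained = {ss_explained:.1f}, unexplained = {ss_unexplained:.1f}")
print(f"proportion explained = {r_squared:.3f}")
```

Under this assumed model the proportion explained is close to 1, which is consistent with the statement that a large part of the variability is accounted for by RPM and tool type.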

Non-experimental data: Alcoholism and strength

Data were obtained from 50 alcoholic men who were selected from a larger group of alcoholics to be as similar as possible in age and social characteristics. The researchers estimated the total lifetime alcohol consumption (kg per kg body weight) of each individual and measured the strength of a muscle (kg) in that individual's non-dominant arm.

Row  Alcohol  Strength      Row  Alcohol  Strength      Row  Alcohol  Strength
  1      3.5      22.3       18     14.0      25.1       35     28.3      19.3
  2      4.0      20.9       19     14.8      15.5       36     28.6      15.2
  3      5.2      20.9       20     15.7      20.9       37     29.2      13.1
  4      5.2      28.2       21     17.4      20.9       38     29.8      23.3
  5      7.4      29.5       22     17.5      25.1       39     30.8      15.2
  6      9.4      28.2       23     17.7      19.1       40     32.3      21.2
  7      9.7      23.9       24     18.2      12.2       41     32.5      14.0
  8      9.8      22.1       25     18.3      22.2       42     32.9      24.1
  9     10.8      25.1       26     18.9      21.1       43     34.5      15.2
 10     11.1      24.0       27     19.1      17.9       44     34.5      16.2
 11     11.7      20.9       28     19.1      28.2       45     36.2      10.0
 12     12.5      20.9       29     19.7      22.2       46     39.5      10.8
 13     12.6      26.2       30     20.0      21.1       47     39.7      10.0
 14     13.2      15.5       31     22.6      26.3       48     39.7      15.5
 15     13.5      28.4       32     22.8      18.8       49     40.3      16.2
 16     13.7      20.9       33     27.7      18.2       50     40.8      18.2
 17     14.0      21.8       34     28.3      16.2

(Note that the data matrix has 50 rows and 2 columns — it has been split here to fit better on the screen. Note also that the rows of the data matrix — individuals — have been sorted into order of lifetime alcohol consumption. Any ordering of the individuals is equally valid.)

Some of the variation in strength can be explained by the different amounts of alcohol consumed by these men: there is a tendency for those with low alcohol consumption to be stronger. However, much of the variation in strength cannot be explained by differences in alcohol consumption.
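To illustrate how this tendency and the remaining scatter might be quantified, here is a minimal sketch that fits a least-squares straight line of strength on alcohol consumption. The straight-line form is an assumption made for illustration, as are the variable names; the data are copied from the table above.

```python
# Lifetime alcohol consumption (kg per kg body weight) and muscle strength (kg)
# for the 50 men, copied from the table above in row order.
alcohol = [3.5, 4.0, 5.2, 5.2, 7.4, 9.4, 9.7, 9.8, 10.8, 11.1,
           11.7, 12.5, 12.6, 13.2, 13.5, 13.7, 14.0, 14.0, 14.8, 15.7,
           17.4, 17.5, 17.7, 18.2, 18.3, 18.9, 19.1, 19.1, 19.7, 20.0,
           22.6, 22.8, 27.7, 28.3, 28.3, 28.6, 29.2, 29.8, 30.8, 32.3,
           32.5, 32.9, 34.5, 34.5, 36.2, 39.5, 39.7, 39.7, 40.3, 40.8]
strength = [22.3, 20.9, 20.9, 28.2, 29.5, 28.2, 23.9, 22.1, 25.1, 24.0,
            20.9, 20.9, 26.2, 15.5, 28.4, 20.9, 21.8, 25.1, 15.5, 20.9,
            20.9, 25.1, 19.1, 12.2, 22.2, 21.1, 17.9, 28.2, 22.2, 21.1,
            26.3, 18.8, 18.2, 16.2, 19.3, 15.2, 13.1, 23.3, 15.2, 21.2,
            14.0, 24.1, 15.2, 16.2, 10.0, 10.8, 10.0, 15.5, 16.2, 18.2]

n = len(alcohol)
x_bar = sum(alcohol) / n
y_bar = sum(strength) / n

# Least-squares slope and intercept for strength = intercept + slope * alcohol.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(alcohol, strength))
sxx = sum((x - x_bar) ** 2 for x in alcohol)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * x for x in alcohol]

ss_total = sum((y - y_bar) ** 2 for y in strength)
ss_explained = sum((yh - y_bar) ** 2 for yh in fitted)
ss_unexplained = sum((y - yh) ** 2 for y, yh in zip(strength, fitted))
r_squared = ss_explained / ss_total

print(f"slope = {slope:.3f} kg strength per unit of alcohol consumption")
print(f"explained = {ss_explained:.1f}, unexplained = {ss_unexplained:.1f}")
print(f"proportion explained = {r_squared:.3f}")
```

A negative slope corresponds to the tendency for men with lower consumption to be stronger, while the unexplained sum of squares measures the scatter that alcohol consumption does not account for under this assumed model.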