Surveys
Some types of data are obtained from experiments, such as the tomato-growing experiment on the previous page. In an experiment, we actively change some characteristics of each individual, for example by adding fertiliser to half of the tomato plants.
Other types of data are obtained by selecting a sample of individuals in some way and simply recording information about them. This type of data collection is called a survey. Survey data sets are often so large that it is difficult to get any useful information from the raw data in the data matrix.
Summarising data
To understand survey data, it is common to calculate numerical values that summarise aspects of the data. Such summary values are called statistics.
We will describe several useful summary statistics in later chapters of CAST.
An important consequence of the natural variability in data is that summary statistics calculated from the data must also be treated as random quantities: data collected in the same way from different individuals would give different values for the statistics.
A major role of statistics is to understand and describe the randomness of such summary statistics.
The example below illustrates this variability with a very small survey.
Preferences
Food manufacturers often collect 'preference data' to discover which of several recipes for a type of product is preferred by consumers. In the example below, 30 consumers were each given samples of strawberry yoghurt with three different types of strawberry flavouring, A, B and C. Each consumer reported the flavouring that was preferred.
The raw data matrix consists of 30 rows (corresponding to the individuals sampled) and one column (a categorical variable with values A, B or C corresponding to the preference for each individual). This can be summarised by the percentage preferring each of the three flavourings.
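As a rough illustration of this summary step, the sketch below reduces a single categorical column to the percentage preferring each flavouring. The function name `preference_percentages` and the 30 responses are hypothetical, made up for the example; they are not the data from the survey described here.

```python
from collections import Counter

def preference_percentages(preferences, categories=("A", "B", "C")):
    """Return the percentage of responses falling in each category."""
    counts = Counter(preferences)          # tally each distinct response
    n = len(preferences)                   # number of individuals sampled
    return {c: 100.0 * counts[c] / n for c in categories}

# Hypothetical column of 30 responses (invented values, for illustration only)
sample = ["A"] * 12 + ["B"] * 9 + ["C"] * 9

print(preference_percentages(sample))
# → {'A': 40.0, 'B': 30.0, 'C': 30.0}
```

The three percentages are the summary statistics: they replace 30 raw values with three numbers that are much easier to interpret.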
Click 'Another 30 people' a few times to repeat the data collection with different groups of 30 people. The natural variability of the people implies that the proportion preferring each flavouring varies from one group to the next. A full analysis of the data should take account of the variability in the proportion preferring flavouring A.
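The same repetition can be imitated in code. The sketch below draws several independent groups of 30 simulated consumers and reports the proportion preferring each flavouring in each group. The underlying preference probabilities (0.40, 0.35, 0.25) and the name `simulate_survey` are assumptions chosen purely for illustration; the true preferences in a real population are unknown.

```python
import random
from collections import Counter

def simulate_survey(rng, n=30, flavours=("A", "B", "C"),
                    weights=(0.40, 0.35, 0.25)):
    """Simulate one survey of n consumers and return sample proportions.

    The weights are assumed population preferences, not real values.
    """
    sample = rng.choices(flavours, weights=weights, k=n)  # one preference per person
    counts = Counter(sample)
    return {f: counts[f] / n for f in flavours}

rng = random.Random(1)  # fixed seed so the run is repeatable

# Repeat the survey with five different groups of 30 people
results = [simulate_survey(rng) for _ in range(5)]
for i, proportions in enumerate(results, start=1):
    print(f"Survey {i}: {proportions}")
```

Although every simulated survey uses the same assumed population, the proportions generally differ from group to group, which is the sample-to-sample variability that a full analysis needs to take into account.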
How much should you trust a statistic that would be different if you had used different people?
(This e-book should help you to give an answer!)