Reads data from a file (P.W. Lane).
|PRINT|What output to display (
|NAME|External name of the data file; no default in batch mode, name is prompted for in interactive mode|
|END|What string terminates data; default
|MISSING|What character represents missing values; default
|SKIP|Number of lines to skip at the start of the file, or string to indicate the record before the first record of data; default 0|
|MAXCATEGORY|The maximum number of categories for which a structure is defined to be a factor unless otherwise specified by
|COMMENTSYMBOLS|What characters to treat as introducing comments if found in the first column at the start of the file; default double-quote character (
|IMETHOD|How identifiers are to be specified for the data structures to be read (
|ISAVE|To store the identifiers, whether read or supplied, and to provide suffixed identifiers for data with no specified identifiers|
|SEPARATOR|What (single) character separates successive values; default is the space character|
|IDENTIFIER|Names for the data structures that are to be read; these are prompted for if this is unset when running interactively with
|FGROUPS|Whether to form each data structure into a factor (
|REPRESENTATION|What representation to assume for each data structure (
FILEREAD reads data from a file into suitable structures determined from the data. It can deal with values laid out as follows.
(1) A character file: that is, a normal readable file, or flat file.
(2) Maximum record length of 200 characters.
(3) Contents consist of values for one or more data structures – usually presented as a single rectangular data matrix.
(4) The values for the data structures are recorded in parallel – that is, the first values of all the structures, followed by the second values of all, and so on; usually, each record of the file contains one value of each structure, but multiple values per record and multiple records for each unit can also be dealt with.
(5) Values in a record are separated from each other by the same separator – usually one or more spaces.
(6) Text values must be enclosed in single quotes if they contain a space, comma, backslash, or double quote; single quotes must be used only to enclose textual values, or be duplicated as part of a value that is itself enclosed in single quotes.
(7) Comments are allowed at the start of the file only if every record to be treated as a comment starts with a double quote, or other specified symbol. Alternatively, a specified number of records at the start of the file can be skipped, or any number of records up to and including a specified string.
(8) Identifiers for the columns of the matrix can be read from the first row of data, as long as they are valid, unsuffixed, Genstat identifiers. An exclamation mark after an identifier signals that the structure is to be set up as a factor.
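Rules (5) and (6) can be illustrated with a small sketch. The Python below is not Genstat code and is not how FILEREAD is implemented; the function name `split_record` is hypothetical, and the sketch assumes space-separated values only.

```python
def split_record(record):
    """Split one space-separated record into values.

    Single quotes enclose text values; a doubled single quote inside a
    quoted value stands for one literal quote (rule 6).
    """
    values = []
    current = []
    in_quotes = False
    have_value = False
    i = 0
    while i < len(record):
        ch = record[i]
        if in_quotes:
            if ch == "'":
                if i + 1 < len(record) and record[i + 1] == "'":
                    current.append("'")  # doubled quote: keep one quote
                    i += 1
                else:
                    in_quotes = False    # closing quote
            else:
                current.append(ch)
        elif ch == "'":
            in_quotes = True
            have_value = True
        elif ch == " ":
            # One or more spaces end the current value (rule 5).
            if have_value:
                values.append("".join(current))
                current, have_value = [], False
        else:
            current.append(ch)
            have_value = True
        i += 1
    if have_value:
        values.append("".join(current))
    return values
```

For example, the record `12 'New York' 'O''Brien' 3.5` splits into four values, the second containing a space and the third containing a single quote.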
Information may be numerical or textual. Numerical values are read as variates, and textual as texts, determined by the values in the first complete record or by the REPRESENTATION parameter. If this parameter is unset, FILEREAD searches for the first record in the file with no missing values, and fails if there is no such record. If the REPRESENTATION parameter is set, it determines whether the values of each structure are to be treated as numbers or characters; if set for any structure, this parameter must be set for all of them.
The NAME option of the procedure supplies the name of the file, enclosed in single quotes. In batch mode the name must be supplied, but in interactive mode, FILEREAD will prompt for the name if it is not supplied.
The IMETHOD option controls the specification of identifiers for the structures to be read. With the default, IMETHOD=supply, the identifiers can be listed using the IDENTIFIER parameter, one for each column of the data matrix. If IDENTIFIER is not set when running in interactive mode, FILEREAD will prompt for identifiers; if it is unset when running in batch mode, FILEREAD just reports on the contents of the file, unless option ISAVE is set (see below). If IMETHOD=read, FILEREAD will attempt to read identifiers for the data structures from the first complete record in the file (and the IDENTIFIER parameter is ignored). They must be valid Genstat identifiers, and must not include suffixes. If an exclamation mark is found after (or in) an identifier, then the structure will be set up as a factor unless the FGROUPS parameter is set to leave. (This convention matches that used when data is read into a Genstat spreadsheet using menus.) If IMETHOD=none, FILEREAD just reports on the contents of the file without assigning identifiers unless option ISAVE is set.
The ISAVE option can be set to a pointer to store the identifiers read from the file (if IMETHOD=read) or supplied interactively (if IMETHOD=supply). If IMETHOD=none in either mode, or IMETHOD=supply and the IDENTIFIER parameter is unset in batch mode, the data structures can be referred to using the pointer.
Values on the same record of a file must be separated from each other by at least one space unless the SEPARATOR option is set. This option can nominate any single character to be treated as the data separator. The MISSING and END options specify the missing-value and end-of-file symbols.
If the number of identifiers is not specified, the number of data structures is taken to be the number of values on the first record with no missing values. But if identifiers are supplied in the IDENTIFIER parameter, or read from the data file, it is possible to read several units of data from each record, or each unit from several records. If there are more values on the first record of data than there are identifiers, the type of each data structure can be determined only by its first value: FILEREAD will fail if any first value is missing, unless the REPRESENTATION parameter is set. If there are fewer values on the first record of data than there are identifiers, FILEREAD will fail regardless of the absence of missing values unless the REPRESENTATION parameter is set.
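The parallel layout described above, in which values cycle through the structures regardless of where records end, can be sketched as follows. This is an illustrative Python sketch, not Genstat code; the function name `distribute` is hypothetical, and raising an error when the values do not divide evenly is an assumption rather than documented FILEREAD behaviour.

```python
def distribute(records, identifiers):
    """Assign values to structures in parallel order.

    Values are taken in file order, ignoring record boundaries: the first
    value goes to the first structure, the second to the second, and so on,
    cycling back to the first structure after the last.
    """
    columns = {name: [] for name in identifiers}
    flat = [value for record in records for value in record.split()]
    if len(flat) % len(identifiers) != 0:
        # Assumed behaviour for this sketch only.
        raise ValueError("values do not divide evenly among the structures")
    for position, value in enumerate(flat):
        columns[identifiers[position % len(identifiers)]].append(value)
    return columns
```

With identifiers `A` and `B`, the records `1 a 2 b` and `3 c` (several units per record) give the same two structures as the records `1`, `a`, `2`, `b`, `3`, `c` (each unit spread over several records).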
FILEREAD reports what structures are set up and tabulates the number of values in each category for structures that have MAXCATEGORY or fewer distinct values. It also displays any comments that it identifies before the start of the data, and the first record of data that contains no missing values. These four reports are controlled by the PRINT option.
The FGROUPS parameter allows structures to be formed automatically into factors. The default setting is check: in interactive mode, FILEREAD then prompts for a decision about any structure where the number of distinct values is less than or equal to the setting of the MAXCATEGORY option; in batch mode, all structures with this few distinct values become factors automatically. FGROUPS can also be set to form or leave to specify explicitly whether each structure should or should not be defined automatically as a factor. (The settings form and leave were introduced in Procedure Library PL21 to remove the confusion arising from the fact that other options and parameters that have no as a setting use no as their default. However, for compatibility with earlier programs, the settings yes and no are still recognised as synonyms for form and leave.)
The COMMENTSYMBOLS option can be set to a list of single characters, in quotes. If any of these characters is found at the start of a record, before any data has been read, that record will be treated as a comment. By default, the double-quote symbol is the only comment symbol, but it must appear at the start of every record to be treated as a comment.
The SKIP option allows records at the start of the file to be skipped altogether. It can be set either to the number of records to be skipped, or to a string, indicating that all records are to be skipped up to and including the first record containing that string.
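The combined effect of skipping and leading comments can be sketched in Python. This is an illustration of the behaviour described above, not FILEREAD's own code; the function name `data_records` and its arguments are hypothetical. It assumes skipping is applied first and that blank records before the data are passed over along with comments.

```python
def data_records(lines, skip=0, comment_symbols='"'):
    """Return the records that remain after skipping and leading comments."""
    lines = [line.rstrip("\n") for line in lines]
    if isinstance(skip, str):
        # Skip up to and including the first record containing the string.
        for index, line in enumerate(lines):
            if skip in line:
                start = index + 1
                break
        else:
            raise ValueError(f"no record contains {skip!r}")
    else:
        start = skip  # a plain count of records to skip
    symbols = set(comment_symbols)
    # Pass over blank records and comments until the first data record.
    while start < len(lines):
        record = lines[start]
        if record.strip() and record[0] not in symbols:
            break
        start += 1
    # Once data has started, comment symbols are no longer special.
    return lines[start:]
```

Calling `data_records(lines, skip='DATA3', comment_symbols='!')` drops everything up to and including the record containing `DATA3`, then any records starting with `!`, and returns the rest unchanged.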
The file is opened on the first free input channel. The first record is read as a single string, and then individual items are read from the string into a text. This is tested, and the process repeated until a record has been found that is not blank or a comment, and has no missing items. Items are tested to determine if they are valid numbers, and then the whole file is read into variates and texts as appropriate. Each structure is grouped to provide information about numbers of categories.
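The type-detection and grouping steps of this method can be sketched in Python. This is only an outline of the logic described above, not the procedure's Genstat source. The missing-value symbol `*`, the function names, and the use of Python's `float` as the test for a valid number are all assumptions made for the sketch.

```python
from collections import Counter

MISSING = "*"  # assumed missing-value symbol, for illustration only


def is_number(item):
    """Return True if the item parses as a number (Python's rules)."""
    try:
        float(item)
    except ValueError:
        return False
    return True


def infer_types(rows, missing=MISSING):
    """Classify columns from the first record with no missing items."""
    for row in rows:
        if missing not in row:
            return ["variate" if is_number(item) else "text" for item in row]
    raise ValueError("no record is free of missing values")


def factor_candidates(rows, names, max_category=10, missing=MISSING):
    """Count categories per structure; keep those within max_category."""
    candidates = {}
    for name, column in zip(names, zip(*rows)):
        counts = Counter(value for value in column if value != missing)
        if len(counts) <= max_category:
            candidates[name] = counts
    return candidates
```

Here `infer_types` mirrors the search for the first complete record, and `factor_candidates` mirrors the grouping that yields the category counts used when deciding which structures may become factors.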
Commands for: Input and output.
CAPTION 'FILEREAD example',\
        !t('No example can be provided for FILEREAD because',\
        'it needs an external datafile. However, the examples',\
        'below show ways of calling the procedure.'); STYLE=meta,plain
SCALAR  Chan
ENQUIRE Chan; FILETYPE=output; OUTSTYLE=Style
OUTPUT  [STYLE=plain]
PRINT   [SQUASH=yes] '\
FILEREAD
  - In interactive mode, the procedure will prompt for a file name
    and identifiers, and whether to turn into factors any structures
    found to have 10 or fewer categories. This statement will fail
    in batch mode.
FILEREAD [NAME=''abc.dat'']
  - In interactive mode, as above for file abc.dat. In batch mode,
    report the contents of file abc.dat; however the data cannot
    then be referred to as no identifiers are given.
FILEREAD [NAME=''abc.dat''; IMETHOD=read; ISAVE=data]
  - In either mode, read identifiers from the first complete data
    record, then deal with factors as above. The data can be
    referred to either by the identifiers read, or as data[1],
    data[2] and so on.
FILEREAD [PRINT=*; NAME=''abc.dat''; SKIP=5; COMMENT=''!''] A,B,C
  - Read the data in file abc.dat into variates or text structures
    called A, B and C, without any reports. The first five records
    are skipped, and any subsequent records beginning with
    exclamation mark until an uncommented record with data is found.
    Formation of factors is dealt with as above.
FILEREAD [NAME=''abc.dat''] X,Y,F,Z; FGROUPS=leave,leave,form,leave
  - Read the data in file abc.dat into data structures called
    X, Y, F and Z, redefining F to be a factor.
FILEREAD [NAME=''abc.dat''; SEPARATOR='',''] A,B; REPRESENT=characters
  - Read the data in file abc.dat into data structures called
    A and B, assuming that values on the same record are separated
    by commas, and that both structures are to be texts. Each record
    may contain any number of values of the structures, as long as
    they are in parallel: that is, one for A, then one for B, and
    so on.
FILEREAD [NAME=''abc.dat''; SKIP=''DATA3''; END=''DATA''] A,B
  - Read the data in file abc.dat into data structures called
    A and B, starting from the first record after the record
    containing the string DATA3, and finishing with the next record
    containing the string DATA.
'; JUST=left; SKIP=0
OUTPUT  [STYLE=#Style]
SKIP    [FILETYPE=output] 1