FILEREAD procedure

Reads data from a file (P.W. Lane).

Options

`PRINT` = string tokens	What output to display (`summary`, `groups`, `comments`, `firstline`); default `summ`, `grou`, `comm`, `firs`
`NAME` = text	External name of the data file; no default in batch mode, name is prompted for in interactive mode
`END` = text	What string terminates data; default `':'` (the end of file also terminates data for any setting); the setting `END=*` is not allowed
`MISSING` = text	What character represents missing values; default `'*'`
`SKIP` = scalar or text	Number of lines to skip at the start of the file, or string to indicate the record before the first record of data; default 0
`MAXCATEGORY` = number	The maximum number of categories for which a structure is defined to be a factor unless otherwise specified by `FGROUPS`; default 10
`COMMENTSYMBOLS` = text	What characters to treat as introducing comments if found in the first column at the start of the file; default double-quote character (`"`)
`IMETHOD` = string token	How identifiers are to be specified for the data structures to be read (`supply`, `read`, `none`); default `supp`
`ISAVE` = pointer	To store the identifiers, whether read or supplied, and to provide suffixed identifiers for data with no specified identifiers
`SEPARATOR` = text	What (single) character separates successive values; default is the space character

Parameters

`IDENTIFIER` = identifiers	Names for the data structures that are to be read; these are prompted for if this is unset when running interactively with `IMETHOD=supply`; identifiers are redefined if they have been used previously
`FGROUPS` = string tokens	Whether to form each data structure into a factor (`check`, `form`, `leave`); default `chec`, which causes `FILEREAD` when running interactively to ask about any structure whose number of distinct values is less than or equal to `MAXCATEGORY`, and when running in batch to define as factors all structures with `MAXCATEGORY` or fewer distinct values (note: for compatibility with earlier releases, `yes` and `no` can be used as synonyms of `form` and `leave`)
`REPRESENTATION` = string tokens	What representation to assume for each data structure (`numbers`, `characters`); default unset – representation is determined by whether the first value is a number; if set for one structure, this parameter must be set for all structures

Description

FILEREAD reads data from a file into suitable structures determined from the data. It can deal with values laid out as follows.

(1) A character file: that is, a normal readable file, or flat file.

(2) Maximum record length of 200 characters.

(3) Contents consist of values for one or more data structures – usually presented as a single rectangular data matrix.

(4) The values for the data structures are recorded in parallel – that is, the first values of all the structures, followed by the second values of all, and so on; usually, each record of the file contains one value of each structure, but multiple values per record and multiple records for each unit can also be dealt with.

(5) Values in a record are separated from each other by the same separator – usually one or more spaces.

(6) Text values must be enclosed in single quotes if they contain a space, comma, backslash, or double quote; single quotes must be used only to enclose textual values, or be duplicated as part of a value that is itself enclosed in single quotes.

(7) Comments are allowed at the start of the file only if every record to be treated as a comment starts with a double quote, or other specified symbol. Alternatively, a specified number of records at the start of the file can be skipped, or any number of records up to and including a specified string.

(8) Identifiers for the columns of the matrix can be read from the first row of data, as long as they are valid, unsuffixed, Genstat identifiers. An exclamation mark after an identifier signals that the structure is to be set up as a factor.
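As an illustration of the layout rules above, the following is a minimal sketch of a file that FILEREAD should be able to read. The file name `trial.dat`, the column names and the values are all hypothetical. The first record is a comment, the second supplies identifiers (with `Variety` marked as a factor), the text value containing a space is enclosed in single quotes, `*` marks a missing value, and `:` terminates the data.

```
"Hypothetical field trial data
Plot Variety! Yield
1 'Golden Promise' 4.2
2 Maris 3.9
3 Maris *
4 'Golden Promise' 4.5
:
```

Assuming that file, a call along these lines would read the identifiers from the file itself:

```genstat
" trial.dat is a hypothetical file name; identifiers come from the first complete record "
FILEREAD [NAME='trial.dat'; IMETHOD=read]
```

Under the default `FGROUPS=check`, a column with few distinct values, such as `Plot` here, may also be offered for conversion to a factor (interactive mode) or converted automatically (batch mode).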

Information may be numerical or textual. Numerical values are read as variates, and textual as texts, determined by the values in the first complete record or by the REPRESENTATION parameter. If this parameter is unset, FILEREAD searches for the first record in the file with no missing values, and fails if there is no such record. If the REPRESENTATION parameter is set, it determines whether the values of each structure are to be treated as numbers or characters; if set for any structure, this parameter must be set for all of them.

The NAME option of the procedure supplies the name of the file, enclosed in single quotes. In batch mode the name must be supplied, but in interactive mode, FILEREAD will prompt for the name if it is not supplied.

The IMETHOD option controls the specification of identifiers for the structures to be read. With the default, IMETHOD=supply, the identifiers can be listed using the IDENTIFIER parameter, one for each column of the data matrix. If IDENTIFIER is not set when running in interactive mode, FILEREAD will prompt for identifiers; if it is unset when running in batch mode, FILEREAD just reports on the contents of the file, unless option ISAVE is set (see below).

If IMETHOD=read, FILEREAD will attempt to read identifiers for the data structures from the first complete record in the file (and the IDENTIFIER parameter is ignored). They must be valid Genstat identifiers, and must not include suffixes. If an exclamation mark is found after (or in) an identifier, then the structure will be set up as a factor unless the FGROUPS parameter is set to leave. (This convention matches that used when data is read into a Genstat spreadsheet using menus.)

If IMETHOD=none, FILEREAD just reports on the contents of the file without assigning identifiers unless option ISAVE is set.

The ISAVE option can be set to a pointer to store the identifiers read from the file (if IMETHOD=read) or supplied interactively (if IMETHOD=supply). If IMETHOD=none in either mode, or IMETHOD=supply and the IDENTIFIER parameter is unset in batch mode, the data structures can be referred to using the pointer.
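A minimal sketch of using the pointer, assuming a file `abc.dat` with at least two columns (the pointer name `data` is arbitrary):

```genstat
" Read without assigning identifiers; the columns are reached through the pointer "
FILEREAD [NAME='abc.dat'; IMETHOD=none; ISAVE=data]
" Print the first two columns, referred to by suffix "
PRINT data[1],data[2]
```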

Values on the same record of a file must be separated from each other by at least one space unless the SEPARATOR option is set. This option can nominate any single character to be treated as the data separator. The MISSING option specifies the character that represents missing values, and the END option specifies the string that terminates the data (the end of the file always terminates the data as well).

If the number of identifiers is not specified, the number of data structures is taken to be the number of values on the first record with no missing values. But if identifiers are supplied in the IDENTIFIER parameter, or read from the data file, it is possible to read several units of data from each record, or each unit from several records. If there are more values on the first record of data than there are identifiers, the type of each data structure can be determined only by its first value: FILEREAD will fail if any first value is missing, unless the REPRESENTATION parameter is set. If there are fewer values on the first record of data than there are identifiers, FILEREAD will fail regardless of the absence of missing values unless the REPRESENTATION parameter is set.
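For example, suppose a hypothetical file `pairs.dat` holds two units on each record:

```
1.2 low 3.4 high
5.6 low 7.8 high
:
```

Supplying only two identifiers tells FILEREAD that each record contains two units, so a call such as the following sketch would be expected to give `A` the values 1.2, 3.4, 5.6, 7.8 and `B` the values low, high, low, high:

```genstat
" Two identifiers, four values per record: each record holds two units "
FILEREAD [NAME='pairs.dat'] A,B
```

Because the first record has more values than there are identifiers, the type of each structure is taken from its first value (1.2 for `A`, low for `B`).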

By default, FILEREAD reports what structures are set up and tabulates the number of values in each category for structures that have MAXCATEGORY or fewer distinct values. It also displays any comments that it identifies before the start of the data, and the first record of data that contains no missing values. These four reports are controlled by the PRINT option.

The FGROUPS parameter allows structures to be formed automatically into factors. The default setting is check: in interactive mode, FILEREAD then prompts for a decision about any structure where the number of distinct values is less than or equal to the setting of the MAXCATEGORY option; in batch mode, all structures with MAXCATEGORY or fewer distinct values become factors automatically. FGROUPS can also be set to form or leave to specify explicitly whether each structure should or should not be defined automatically as a factor. (The settings form and leave were introduced in Procedure Library PL21 to remove the confusion arising from the fact that other options and parameters that have no as a setting use no as their default. However, for compatibility with earlier programs, the settings yes and no are still recognised as synonyms for form and leave.)
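The settings can be mixed across structures. The following sketch assumes a three-column file `abc.dat`; the identifiers `Block`, `Dose` and `Response` are hypothetical:

```genstat
" Block is always formed into a factor; Response is never converted;      "
" Dose is checked against MAXCATEGORY, here lowered from 10 to 4          "
FILEREAD [NAME='abc.dat'; MAXCATEGORY=4] Block,Dose,Response; \
         FGROUPS=form,check,leave
```

In batch mode, `Dose` would then become a factor only if it has four or fewer distinct values; in interactive mode FILEREAD would ask.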

The COMMENTSYMBOLS option can be set to a list of single characters, in quotes. If any of these characters is found at the start of a record, before any data has been read, that record will be treated as a comment. By default, the double-quote symbol is the only comment symbol, and it must appear at the start of every record that is to be treated as a comment.

The SKIP option allows records at the start of the file to be skipped altogether. It can be set either to the number of records to be skipped, or to a string, indicating that all records are to be skipped up to and including the first record containing that string.

Options: PRINT, NAME, END, MISSING, SKIP, MAXCATEGORY, COMMENTSYMBOLS, IMETHOD, ISAVE, SEPARATOR.

Parameters: IDENTIFIER, FGROUPS, REPRESENTATION.

Method

The file is opened on the first free input channel. The first record is read as a single string, and then individual items are read from the string into a text. This is tested, and the process repeated until a record has been found that is not blank or a comment, and has no missing items. Items are tested to determine if they are valid numbers, and then the whole file is read into variates and texts as appropriate. Each structure is grouped to provide information about numbers of categories.

Example

CAPTION 'FILEREAD example',\
        !t('No example can be provided for FILEREAD because',\
        'it needs an external datafile. However, the examples',\
        'below show ways of calling the procedure.'); STYLE=meta,plain
SCALAR  Chan
ENQUIRE Chan; FILETYPE=output; OUTSTYLE=Style
OUTPUT  [STYLE=plain]
PRINT   [SQUASH=yes] '\
  FILEREAD
     - In interactive mode, the procedure will prompt for a file
       name and identifiers, and whether to turn into factors any
       structures found to have 10 or fewer categories.
       This statement will fail in batch mode.
  FILEREAD [NAME=''abc.dat'']
     - In interactive mode, as above for file abc.dat.
       In batch mode, report the contents of file abc.dat; however the
       data cannot then be referred to as no identifiers are given.
  FILEREAD [NAME=''abc.dat''; IMETHOD=read; ISAVE=data]
     - In either mode, read identifiers from the first complete data
       record, then deal with factors as above. The data can be referred
       to either by the identifiers read, or as data[1], data[2] and so on.
  FILEREAD [PRINT=*; NAME=''abc.dat''; SKIP=5; COMMENT=''!''] A,B,C
     - Read the data in file abc.dat into variates or text structures
       called A, B and C, without any reports. The first five records are
       skipped, and any subsequent records beginning with exclamation mark
       until an uncommented record with data is found. Formation of factors
       is dealt with as above.
  FILEREAD [NAME=''abc.dat''] X,Y,F,Z; FGROUPS=leave,leave,form,leave
     - Read the data in file abc.dat into data structures called X, Y, F
       and Z, redefining F to be a factor.
  FILEREAD [NAME=''abc.dat''; SEPARATOR='',''] A,B; REPRESENT=characters
     - Read the data in file abc.dat into data structures called A and B,
       assuming that values on the same record are separated by commas,
       and that both structures are to be texts. Each record may contain
       any number of values of the structures, as long as they are in
       parallel: that is, one for A, then one for B, and so on.
  FILEREAD [NAME=''abc.dat''; SKIP=''DATA3''; END=''DATA''] A,B
     - Read the data in file abc.dat into data structures called A and B,
       starting from the first record after the record containing the string
       DATA3, and finishing with the next record containing the string DATA.
       '; JUST=left; JUST=left; SKIP=0
OUTPUT [STYLE=#Style]
SKIP   [FILETYPE=output] 1

Updated on March 8, 2019
