IMPORT procedure

Reads data from a foreign file format, and loads it or converts it to a spreadsheet file (D.B. Baird).

Options

`PRINT` = string token	What information to print (`catalogue`, `summary`); default `cata`
`OUTTYPE` = string token	Output file type (`GEN`, `GSH`, `GWB`, `XLS`, `XLSX`, `TXT`, `CSV`, `SHEETS`); default `GWB`
`METHOD` = string token	Whether to load data into the Genstat server after creating the file, or whether merely to create the file (`create`, `load`); default `load`
`IMETHOD` = string token	How identifiers are to be specified for the columns (`read`, `supply`, `none`); default `supply` if `COLUMNS` is set (and specifies names rather than just types), otherwise `read`
`ENDSTATEMENT` = string token	Ending statement for a type `GEN` output file (`return`, `endbreak`); default `retu`
`SPSSMV` = string token	What to do with SPSS missing value codes (`ignore`, `convert`); default `conv`
`MISSING` = text	What labels represent missing values in Excel, Quattro or Lotus files; default `'*'`
`FORDER` = string token	The order in which to define the labels or levels of a factor read in (`sorted`, `unsorted`); default `sort`
`TEXTCONVERSION` = string token	How to convert text to numbers for the columns (`strict`, `single`, `common`, `standard`, `lax`); default `stan`
`KEEPEMPTY` = string tokens	Whether to retain any empty rows or columns found in the data (`rows`, `columns`, `none`); default `none`
`NAMEROW` = scalar	The row number within an Excel or Quattro spreadsheet which contains the column names (`IMETHOD` must be unset or set to `read`); default, the first row in `CELLRANGE`
`EMETHOD` = string token	Whether to read column descriptions/extra from Excel, SigmaPlot or Quattro spreadsheets (`read`, `none`); default `none`
`EXTRAROW` = scalar	The row number within an Excel or Quattro spreadsheet which contains the column descriptions (`EMETHOD` must be set to `read`); default, the second row in `CELLRANGE`
`PREFIX` = text	The string with which to prefix numerical column names; default `'%'`
`TEMPMISSING` = string token	Whether to read in temporarily missing values as missing (`yes`, `no`); default `no`
`INOPTIONS` = text	Optional input file arguments to be passed to the `Dataload.dll`
`OUTOPTIONS` = text	Optional output file arguments to be passed to the `Dataload.dll`
`RGBMETHOD` = string token	How to read colour values (`combined`, `separate`, `matrix`); default `sepa`
`SEPARATORS` = text	Alternative separators to use in text or csv files
`SCOPE` = string token	Whether to create the data locally in a procedure that is using `IMPORT`, or globally in the whole program (`local`, `global`); default `loca`
`IPREFIX` = text	Prefix to use with unnamed columns, default `'C'`
`TRANSPOSE` = string token	Whether to transpose the rows and columns of the input file (`yes`, `no`); default `no`
`UNICODE` = string token	What to do with Unicode characters found e.g. in an Excel `XLSX` input file (`utf8`, `typeset`, `ascii`, `remove`); default `utf8`
`COLUNICODENAMES` = string token	How to convert Unicode column names (`suffix`, `extra`, `ignore`) default `suff`
`UNINAME` = text	Name of the pointer for Unicode column names used as suffixes; default ‘`C`‘
†`XLSXCONTENT` = string tokens	What content to read from an Excel XLSX file (`values`, `formulae`, `forecolour`, `backcolour`, `fontname`, `style`, `size`); default `valu`

Parameters

`FILE` = texts	Input file or URL to be read
`OUTFILE` = texts	Name of the output file to be created; if this is not provided a temporary file will be created, and then deleted if the data is loaded
`SHEETNAME` = texts or scalars	Name of a spreadsheet worksheet or named range, or number of a worksheet within the file; default is the first sheet in the file
`CELLRANGE` = texts	Cell range within a worksheet, giving the top left and bottom right cell in the format `XXNN:XXNN` where `XX` = `A` – `IV`, `NN` = 1 – 64384; default `*` requests all data on the sheet
`COLUMNS` = texts	Names and/or type codes for the columns read (the type of column can be forced by ending the column name, if supplied, with the code `!` for a factor, `#` for a variate, and `$` for a text), using a name of `'*'` will cause a column to be dropped
`ISAVE` = pointers	Saves the identifiers of the columns
`START` = texts	Contents of a cell in a spreadsheet file or a line in a text file from which to start reading
`END` = texts	Contents of a cell in a spreadsheet file or a line in a text file at which to end reading
`ANCILLARY` = texts	Extra information returned by some file formats (currently only population type from QTL location files)
`ROWSELECTION` = variates	Numbers of the rows to import; if unset, all rows are imported
`COLSELECTION` = variates or texts	Numbers or names of the columns to import; if unset, all the columns are imported

Description

The name of the file, containing the data values to be imported, is specified by the FILE parameter. This can also be an internet URL prefixed with http://, https:// ftp:// or file://. The data source is then downloaded and imported.

Data in the supported file formats are extracted and saved in the specified file format, depending on the extension of OUTFILE. If this is not provided, the type is indicated by the OUTTYPE option, as either GEN (Genstat Command file), GSH (Genstat Spreadsheet), GWB (Genstat Spreadsheet Book), XLS (Excel 5 Spreadsheet), XLSX (Excel 2007 Spreadsheet), TXT (ASCII Text file) or CSV (comma-delimited file); the default is GSH. Setting OUTTYPE=SHEETS reads in the worksheet names in a spreadsheet file (Excel/Quattro/Sigmaplot or SAS Transport) into a text named Worksheets. The ENDSTATEMENT option specifies the ending statement type for a type GEN output file: either RETURN (the default) or ENDBREAK). You can set ENDSTATEMENT=* if you do not want an ending statement.

The PRINT option controls printed output, with the following settings:

`catalogue`	lists the contents of the file (default); and
`summary`	prints a summary of the values in each data structure in the file.

If METHOD=load, the resulting file is read in to Genstat data structures. When IMPORT is used within a procedure, the SCOPE option controls whether the structures are created locally in the procedure (default), or globally in the main program.

In spreadsheet files (Excel, Quattro, 123, SigmaPlot), the SHEETNAME and CELLRANGE parameters can be used to read in just a specified section of the data in the file. If CELLRANGE specifies only the starting cell, IMPORT reads all columns the from the given column onwards, and all rows from the given row downwards. For example, CELLRANGE='C8' reads columns C, D… onwards, and rows 8, 9… downwards, until the end of the data in the sheet. The COLUMNS parameter can be used to set the names and types of the structures (see below).

In spreadsheet files, the data that are extracted are labels, numerical values and the results from formulae. A label of * in an otherwise numerical column is taken as a missing value, unless one or more different missing value markers are specified with the MISSING option. Empty cells are taken as missing values. Empty rows at the start, middle and end of a block are removed. Empty columns are ignored by default; you can set the option KEEPEMPTY=rows or KEEPEMPTY=columns to retain empty rows or columns respectively, or KEEPEMPTY=rows,columns to keep both.

The IMETHOD option indicates how identifiers are to be specified for the columns, with the following settings:

`read`	assumes that the names are in the first non-empty row of data;
`supply`	assumes that the names are supplied by the `COLUMNS` parameter, but uses default names if they are not; and
`none`	uses default names; and
`overlay`	uses the names from the `COLUMNS` parameter, or from the first non-empty row of data for any of those names that is blank.

If IMETHOD=read, and a column name cell contains a numerical value rather than a label, the column name is set to the numerical value prefixed with a % character. The prefix can be changed using the PREFIX option: a column named '15' is given the name %15 by default but, if PREFIX='X', the name would be X15.

The default for IMETHOD is to take the names from the COLUMNS parameter, if this is set and it contains names. Otherwise IMPORT looks for names in the data file (as with the read setting).

The default column names have the prefix C and an integer number (i.e. C1, C2 etc.), but you can supply your own prefix using the IPREFIX option.

Using the COLUMNS parameter, the type of a column can be forced by providing a !, # or $ character on the end of the text items provided for the column names. A string '*' can be given as a name in COLUMNS, to remove a column from the data read in. If only a single type character is given, only the types of the columns (and not their names) are changed. The extension :D on a column name specifies that the values are to be read as dates. Similarly, when the column names are being read from a spreadsheet, their types can be specified by using ! for a factor, # for a variate, $ for a text and :D for a date.

The option FORDER controls the order in which that the labels or levels of a factors are stored. with the default, FORDER=sorted, the levels are stored in ascending numerical order, and the labels are stored in alphabetical order. Alternatively, if FORDER=unsorted the levels and labels are stored in the order in which they are first met in the column.

The TEXTCONVERSION option controls how labels are converted to numbers in a column marked as a variate:

`strict`	only labels that contain numeric data only are converted (e.g. `'10'` becomes 10; `'1O'` becomes `*`)
`single`	a single character substitution is read as a number (`o` or `O` become 0; `i`, `I`, `l` or `L` become 1; `s` or `S` become 2; `z` or `Z` become 5; comma becomes decimal point)
`common`	multiple substitutions as in single are made (e.g. `'Io'` becomes 10; `'23X'` becomes `*`)
`standard`	as in `common` but extra text is ignored at the end of the number (e.g. `'23X'` becomes 23; `'A2X3'` becomes `*`)
`lax`	any digits are read from the text (e.g. `'A2X3'` becomes 23).

You can set option EMETHOD=read, to read a row of column descriptions/extra from a spreadsheet file. By default, this row is taken as the second row in CELLRANGE. The EXTRAROW option can be used to modify the row form which the description is read. The row number is relative to the start of the cell range, unless a negative row number is provided; the descriptions are then read from the row in the spreadsheet, corresponding to the absolute value of the specified row number. If EXTRAROW=1, the column names are read from the second row.

The START parameter can supply a text to indicate where to start reading within a spreadsheet or text file. In a spreadsheet file (Excel, Quattro, Lotus), the cells from A1 are searched row by row, until a label is found that matches the text. Only cells below and to the right of this cell are then imported. The text could thus be the name of the first variable to be read. Note that the text must not contain spaces or the division symbol (/). Similarly the END parameter can supply a text to indicate where to stop reading a spreadsheet or text file.

The TEMPMISSING option controls the input of temporarily missing values. These are values that have been set to missing temporarily in the spreadsheet, and for which the original (non-missing) values are still available. The default is to read the original values, but you can set TEMPMISSING=yes to read them as missing values instead.

The INOPTIONS and OUTOPTIONS options allow extra options to be passed to Dataload.dll. For example: setting INOPTIONS='/k' keeps leading and trailing and doubled blanks in strings, OUTOPTIONS='/u' creates undecorated names in a CSV file (i.e. 'Factor', rather than 'Factor!'), OUTOPTIONS='/c' combines the three columns Red, Green and Blue in a BMP file into a single column RGB, and INOPTIONS='/m' loads the data as a matrix rather than as separate columns.

The RGBMETHOD option controls how to represent colour values from image files (JPG, GIF, TIF or PNG). The default setting, combined, stores an RGB value in a single column in the same form as generated by the RGB function. The separate setting creates three columns containing the red, green and blue values, respectively. Finally, the matrix setting puts the RGB values into a matrix.

The ROWSELECTION and COLSELECTION parameters allow you to import only a subset of the rows or columns, respectively, in the file. They can be set to a variate containing the numbers of the rows or columns. With COLSELECTION, you can also supply a text containing column names. So, for example, to import only rows where the variate X is greater than zero, you could put

ROWSELECTION = WHERE(X.GT.0)

(the WHERE function gives the unit numbers where a logical expression has the value one i.e. true). Note that the variate X must already have been imported into Genstat, but you could import this column on its own using COLSELECTION. If ROWSELECTION (or COLSELECTION) are unset, all the rows (or columns) are imported.

The UNICODE option controls what happens to Unicode characters that are not part of the extended ASCII character set. These may occur, for example, in Excel XLSX files. The default setting, utf8, converts them into the UTF-8 format. In this format, the ASCII characters are stored in the usual way, in a single byte (of eight binary bits). More complicated characters, such as Chinese and Thai characters, require up to four bytes. UTF-8 characters cause no problems with most of the Genstat commands. The commands that cannot handle them, for example EDIT, issue a VA-43 fault. The remove setting removes these from the input. The ascii option converts them to the nearest matching ASCII character. The typeset option converts those that can be represented by Genstat typesetting strings by these strings: for example, α would be converted to ~{alpha}, and √ would be converted to ~{sqrt}. The correspondence between the Greek, ASCII and type-setting commands is shown in the table below. The capitals have the same correspondence. Extended Greek and Latin letters have their accents removed, as there are no type-setting commands for these. Some symbols like ♂ and ♀ are converted to their text equivalent (male and female).

α	a	~{alpha}	ι	I	~{iota}	ρ	r	~{rho}
β	b	~{beta}	κ	k	~{kappa}	σ	s	~{sigma}
γ	g	~{gamma}	λ	l	~{lambda}	τ	t	~{tau}
δ	d	~{delta}	μ	m	~{mu}	υ	u	~{upsilon}
ε	e	~{epsilon}	ν	n	~{nu}	φ	f	~{phi}
ζ	z	~{zeta}	ξ	c	~{xi}	χ	x	~{chi}
η	h	~{eta}	ο	o	~{omicron}	ψ	y	~{psi}
θ	q	~{theta}	π	p	~{pi}	ω	w	~{omega}

The COLUNICODENAMES option controls how column names that contain Unicode characters are used. With the default setting, suffix, a pointer is defined to hold any columns with Unicode in their names, and the column names provide its suffix labels. The name of the pointer is specified in a text by the UNINAME option (default ‘C‘). The extra setting uses the default names for the columns, and the column names from the file are used as extra texts. It also sets the IPRINT attribute of the columns to extra, so that these are printed instead of the default identifiers. (You can modify this to print the default idendifiers instead, by using the Identifying information used in output list box for those columns in the spreadsheet Column Attributes/Format menu.) The ignore setting removes the Unicode characters from the name.

The XLSXCONTENT option specifies the content to import from an Excel XLSX file: values, formulae, foreground colour, background colour, font name, style or size. The default is to read only the values.

(Note: IMPORT replaces the procedure DATALOAD from earlier editions of Genstat.)

Options: PRINT, OUTTYPE, METHOD, IMETHOD, ENDSTATEMENT, SPSSMV, MISSING, FORDER, TEXTCONVERSION, KEEPEMPTY, NAMEROW, EMETHOD, EXTRAROW, PREFIX, TEMPMISSING, INOPTIONS, OUTOPTIONS, RGBMETHOD, SEPARATORS, SCOPE, IPREFIX, TRANSPOSE, UNICODE, COLUNICODENAMES, UNINAME, XLSXCONTENT.
Parameters: FILE, OUTFILE, SHEETNAME, CELLRANGE, COLUMNS, ISAVE, START, END, ANCILLARY, ROWSELECTION, COLSELECTION.

Method

The request is passed to the DATALOAD.DLL library which reads the foreign file and returns any valid data found in a temporary GEN or GSH file. The following file types are supported: Excel 2-5, 95, 97, 2000, XP, 2003, 2007-2013, Open Office, Lotus WK1, Quattro (WQ1, WB*, QPW), dBase 2-5, Paradox 3-9, Genstat GSH and GWB, SAS PC 6.03-12, 7-9, SAS Transport, SAS JMP, Minitab 8-17, Statistica 5 and 6, Systat, MStat, Instat, Epi-Info, SPSS/Win, Gauss Data/Matrix (PC/Win/Unix), MatLab, S+ (PC/Unix), Stata 4-8, StatGraphics, R data frames, Weka Attribute files, SigmaPlot 7-9, OSIRIS, Limdep, Comma delimited text files (*.CSV), Cornell Ecology format, MapQTL trait files (.QUA), ArcView/Info Shapefiles, MapInfo Exchange files, Windows Bitmap (*.BMP), Windows Sound (*.WAV), NMR Binary files and image files (JPG, GIF, TIF, PNG). The file type is worked out from the file contents, so the usual extension need not be used with the exception of the following file types which do not contain a unique signature: Epi-Info (.REC), S+ (.SDD) and Paradox (.DB). Any files not containing a unique file signature, but ending in these extensions will be classified as above. Any other file extensions will attempted to be read as a comma, space or tab delimited text file.

There is a known problem that using the OUTTYPE=GEN inside a FOR loop (or another other procedure) ties up input channels until exiting the FOR loop. Thus it may exhaust the available input channels. Either use the OUTTYPE=GSH or set LOAD=no and write code to input the files created outside the loop (you will need to provide an output file name to do this).

Action with `RESTRICT`

Restrictions are not applicable to any of the parameters.

Example

CAPTION 'AUDISPLAY example',\
        'Data from Genstat 5 Release 1 Reference Manual, page 340.';\
        STYLE=meta,plain
FACTOR  [NVALUES=36; LEVELS=3; VALUES=12(1...3)] Block
FACTOR  [NVALUES=36; LABELS=!t(baresoil,emerald,emergo)] Leachate
&       [LABELS=!t('1','1/4','1/16','1/64')] Dilution
VARIATE [NVALUES=36] Nhatch,Nnohatch
READ    Leachate,Dilution,Nhatch,Nnohatch
  1           2         109         318
  3           4          54         350
  3           1           *         415
  2           2         783         212
  3           3         652        1375
  2           4         490         816
  1           3          95        1219
  2           1        1012          66
  1           4         166         943
  3           2        1059         313
  1           1         257        1006
  2           3        1058         234
  2           4         507        1119
  1           2         194         840
  1           3         175        1707
  1           1         326         609
  3           4         142         980
  2           3         286         230
  3           2         546         313
  2           2           *         301
  2           1        2471         112
  3           3          76         489
  1           4         208         503
  3           1           *         325
  1           1         322         913
  1           2         255        2246
  3           2        1774        1446
  2           2         999         193
  2           4         388        1836
  3           4         221        1800
  1           3         220        1902
  2           1        2821         187
  3           1        1486         463
  3           3         717        1473
  1           4         143         941
  2           3         968         550 :
CALCULATE          Logit%h = LOG(Nhatch/Nnohatch)
BLOCKSTRUCTURE     Block
TREATMENTSTRUCTURE Leachate*Dilution
AUNBALANCED        [PRINT=*] Logit%h
AUDISPLAY

Updated on May 10, 2023

Was this article helpful?

Yes No