JOIN procedure

Joins or merges two sets of vectors together, based on the values of sets of classifying keys (C.F. Johnston & D.B. Baird).

Options

`NINDEX` = scalar	Number of index vectors in structures (up to 10); default 1
`METHOD` = string token	Type of join (`inner`, `left`, `right`, `full`); default `full`
`REPEATS` = string token	How to handle repeats of matches (`combinations`, `single`); default `sing` outputs one row per match
`INCLUDE` = string token	How to handle restrictions on the input vectors (`all`, `nonrestricted`); default `all` uses all the data rows
`SORT` = string token	Whether `NEWVECTORS` should be sorted on the index vectors (`ascending`, `descending`, `unsorted`); default `unsorted` keeps the same ordering as the input sets

Parameters

`LEFTVECTORS` = pointer	Pointer to a list of vectors in left set (keys and variables)
`RIGHTVECTORS` = pointer	Pointer to a list of vectors in right set (keys and variables)
`NEWVECTORS` = pointer	Pointer to a list of output vectors (keys and variables)

Description

This procedure can be used to produce a set of NEWVECTORS, which is the result of joining (or merging) two sets according to index (or key) vectors in each set. JOIN supports SQL style joins, as well as merges, as implemented in Genstat for Windows, SAS and SPSS.

The number of index vectors is given by the NINDEX option (up to 10). Each of LEFTVECTORS and RIGHTVECTORS is a pointer to NINDEX keys followed by any number of extra vectors. The NEWVECTORS parameter is a pointer to NINDEX keys followed by the total number of non-index vectors in the two input sets. The output order in NEWVECTORS will be the combined keys from the left and right sets, then the non-index vectors from the left set, followed by the non-index vectors from the right set. You need not have declared the pointer already; it will be declared automatically if necessary. The vectors may be variates, factors or texts. Warnings are given if the types of index vectors in each set do not match, although a factor can be matched with a text. Attempting to match a text with a factor or variate will result in a fault.

The METHOD option controls the type of join and determines which rows from each input set will be output. METHOD=inner outputs only those rows where the keys from both sets match. METHOD=left outputs all rows from the LEFTVECTORS set and only those rows from RIGHTVECTORS where the keys from both sets match. METHOD=right outputs all rows from the RIGHTVECTORS set and only those rows from LEFTVECTORS where the keys from both sets match. METHOD=full outputs all rows from both sets. Where keys do not match, missing values are inserted into the non-index vectors from the set without that key value.

The REPEATS option determines what happens when both input sets have repeats of the same matching key values. REPEATS=single outputs one row for each match, so that if there are M repeats in LEFTVECTORS and N repeats in RIGHTVECTORS, MAX(M,N) rows will be output. This is the same behaviour as the merge statements of SAS and SPSS and the Merge Spreadsheets menu of Genstat for Windows. REPEATS=combinations outputs all combinations of the repeats, giving M*N rows. This is equivalent to an SQL join and may produce very large output sets.

The INCLUDE=nonrestricted option allows the use of restrictions on the vectors in each input set to be used to subset the rows. The SORT=unsorted option allows the resulting vectors to be returned in the original order of the input data set, or sorted on the key vectors in either ascending or descending direction.

Note: this procedure may take some time to complete for joins of large data sets. You should also ensure the data space is large enough for the resultant vectors, especially if using the option REPEATS=combinations.

Options: NINDEX, METHOD, REPEATS, INCLUDE, SORT.

Parameters: LEFTVECTORS, RIGHTVECTORS, NEWVECTORS.

Method

The LEFTVECTORS and RIGHTVECTORS are sorted by the index variables. If there are restrictions on the vectors in either input set, this is used to subset the input vectors if INCLUDE=nonrestricted. For each of the rows of the two sets, the keys are compared and output rows are appended according to the METHOD option. If repeats of the same matching key values in both sets occur and REPEATS=combinations the procedure loops through all combinations of the matching rows. The resulting vectors are then sorted into the order specified in the SORT option.

Action with `RESTRICT`

Any of the input vectors may be restricted. If INCLUDE=nonrestricted, only those rows which are not excluded by any restriction on vectors in each input set will be processed, otherwise the restrictions will be ignored.

Example

CAPTION  'JOIN example'; STYLE=meta
VARIATE  [NVALUES=7] K11,K12,V11
TEXT     [NVALUES=7] T13
READ     K11,K12,V11,V12,T13
1 1   1 6.2 'Red'
1 1   2 5.7 'Green'
1 1   3 4.5 'Blue'
1 2   4 7.3 'Red'
2 1   5 4.1 'Yellow'
3 2   6 5.1 'Blue'
3 3   7 1.9 'Black' :
VARIATE [NVALUES=5] K21,K22,V21
READ    K21,K22,V21
1 1   1
1 1   2
2 1   3
2 3   4
5 2   5 :
PRINT    K11,K12,V11,V12,T13; DECIMALS=3(0),1,0
&        K21,K22,V21; DECIMALS=0
JOIN     [NINDEX=2; METHOD=inner; REPEATS=single]\ 
         LEFT=!p(K11,K12,V11,V12,T13); RIGHT=!p(K21,K22,V21);\ 
         NEW=!p(K1,K2,V1,V2,T3,V4)
PRINT    K1,K2,V1,V2,T3,V4; FIELDWIDTH=8; DECIMALS=0,0,0,1,0,0
JOIN     [NINDEX=2; METHOD=left; REPEATS=single]\ 
         LEFT=!p(K11,K12,V11,V12,T13); RIGHT=!p(K21,K22,V21);\ 
         NEW=!p(K1,K2,V1,V2,T3,V4)
PRINT    K1,K2,V1,V2,T3,V4; FIELDWIDTH=8; DECIMALS=0,0,0,1,0,0
JOIN     [NINDEX=2; METHOD=right; REPEATS=single]\ 
         LEFT=!p(K11,K12,V11,V12,T13); RIGHT=!p(K21,K22,V21);\ 
         NEW=!p(K1,K2,V1,V2,T3,V4)
PRINT    K1,K2,V1,V2,T3,V4; FIELDWIDTH=8; DECIMALS=0,0,0,1,0,0
JOIN     [NINDEX=2; METHOD=full; REPEATS=single; SORT=descending]\ 
         LEFT=!p(K11,K12,V11,V12,T13); RIGHT=!p(K21,K22,V21);\ 
         NEW=!p(K1,K2,V1,V2,T3,V4)
PRINT    K1,K2,V1,V2,T3,V4; FIELDWIDTH=8; DECIMALS=0,0,0,1,0,0
JOIN     [NINDEX=2; METHOD=full; REPEATS=comb; SORT=ascending]\ 
         LEFT=!p(K11,K12,V11,V12,T13); RIGHT=!p(K21,K22,V21);\ 
         NEW=!p(K1,K2,V1,V2,T3,V4)
PRINT    K1,K2,V1,V2,T3,V4; FIELDWIDTH=8; DECIMALS=0,0,0,1,0,0
RESTRICT K11; K11 < 3
JOIN     [NINDEX=2; METHOD=full; REPEATS=comb; INCLUDE=nonrestricted]\ 
         LEFT=!p(K11,K12,V11,V12); RIGHT=!p(K21,K22,V21);\ 
         NEW=!p(K1,K2,V1,V2,V3)
PRINT    K1,K2,V1,V2,V3; FIELDWIDTH=8; DECIMALS=0,0,0,1,0
RESTRICT K21; K21 < 3
JOIN     [NINDEX=1; METHOD=full; REPEATS=single; INCLUDE=all] \
         LEFT=!p(K11,V11,V12,T13); RIGHT=!p(K21,V21);\ 
         NEW=!p(K1,V1,V2,T3,V4)
PRINT    K1,V1,V2,T3,V4; FIELDWIDTH=8; DECIMALS=0,0,1,0,0

Updated on March 7, 2019

Was this article helpful?

Yes No