Random Forests
Leo Breiman and Adele Cutler

Random Forests(tm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software.
Our trademarks also include RF(tm), RandomForests(tm), RandomForest(tm) and Random Forest(tm).

Contents

Introduction
Standard control options
Line 1: Describe data
Line 2: Set run parameters
Line 3: Set importance options
Line 4: Set proximity computations
Line 5: Set options based on proximities
Line 6: Replace missing values
Line 7: Visualization
Line 8: Saving a forest
Line 9: Running new data down a saved forest
Customized options
Input data files
Categorical input variables
Class weights
Using a prespecified subset of input variables
Using the most important input variables
Using a saved forest
Output control options
File names
Example settings - satimage data

Introduction

Probably the best way to learn how to use the random forests code is to study the satimage example. The options are documented below.

Standard control options

Unless otherwise noted, setting an option to 0 turns the feature "off".

Line 1: Describe data

mdim = number of variables (features, attributes).
nsample0 = number of cases (examples or instances) in the training set.
nclass = number of classes.
maxcat = the largest number of values assumed by a categorical input variable. If there are no categorical variables set maxcat=1.
ntest = the number of cases in the test set. NOTE: Put ntest=0 if there is no test set.
labelts = 0 if the test set has no class labels, = 1 if the test set has class labels.
labeltr = 0 if the training set has no class labels, = 1 if it has class labels. If the training set has no class labels, it will be treated in unsupervised mode.
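
For example (hypothetical values), a dataset with 10 input variables, 500 training cases, 3 classes, no categorical variables, and a labeled 200-case test set would be described by the first lines of the parameter block (the full block appears in the satimage example at the end of this manual):

        parameter(
c               DESCRIBE DATA
     1          mdim=10, nsample0=500, nclass=3, maxcat=1,
     1          ntest=200, labelts=1, labeltr=1,

The remaining lines of the parameter block are set as described in the sections below.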

Line 2: Set run parameters

mtry0 = the number of variables to split on at each node. Default is the square root of mdim. ATTENTION! DO NOT USE THE DEFAULT VALUE OF MTRY0 IF YOU WANT TO OPTIMIZE THE PERFORMANCE OF RANDOM FORESTS. TRY DIFFERENT VALUES: GROW 20-30 TREES AND SELECT THE VALUE OF MTRY THAT GIVES THE SMALLEST OOB ERROR RATE. (A sketch of this search follows this list.)
ndsize = minimum node size to split. Default is one.
jbt = the number of trees to be grown in the forest.
look = error estimate is printed to the screen every "look" trees, and at the end of the run.
lookcls = 1: error estimates for individual classes are printed to the screen every "look" trees, and at the end of the run.
jclasswt = 0: all classes receive the same weight. jclasswt = 1 allows weights that adjust error rates between classes.
mdim2nd = k>0 automatic 2nd run using the k most important variables from the first run.
mselect = 1 allows a hand-picked selection of variables to be used in the run.
iseed controls the random number generator (use any nonzero integer).
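
As a sketch of the mtry search recommended above (all values illustrative): with mdim=36 the default is mtry0=6, so one might grow 25 trees at each of several candidate values, changing only mtry0 between runs, and then grow the full forest with the value that gave the smallest oob error rate.

c               SET RUN PARAMETERS (short tuning run; repeat with mtry0=3, 6, 12, 24)
     2          mtry0=12, ndsize=1, jbt=25, look=25, lookcls=1,
     2          jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,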

Line 3: Set importance options

imp =1 turns on the variable importance computation.
interact = 1 turns on the interaction computation between pairs of input variables.
impn =1 turns on the variable importance computation for each individual case.
impfast = 1 turns on a fast way of computing importance. (But not the preferred measure).

Line 4: Set proximity computations

nprox = 0, 1 or 2. If nprox=1, the proximity measures between each training case and its nrnn most proximate cases in the training set are computed. If nprox=2, the proximity measures between each test set case and its nrnn most proximate cases in the training set are also computed. If either noutlier or mfixrep is greater than or equal to one, then nprox must be 1 or 2.

Line 5: Set options based on proximities

noutlier = 1 computes the outlier measure for each case in the training set and noutlier = 2 also computes the outlier measure for each case in the test set. (Outlier detection requires nprox >0).
nscale = k>0 computes the first k canonical coordinates used in metric scaling.
nprot = k>0 computes k prototypes for each class.

Line 6: Replace missing values

code = kkk: every time the value kkk appears in the data file, it is treated as a missing value.
missfill = 1 or 2: replaces missing values in the training set, with a fast replacement (if equal to 1) or a more careful one (if equal to 2).
mfixrep= k with missfill=2 does a slower, but usually more effective, replacement using proximities with k iterations on the training set only. (Requires nprox >0).
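
For instance (illustrative values), if missing values are coded as -999 in the data file and a careful proximity-based replacement with 5 iterations is wanted, the relevant parameter lines would read (mfixrep requires nprox to be at least 1):

     4          nprox=1, nrnn=5,
     ...
     6          code=-999, missfill=2, mfixrep=5,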

Line 7: Visualization

iviz = 1 to create output files for input to the java graphics. Requires impn=1, imp=1, nscale=3 and nprox=1.
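
For example, a run producing the files for the java graphics would set at least the following options together (nrnn=5 is simply the value used in the satimage example; the other options may be set as desired):

     3          imp=1, interact=0, impn=1, impfast=0,
     4          nprox=1, nrnn=5,
     5          noutlier=0, nscale=3, nprot=0,
     ...
     7          iviz=1,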

Line 8: Saving a forest

isaverf = 1 to save the existing forest. (More information is given in the satimage example).
isavepar = 1 to save the run information and a comment.
isavefill = 1 to save the missing value fills. Required if the forest is to be saved and used for new data with missing values.
isaveprox = 1 to save the proximity matrix.

Line 9: Running new data down a saved forest

irunrf = 1 to read in a saved forest. (More information is given in the satimage example).
ireadfill = 1 to read in missing value fills. Required if the forest is to be used for new data with missing values.
ireadpar = 1 to read in saved parameter values. Note: if ireadpar is set equal to one, then the parameters saved are written to screen and the program stops. This facility is provided to recall the parameter values from a previous run.
ireadprox = 1 to read in a proximity matrix.

Customized options


Input data files

For a J-class problem, random forests expects the classes to be numbered 1,2, ...,J. The code to read in the training and/or test data is:

c       -------------------------------------------------------
c       READ IN DATA--SEE MANUAL FOR FORMAT
c
        open(16, file='data.train', status='old')
        do n=1,nsample0
                read(16,*) (x(m,n),m=1,mdim), cl(n)
        enddo
        close(16)
        if(ntest.gt.1) then
                open(17, file='data.test', status='old')
                do n=1,ntest0
                        read(17,*) (xts(m,n),m=1,mdim),clts(n)
                end do
                close(17)
        end if

	

To change the dataset names or to manipulate the data, edit the code as required. For the training data, always use the notation x(m,n) for the value of the mth variable in the nth case, and cl(n) for the class number (integer). For the test data, always use the notation xts(m,n) for the value of the mth variable in the nth case, and clts(n) for the class number (integer).
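
Each row of data.train therefore holds the mdim variable values followed by the integer class label, all whitespace-separated so that the list-directed reads above can parse them. A hypothetical row for mdim=4 and a case belonging to class 2 might look like:

	 5.1  3.5  1.4  0.2  2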

Categorical input variables

If there are categorical variables, set maxcat equal to the largest number of categories and then specify the number of categories for each categorical variable in the integer vector cat, which is defined in the highlighted lines of code:

c	-------------------------------------------------------
c	     SET CATEGORICAL VALUES
c
       do m=1,mdim
          cat(m)=1
       enddo
c      fill in cat(m) for all variables m for which cat(m)>1
       cat(1)= FILL IN THE VALUE 
       ... 
       cat(mdim)= FILL IN THE VALUE

For example, setting cat(5)=7 implies that the 5th variable is categorical with 7 values. Any variable with cat=1 is assumed to be continuous. Note: for an L-valued categorical input variable, random forests expects the values to be numbered 1, 2, ..., L.
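
As an illustrative fill-in: suppose mdim=10 and only variables 3 and 7 are categorical, with 4 and 7 values respectively. Then set maxcat=7 and the highlighted lines become:

       cat(3)=4
       cat(7)=7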

Class weights


If the classes are to be assigned different weights, set jclasswt=1 and fill in the desired weights in the highlighted lines of the code:
c      SET CLASS WEIGHTS
       do j=1,nclass
          classwt(j)=1
       end do
       if(jclasswt.eq.1) then
          classwt(1)= FILL IN THE VALUE
          ...
          classwt(nclass)= FILL IN THE VALUE
       end if
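
For example (hypothetical weights), to make errors on class 2 count ten times as heavily as errors on the other classes in a 3-class problem, set jclasswt=1 and fill in:

       if(jclasswt.eq.1) then
          classwt(1)=1
          classwt(2)=10
          classwt(3)=1
       end if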
	

Using a prespecified subset of input variables


Look for this text early in the program:
	if(mselect.eq.0) then
		mdimt=mdim
		do k=1,mdim
			msm(k)=k
		end do
	end if
	if (mselect.eq.1) then
		mdimt= FILL IN THE VALUE
		msm(1)= FILL IN THE VALUE
		...
		msm(mdimt)= FILL IN THE VALUE
	end if
	

If mselect = 1, mdimt is the number of variables to be used and the values of msm(1),...,msm(mdimt) specify which variables should be used.
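
For example (a hypothetical choice), to run on only variables 2, 5 and 17 of the original mdim variables, set mselect=1 and fill in:

	if (mselect.eq.1) then
		mdimt=3
		msm(1)=2
		msm(2)=5
		msm(3)=17
	end if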

Using the most important input variables


If imp = 1, and mdim2nd = k>0, then the program does a 2nd run. In the 2nd run, it uses only the k most important variables found in the first run. If there were missing values, the initial run on all mdim variables determines the fill-ins.
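
For instance (illustrative count), to grow the first forest on all variables and then automatically rerun on the 10 most important ones, the relevant parameter lines would read:

     2          jclasswt=0, mdim2nd=10, mselect=0, iseed=4351,
c
c               SET IMPORTANCE OPTIONS
     3          imp=1, interact=0, impn=0, impfast=0,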

Using a saved forest

To save the forest, set isaverf = 1. To run the saved forest on new data, set irunrf = 1 and read in the new data as a test set using the notation clts for the class label and xts for the variables. Basic parameters have to agree with the original run, i.e. nclass, mdim, jbt.

Missing values: If the data to be run down the saved forest has missing values that need to be replaced, then in the initial run set missfill=1 and isavefill=1. To run the forest on new data, set missfill=2 and ireadfill=1.

Outliers: Whether the outlier measure is computed when running new data down the saved forest is independent of whether missing values are replaced. To enable the outlier measure for new data, set isaveprox=1 in the initial run. To run the forest on the new data, set noutlier=2 and ireadprox=1.

Output control options


The options to control output are specified in the following lines of code:
c
c       -------------------------------------------------------
c       OUTPUT CONTROLS
c
        parameter(
     &          isumout =       1,      !0/1    1=summary to screen
     &          idataout=       1,      !0/1/2  1=train,2=adds test     (7)
     &          impfastout=     1,      !0/1    1=gini fastimp          (8)
     &          impout=         1,      !0/1/2  1=imp,2=to screen       (9)
     &          impnout=        1,      !0/1    1=impn                  (10)
     &          interout=       1,      !0/1/2  1=interaction,2=screen  (11)
     &          iprotout=       1,      !0/1/2  1=prototypes,2=screen   (12)
     &          iproxout=       1,      !0/1/2  1=prox,2=adds test      (13)
     &          iscaleout=      1,      !0/1    1=scaling coors         (14)
     &          ioutlierout=    1)      !0/1/2  1=train,2=adds test     (15)

The values are all set to 1 in the code, but can be altered as necessary. The comments after the exclamation points give a quick idea of what the settings are. More details are given below:

isumout can take the values 0 or 1. If it takes the value 1, a classification summary is sent to the screen.

idataout can take the values 0, 1, or 2. If it has the value 1 then the training set is written to a file in a format such that each row contains the data for one case. The columns are:

	 n,cl(n),jest(n),(q(j,n), j=1,nclass),(x(m,n),m=1,mdim) 

where n=case number, cl(n)=class label, jest(n)=predicted class for case n, q(j,n)=the proportion of votes for the jth class out of the total of all the votes for the nth case, and x(m,n) are the values of the input variables for the nth case.
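
For example, with nclass=3 and mdim=4, a single (hypothetical) row might read

	 17  2  2  0.04  0.88  0.08  5.1  3.5  1.4  0.2

meaning: case 17, labeled class 2, predicted class 2, vote proportions 0.04/0.88/0.08 for classes 1-3, followed by the four input values.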

If it has the value 2 then both the training set (in the above format) and the test set are written to the same file, training set first. If the test set has labels then the format is:

	n,clts(n),jests(n), (qts(j,n), j=1,nclass),(xts(m,n),m=1,mdim).

This is the same as the format for the training set except that it holds the corresponding values for the test set. If the test set has no labels, the format is the same except that clts(n) is deleted.

impfastout takes the values 0 or 1. If it takes the value 1 a two-column array is printed out with as many rows as there are input variables. The first column is the variable number. The second is the total gini contribution for the corresponding variable summed over all trees and normalized to make the average of the column equal to one.

impout = 0, 1, or 2. If it takes the value 1, the output is written to a file; if it takes the value 2, it is sent to the screen. The output consists of four columns with a row for each variable. The first column is the variable number. The second column is the raw importance score (discussed later). The third column is the raw score divided by its standard error, i.e. its z-score. The fourth column assigns a significance level to the z-score, assuming it is normally distributed.

interout = 0, 1, or 2. If it takes the value 1, the output is sent to a file. If it takes the value 2, the output is sent to the screen. The first output is a two-column list headed by the word "CODE". The first column consists of successive integers running from 1 to mdimt, the number of variables being used in the current run. The second column is the original variable number corresponding to the number in the first column. Then a square matrix of size mdimt on a side is printed out. The matrix contains, in the (k,m) place, the interaction between variables k and m rounded to the nearest integer (for readability). A bordering top row and first column contain the variable numbers.

iprotout = 0,1, or 2. If it takes the value 1, it writes to a file, if it takes the value 2, it writes to the screen. If nprot is set equal to k>0, it attempts to compute k prototypes for each class. If all prototypes can be computed, the results are in a matrix with 3*nclass*nprot+1 columns and mdimt+1 rows. However, it might not be possible to compute all nprot prototypes for each class, in which case there will be fewer columns. The first column contains the variable number. The next 3 columns contain the first class-1 prototype and its lower and upper "quartiles" (as described here). The next 3 columns contain the second class-1 prototype and its lower and upper "quartiles", and so on. Once the class-1 prototypes are done, the class-2 prototypes are given, and so on.

There are 3 extra rows at the top of the output. The first row gives the number of cases of that class that are closest to the prototype. The second row gives the prototype number (from 1 to nprot), and the third row gives the class.

iproxout = 0,1, or 2. If it takes the value 1, the output sent to file is a rectangular matrix with nsample rows and nrnn+1 columns:

	n, (loz(n,k), prox(n,k), k=1,nrnn)

The first is the case number. This is followed by nrnn couples. Each couple consists of, first, the number of a case that has among the nrnn largest proximities to case n, second, the value of the proximity.
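
For example, with nrnn=3, a hypothetical training-set row might read

	 12  45 0.92  7 0.88  103 0.75

meaning that the three cases most proximate to case 12 are cases 45, 7 and 103, with proximities 0.92, 0.88 and 0.75.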

If there is a test set present the training set output is followed by a rectangular matrix of depth ntest and nrnn+1 columns:

	n, (lozts(n,k), proxts(n,k), k=1,nrnn).

The first is the case number in the test set. The nrnn couples consist, first, of the number of a case in the training set that has among the nrnn largest proximities in the training set to case n, second, the value of the proximity.

iscaleout = 0 or 1. If it takes the value 1, the data about the nscale+1 scaling coordinates are output to file. The format is a rectangular matrix of depth nsample. The columns are:

	n, cl(n), jest(n), (xsc(n,k),k=1,nscale+1).

The first is the case number, the second is the labeled class, the third is the predicted class, and the next nscale+1 columns are the scaling coordinates.

ioutlierout = 0,1, or 2. If it takes the value 1, the output is a rectangular matrix of depth nsample with columns:

	n, cl(n), amax1(outtr(n),0.0)

The first is the case number. The second is the labeled class. The third is the normalized outlier measure truncated below at zero. If it takes the value 2, then additional output about outliers in the test set is added. This consists of a matrix of depth ntest and columns:

	n, jests(n), amax1(outts(n),0.0)

The difference from the above is that the predicted class (jests) is output instead of the class label, which may or may not exist for the test set.

File names

The following code specifies all file names.

c
c	-------------------------------------------------------
c	READ OLD TREE STRUCTURE AND/OR PARAMETERS
c
 	if (irunrf.eq.1)
     &		open(1,file='savedforest',status='old')
	if (ireadpar.eq.1)
     &		open(2,file='savedparams',status='old')
	if (ireadfill.eq.1)
     &		open(3, file='savedmissfill',status='old')
 	if (ireadprox.eq.1)
     &		open(4, file='savedprox', status='old')
c
c	-------------------------------------------------------
c	NAME OUTPUT FILES FOR SAVING THE FOREST STRUCTURE
c
 	if (isaverf.eq.1)
     &		open(1, file='savedforest',status='new')
	if (isavepar.eq.1)
     &		open(2, file='savedparams',status='new')
	if (isavefill.eq.1)
     &		open(3, file='savedmissfill',status='new')
 	if (isaveprox.eq.1)
     &		open(4, file='savedprox', status='new')
c
c	-------------------------------------------------------
c	NAME OUTPUT FILES TO SAVE DATA FROM CURRENT RUN
c
 	if (idataout.ge.1)
     &		open(7, file='save-data-from-run',status='new')
	if (impfastout.eq.1)
     &		open(8,file='save-impfast',status='new')
 	if (impout.eq.1)
     &		open(9,file='save-importance-data',status='new')
 	if (impnout.eq.1)
     &		open(10,file='save-caseimp-data',status='new')
	if (interout.eq.1)
     &		open(11,file='save-pairwise-effects',status='new')
	if (iprotout.eq.1)
     &		open(12,file='save-protos',status='new')
 	if (iproxout.ge.1)
     &		open(13, file='save-run-proximities',status='new')
	if (iscaleout.eq.1)
     &		open(14, file='save-scale',status='new')
	if (ioutlierout.ge.1)
     &		open(15, file='save-outliers',status='new')
c
c	-------------------------------------------------------
c	READ IN DATA--SEE MANUAL FOR FORMAT
c
	open(16, file='satimage.tra', status='old')
	do n=1,nsample0
		read(16,*) (x(m,n),m=1,36), cl(n)
	end do
	close(16)
	if(ntest.gt.1) then
		open(17, file='satimage.tes', status='old')
		do n=1,ntest0
			read(17,*) (xts(m,n),m=1,36),clts(n)
		end do
		close(17)
	end if
	

Example settings - satimage data

This is a fast summary of the settings and options for a run of random forests version 5. The code for this example is available here. The code below is what the user sees near the top of the program. The program is set up for a run on the satimage training data which has 4435 cases, 36 input variables and six classes. The satimage test set has 2000 cases with class labels. All options are turned off.

        parameter(
c               DESCRIBE DATA
     1          mdim=36, nsample0=4435, nclass=6, maxcat=1,
     1          ntest=2000, labelts=1, labeltr=1,
c
c               SET RUN PARAMETERS
     2          mtry0=6, ndsize=1, jbt=50, look=10, lookcls=1,
     2          jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,
c
c               SET IMPORTANCE OPTIONS
     3          imp=0, interact=0, impn=0, impfast=0,
c
c               SET PROXIMITY COMPUTATIONS
     4          nprox=0, nrnn=5,
c
c               SET OPTIONS BASED ON PROXIMITIES
     5          noutlier=0, nscale=0, nprot=0,
c
c               REPLACE MISSING VALUES  
     6          code=-999, missfill=0, mfixrep=0,
c
c               GRAPHICS
     7          iviz=0,
c
c               SAVING A FOREST
     8          isaverf=0, isavepar=0, isavefill=0, isaveprox=0,
c
c               RUNNING A SAVED FOREST
     9          irunrf=0, ireadpar=0, ireadfill=0, ireadprox=0)

There are two data files corresponding to the data description. The training data is satimage.tra, the test data is satimage.tes. The training data is read in with the lines:
        open(16, file='satimage.tra', status='old')
        do n=1,nsample0
                read(16,*) (x(m,n),m=1,mdim),cl(n)
        enddo
        close(16)

For the training data, always use the notation x(m,n) for the value of the mth variable in the nth case, and cl(n) for the class number (integer). The test data is read in with the lines:
        if(ntest.gt.1) then
                open(17, file='satimage.tes', status='old')
                do n=1,ntest0
                        read(17,*) (xts(m,n),m=1,mdim),clts(n)
                enddo
                close(17)
        endif
		

For test data, always use the notation xts(m,n) for the value of the mth variable in the nth case, and clts(n) for the class number (integer). Compile and run the code to get the following output on the screen:

=============================================================
 class counts-training data
   1072  479  961  415  470  1038
 
 class counts-test data
   461  224  397  211  237  470
      10     13.39      3.45      3.76      6.97     44.58     19.79     18.69
      10     11.05      1.95      3.12      5.79     37.91     15.19     14.04
 
      20     10.73      2.52      2.71      5.20     42.65     15.32     13.20
      20      9.70      1.08      2.23      5.54     37.91     11.81     11.49
 
      30      9.81      2.05      2.30      4.47     40.24     14.89     11.75
      30      9.50      0.87      3.12      5.79     37.44     11.39     10.64
 
      40      9.31      1.96      2.51      4.16     39.76     14.04     10.50
      40      9.15      1.30      2.68      4.28     36.02     11.81     10.64
 
      50      9.00      2.05      2.30      3.95     39.76     13.40      9.63
      50      9.20      0.87      2.68      5.04     36.49     10.97     10.85
 
 final error rate %        8.99662
 final error test %        9.20000
 
 Training set confusion matrix (OOB):
           true class 
 
     1     2     3     4     5     6
 
     1  1050     0     6     4    25     1
     2     1   468     0     1     2     2
     3    14     0   923    82     1    19
     4     1     4    20   250     6    56
     5     6     5     1     4   407    22
     6     0     2    11    74    29   938
 
 Test set confusion matrix:
           true class 
 
     1     2     3     4     5     6
 
     1   457     0     4     0     8     0
     2     0   218     0     3     3     0
     3     2     1   377    33     0    12
     4     0     1    10   134     1    28
     5     2     2     0     1   211    11
     6     0     2     6    40    14   419
 

The pairs of lines in the output give

	oob estimates of overall error rate and class error rates,
	test set estimates for the same.
    

The output may vary slightly with different compilers and settings of the random number seed.

Suppose you want to save the forest. Then the code is here and the options read as below:

        parameter(
c               DESCRIBE DATA
     1          mdim=36, nsample0=4435, nclass=6, maxcat=1,
     1          ntest=0, labelts=0, labeltr=1,
c
c               SET RUN PARAMETERS
     2          mtry0=6, ndsize=1, jbt=50, look=10, lookcls=1,
     2          jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,
c
c               SET IMPORTANCE OPTIONS
     3          imp=0, interact=0, impn=0, impfast=0,
c
c               SET PROXIMITY COMPUTATIONS
     4          nprox=0, nrnn=5,
c
c               SET OPTIONS BASED ON PROXIMITIES
     5          noutlier=0, nscale=0, nprot=0,
c
c               REPLACE MISSING VALUES  
     6          code=-999, missfill=0, mfixrep=0,
c
c               GRAPHICS
     7          iviz=0,
c
c               SAVING A FOREST
     8          isaverf=1, isavepar=0, isavefill=0, isaveprox=0,
c
c               RUNNING A SAVED FOREST
     9          irunrf=0, ireadpar=0, ireadfill=0, ireadprox=0)
 
Compile and run the code to get the following output on the screen:
=============================================================
  class counts-training data
   1072  479  961  415  470  1038
      10     13.39      3.45      3.76      6.97     44.58     19.79     18.69
      20     10.73      2.52      2.71      5.20     42.65     15.32     13.20
      30      9.81      2.05      2.30      4.47     40.24     14.89     11.75
      40      9.31      1.96      2.51      4.16     39.76     14.04     10.50
      50      9.00      2.05      2.30      3.95     39.76     13.40      9.63
 final error rate %        8.99662
 
 Training set confusion matrix (OOB):
           true class 
 
     1     2     3     4     5     6
 
     1  1050     0     6     4    25     1
     2     1   468     0     1     2     2
     3    14     0   923    82     1    19
     4     1     4    20   250     6    56
     5     6     5     1     4   407    22
     6     0     2    11    74    29   938
  

To run the 2000-case satimage test set down the forest saved from the training set run, the code is here and the options look like this:

        parameter(
c               DESCRIBE DATA
     1          mdim=36, nsample0=1, nclass=6, maxcat=1,
     1          ntest=2000, labelts=1, labeltr=1,
c
c               SET RUN PARAMETERS
     2          mtry0=6, ndsize=1, jbt=50, look=10, lookcls=1,
     2          jclasswt=0, mdim2nd=0, mselect=0, iseed=4351,
c
c               SET IMPORTANCE OPTIONS
     3          imp=0, interact=0, impn=0, impfast=0,
c
c               SET PROXIMITY COMPUTATIONS
     4          nprox=0, nrnn=5,
c
c               SET OPTIONS BASED ON PROXIMITIES
     5          noutlier=0, nscale=0, nprot=0,
c
c               REPLACE MISSING VALUES  
     6          code=-999, missfill=0, mfixrep=0,
c
c               GRAPHICS
     7          iviz=0,
c
c               SAVING A FOREST
     8          isaverf=0, isavepar=0, isavefill=0, isaveprox=0,
c
c               RUNNING A SAVED FOREST
     9          irunrf=1, ireadpar=0, ireadfill=0, ireadprox=0)

	  
Compile and run the code to get the following output on the screen:
=============================================================
      10     11.05      1.95      3.12      5.79     37.91     15.19     14.04
      20      9.70      1.08      2.23      5.54     37.91     11.81     11.49
      30      9.50      0.87      3.12      5.79     37.44     11.39     10.64
      40      9.15      1.30      2.68      4.28     36.02     11.81     10.64
      50      9.20      0.87      2.68      5.04     36.49     10.97     10.85
 final error test %        9.20000
 Test set confusion matrix:
           true class 
 
     1     2     3     4     5     6
 
     1   457     0     4     0     8     0
     2     0   218     0     3     3     0
     3     2     1   377    33     0    12
     4     0     1    10   134     1    28
     5     2     2     0     1   211    11
     6     0     2     6    40    14   419