“Willows” is a software package that includes three classifiers: classification tree, random forest, and deterministic forest. This package is built on the basis of Heping Zhang's RTREE program with two distinctive features. First, the cumulation of data on single nucletide polymorphisms (SNPs) has created data so huge that we have to take specific steps to improve the memory use of the existing software. “Willows” implements the most efficient memory use for SNP data, while maintaining its general functionality. The second important feature of “Willows” is a friendly graphical user interface.
Contents
|
|
|
1.
H Zhang, M Wang and X Chen.
Willows: A Memory Efficient Tree and
2. H Zhang and B Singer. Recursive Partitioning in Health Sciences. Springer, 1999.
3.
H Zhang, C Yu, and B Singer, B.
Cell and Tumor Classification using Gene Expression Data: Construction of
Forests. Proceedings of the National
4. L Breiman L. Random forests. Machine Learning, 45: 5-32, 2001.
Condition of use
“Willows” can be used and distributed free
of charge under the following terms and conditions.
Any research using this program or the methods or ideas behind it
should acknowledge the use of “Willows”, and cite at least the first
reference in the Citation section.
THIS SOFTWARE IS PROVIDED BY THE
COLLABORATIVE CENTER FOR STATISTICS IN SCIENCE AT YALE UNIVERSITY “AS IS” AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COLLABORATIVE CENTER FOR STATISTICS IN
SCIENCE AT YALE UNIVERSITY BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
|
Willows 1.1
November 20, 2009
|
|
Recursive partition is the general algorithm behind the tree and forest construction has been discussed in detail by Zhang and Singer (1999), Breiman (2001), and Zhang et al. (2003). Here, we focus on the most unique feature of “Willows”; namely, the most efficient use of memory for SNP data. In genetic studies, a SNP-based genotype has four possible choices: 0 (AA), 1 (AB), 2 (BB) or 3 (missing). Each choice can be represented by 2 bits. Thus, 16 genotypes can be packed into one integer data type (4 bytes) in Java or C++ using bit shift operators. The theoretical compression ratio is 4:1 compared to a byte storage scheme, and 32:1 compared to a double storage scheme.
(I)
“Willows” supports input files in the same simple format as in RTREE : the first line indicates the variable type (response, nominal or ordinal) with no particular order. “R” or “r” implies the response; “D” or "d" implies deleting that covariate from the analysis; “O” or "o" implies a continuous or an ordinal covariate; “N” or "n" implies nominal covariate. The columns should be separated by spaces or tabs (any number of these is allowed). In the predictors matrix, ‘NA’ or ‘.’ represent missing value. Response can only be integers, missing value in response column is not allowed. An improvement from the RTREE is that it reads an extra line for the name of each covariate.
We explain this file using the sample.txt for the general classification purpose, and sample_SNP.txt for the use involving SNPs as predictors. The format of sample_SNP.txt is the same as sample.txt, except that the SNP predictors must be coded as 0, 1, or 2 if available, and NA or ‘.’ (without the quotes) if it is missing.
The first row in sample.txt and sample_SNP.txt is
|
r o o o o o o o o o o o o o o o o o o o
o |
The first letter “r” means that the
first column represents the response, and the other “o” specifies that all
predictors are ordinal and none of the predictors will be excluded.
The second row in sample.txt and sample_SNP.txt is
|
a a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12
... |
The first letter “a” is the name for
the response column, and “a1…” specifies the name of each covariate.
The data begin from the second row.
Specifically, the second row in sample.txt is as follows:
|
0 0.2006632 0.588495 0.5081931 0.1813249 ... |
and the second row in sample_SNP.txt is as follows:
|
0 1 0 1 0 0 2 1 1
1 1 2 2
0 0 2 1 2 0 1 1 |
The remaining rows are organized similarly.
To use the prediction function that predicts an outcome based
on the predictors, additional data files are needed as inputs for prediction.
The format of additional input files is the same as the main input file with one exception. The response column is not required. If it is included, it will be ignored for the prediction purpose.
All output files are all saved in local drives for future review and analysis. They can be imported back to “Willows” at any time.
(I) .inf file.
The .inf file describes the datasets and the tree structures.
Part I: Summary of the data set
It describes the number of covariates, the types of the covariates, the frequencies or ranges of the covariates
Part II: Information for tree reconstruction
Each tree is described in by one row. In each row, the 1st number is the serial ID of the tree. The remaining numbers describe the nodes in the tree.
An internal node is characterized as follows: (1) the node number within the tree (node 1 is the root node); (2) the sequence number for the left daughter node (when there is no daughter node, the node is terminal, and in this case, 0 is used to indicate a terminal node); (3) the sequence number of the splitting variable; and (4) the corresponding splitting value (a numerical splitting value means that the splitting variable is ordinal and a set of integers means that the splitting variable is nominal).
A terminal node is characterized as follows: (1) the node number within the tree (node 1 is the root node); (2) 0 (because it has no daughter node); and (3) the frequency counts of the response variable (one for each level of the response).
The .importance file records the out-of-bag error and the raw importance scores of each variable calculated by the random forest method.
The format is as follows:
The 1st line: the out of bag error rate.
The 2nd line: the number of samples
The 3rd line: string: “index importance”
The 4th line and the rest:
Column 1: variable index
Column 2: importance score
The .count file records the frequency counts that each predictor is used for splitting in all trees of a deterministic forest.
The format is as follows:
The 1st line: number of variables selected
The 2nd line: string: “index count”
The 3rd line and the rest:
Column 1: variable index
Column 2: the count of a variable appearing as a splitting variable in the forest
.pre file records the prediction results of the test datasets.
The format is described below:
The 1st line: the number of samples for prediction
The 2nd line through the end: the predicted responses. For a forest, one prediction is made for each sample. For multiple trees (RTREE), one prediction is made by each of the trees and organized in one column.
l
Executables
n Willows_Windows.exe: Self-extracting archive for installation on Windows system.
n Willows_Linux32.tar.gz: GZIP-compressed Tar file
containing executable on 32-bit Linux system. (The javax.swing
package is required to be installed in your system before running “Willows”.)
n Willows_Linux64.tar.gz: GZIP-compressed Tar file
containing executable on 64-bit Linux system. (The javax.swing
package is required to be installed in your system before running “Willows”.)
l
Sample Files
n sample.txt: A sample for the input file.
n sample_SNP.txt: A sample for inputting SNP data.
n test.txt: A sample for prediction.
n test_SNP.txt: A sample for prediction with SNP data.
n rtree_sample.txt.inf: Classification tree output file - data description and tree structure.
n rtree_test.txt.pre: Classification tree output file - prediction results.
n rtree_SNP_sample_SNP.txt.inf: Classification tree output file with SNP data - data description and tree structure.
n rtree_SNP_test_SNP.txt.pre: Classification tree output file with SNP data - prediction results.
n rforest_sample.txt.inf: Random forest output file - data description and tree structure.
n rforest_sample.txt.importance: Random forest output file - variable importance score.
n rforest_test.txt.pre: Random forest output file - prediction results.
n rforest_SNP_sample_SNP.txt.inf: Random forest output file with SNP data - data description and tree structure.
n rforest_SNP_sample_SNP.txt.importance: Random forest output file with SNP data - variable importance score.
n rforest_SNP_test_SNP_.txt.pre: Random forest output file with SNP data - prediction results.
n dforest_sample.txt.inf: Deterministic forest output file - data description and tree structure.
n dforest_sample.txt.count: Deterministic forest output file - variable count.
n dforest_test.txt.pre: Deterministic forest output file - prediction results.
n dforest_SNP_sample_SNP.txt.inf: Deterministic forest output file with SNP data - data description and tree structure.
n dforest_SNP_sample_SNP.txt.count: Deterministic forest output file with SNP data - variable count.
n dforest_SNP_test_SNP.txt.pre: Deterministic forest output file with SNP data - prediction results.
a. Running
Willows with user-friendly graphical interface on Windows or Linux.
1.
After downloading the executables for either
Windows or Linux according to your operating system, and extracting all files
from the archive, double click on ForestApplication.jar to
start the program.
Please note the resolution should be set to a minimal resolution
of 1280*1024 to achieve a quality view.
2.
To analyze a dataset, click on “Browse” Button to
add an input file.
Please
note that if you are running “Willows” with the GUI in Linux, please make sure
there is no white space in the file names or paths. White spaces in file names
and paths will cause a “Wrong number of parameters” error.

3.
After adding the input file into the project, you
can select a classification algorithm: Rtree for
classification tree, Dforest for deterministic
forest, and Rforest for random forest. To use the
memory compression version for SNP data, the “Compressing SNPs” box
should be checked. Depending on the selected classification method, different
inputs are needed.

4. If the prediction box is checked, the predicting data files must also be added. Use the “Browse” button in step 2 to select files, and use the “Add” button to add them into the list. If you want to change the list, please uncheck the prediction box and then check it again, which will clear the list.

5. Click “Run” button to perform the computation. At any time you can click the “Stop” button to stop the calculation.
6. After the computation is complete, “Willows” displays the results in new windows.
i. Trees structure
Click “<<”, “<”, “>”, “>>” to go to the first tree, the previous tree, the next tree, and the last tree, respectively. Users can also enter a specific tree number and browse it. The “Save” button saves the current tree structure into a png file. The following window shows the first six layers of a tree, which is in general the most important part of the tree.
Internal and terminal nodes are represented by ellipsoids and rectangles, respectively. The frequency counts of the response are displayed inside each node, and the splitting variable and the corresponding thresholds are provided for internal nodes.

ii. Prediction results.
If the prediction checkbox is checked, predicted responses will be produced. Click “<<”, “<”, “>”, and “>>” to review the results for the first, the previous, the next and the last test data set, respectively.

iii. Importance scores.
If the random forest algorithm is selected, the out-of-bag error and the variable importance score will be computed.

iv. Frequency count.
If the deterministic forest algorithm is used, the frequency that each predictor is used for splitting in all trees is counted.

7. A user can click on “View Result Files” to open the dialog that contains the options to display saved output files.

b. Running
“Willows” from the command line in Windows.
After downloading the executables for
Windows and extracting all files from the archive, there are six executable
files: rtree.exe, rtree_SNP.exe, rforest.exe, rforest_SNP.exe, dforest.exe and dforest_SNP.exe in
the folder, representing the classification tree, random forest, deterministic
forest methods and their memory compression versions, respectivelly.
To run any of these programs, open the command line window, go to the folder where the programs and files are saved, and run the executable program. The details of the parameters are listed below.
Please note that if the files are not located in the same folder as the executable, use the full path instead of just the file names. Please make sure if there is white space in the file names or paths, the names or paths should be quoted.
|
exe file |
parameter |
|
rtree rtree_SNP |
-t input file name -s 1 (default, for entropy impurity) or 2 (gini impurity) -p test file name -r a positive number less than or equal to 1 representing the threshold to remove a variable according to the percentage of missing observations (default, all variables are included). -minnode a positive number representing minimum size of
terminal nodes (default 1). |
|
rforest rforest_SNP |
-t input file name -s 1 (default, for entropy impurity) or 2 (gini impurity) -p test file name -r a positive number less than or equal to 1 representing the threshold to remove a variable according to the percentage of missing observations (default, all variables are included). -n the number of trees in a forest (default 1000) -m the number of allowable variables to split a node (default log2(number of variables)) -minnode a positive number representing minimum size of terminal nodes (default 1). |
|
dforest dforest_SNP |
-t input file name -s 1 (default, for entropy impurity) or 2 (gini impurity) -p test file name -r a positive number less than or equal to 1 representing the threshold to remove a variable according to the percentage of missing observations (default, all variables are included). -o the number of top splits for the root node (default 20) -n the number of top splits for the internal nodes (the root node excluded) (default 3). -minnode a positive number representing minimum size of terminal nodes (default 1). |
c. Running
“Willows” from the command line in Linux.
After downloading the executables for Linux and extracting all
files from the archive, there are six executable files: rtree, rtree_SNP, rforest, rforest_SNP, dforest and dforest_SNP in
the folder, representing the classification tree, random forest, deterministic
forest methods and their memory compression versions, respectively. The input
parameters are the same as those described above.
Please
note that if the files are not located in the same folder as the executables,
use the full path instead of just the file names. Please make sure there is no
white space in the file names or paths. White spaces in file names and paths
will cause a “Wrong number of parameters” error.