Willows

Willows is a software package that includes three classifiers: classification tree, random forest, and deterministic forest. This package is built on the basis of Heping Zhang's RTREE program with two distinctive features. First, the cumulation of data on single nucletide polymorphisms (SNPs) has created data so huge that we have to take specific steps to improve the memory use of the existing software. Willows implements the most efficient memory use for SNP data, while maintaining its general functionality. The second important feature of Willows is a friendly graphical user interface.

Citation

  1. H Zhang, M Wang and X Chen. Willows: A Memory Efficient Tree and Forest Construction Software. BMC Bioinformatics. 10:130, 2009. 
  2. H Zhang and B Singer. Recursive Partitioning in Health Sciences. Springer, 1999. 
  3. H Zhang, C Yu, and B Singer, B. Cell and Tumor Classification using Gene Expression Data: Construction of Forests. Proceedings of the National Academy of Sciences USA, 100: 4168-4172, 2003. 
  4. L Breiman L. Random forests. Machine Learning, 45: 5-32, 2001.

Back to top

Condition of Use

Willows can be used and distributed free of charge under the following terms and conditions. 

Any research using this program or the methods or ideas behind it should acknowledge the use of Willows, and cite at least the first reference in the Citation section. 

THIS SOFTWARE IS PROVIDED BY THE COLLABORATIVE CENTER FOR STATISTICS IN SCIENCE AT YALE UNIVERSITY AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COLLABORATIVE CENTER FOR STATISTICS IN SCIENCE AT YALE UNIVERSITY BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Back to top

Versions

Willows 1.1 - November 20, 2009

Methodology

Recursive partition is the general algorithm behind the tree and forest construction has been discussed in detail by Zhang and Singer (1999), Breiman (2001), and Zhang et al. (2003). Here, we focus on the most unique feature of Willows; namely, the most efficient use of memory for SNP data. In genetic studies, a SNP-based genotype has four possible choices: 0 (AA), 1 (AB), 2 (BB) or 3 (missing). Each choice can be represented by 2 bits. Thus, 16 genotypes can be packed into one integer data type (4 bytes) in Java or C++ using bit shift operators. The theoretical compression ratio is 4:1 compared to a byte storage scheme, and 32:1 compared to a double storage scheme.

Back to top

Input File Formats

(I) Main input file

Willows supports input files in the same simple format as in RTREE: the first line indicates the variable type (response, nominal or ordinal) with no particular order. R or r implies the response; D or "d" implies deleting that covariate from the analysis; O or "o" implies a continuous or an ordinal covariate; N or "n" implies nominal covariate. The columns should be separated by spaces or tabs (any number of these is allowed). In the predictors matrix, NA or . represent missing value. Response can only be integers, missing value in response column is not allowed. An improvement from the RTREE is that it reads an extra line for the name of each covariate. 

We explain this file using the sample.txt for the general classification purpose, and sample_SNP.txt for the use involving SNPs as predictors. The format of sample_SNP.txt is the same as sample.txt, except that the SNP predictors must be coded as 0, 1, or 2 if available, and NA or . (without the quotes) if it is missing.

The first row in sample.txt and sample_SNP.txt is

r o o o o o o o o o o o o o o o o o o o o

The first letter r means that the first column represents the response, and the other o specifies that all predictors are ordinal and none of the predictors will be excluded.

The second row in sample.txt and sample_SNP.txt is

a a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 ...

The first letter a is the name for the response column, and a1 specifies the name of each covariate.

The data begin from the second row. Specifically, the second row in sample.txt is as follows:

0 0.2006632 0.588495 0.5081931 0.1813249 ...

and the second row in sample_SNP.txt is as follows:

0 1 0 1 0 0 2 1 1 1 1 2 2 0 0 2 1 2 0 1 1

The remaining rows are organized similarly.

(II) Additional input file

To use the prediction function that predicts an outcome based on the predictors, additional data files are needed as inputs for prediction.

The format of additional input files is the same as the main input file with one exception. The response column is not required. If it is included, it will be ignored for the prediction purpose.

Back to top

Output file formats

All output files are all saved in local drives for future review and analysis. They can be imported back to Willows at any time.

(I) .inf file.

The .inf file describes the datasets and the tree structures.

Part I:  Summary of the data set

It describes the number of covariates, the types of the covariates, the frequencies or ranges of the covariates

Part II: Information for tree reconstruction

Each tree is described in by one row. In each row, the 1st number is the serial ID of the tree. The remaining numbers describe the nodes in the tree.

An internal node is characterized as follows: (1) the node number within the tree (node 1 is the root node); (2) the sequence number for the left daughter node (when there is no daughter node, the node is terminal, and in this case, 0 is used to indicate a terminal node); (3) the sequence number of the splitting variable; and (4) the corresponding splitting value (a numerical splitting value means that the splitting variable is ordinal and a set of integers means that the splitting variable is nominal).

A terminal node is characterized as follows: (1) the node number within the tree (node 1 is the root node); (2) 0 (because it has no daughter node); and (3) the frequency counts of the response variable (one for each level of the response).

(II) .importance file

The .importance file records the out-of-bag error and the raw importance scores of each variable calculated by the random forest method.

The format is as follows:

The 1st line: the out of bag error rate.

The 2nd line: the number of samples

The 3rd line: string: index importance

The 4th line and the rest:

Column 1: variable index

Column 2: importance score

(III) .count file.

The .count file records the frequency counts that each predictor is used for splitting in all trees of a deterministic forest.

The format is as follows:

The 1st line: number of variables selected

The 2nd line: string: index count

The 3rd line and the rest:

Column 1: variable index

Column 2: the count of a variable appearing as a splitting variable in the forest

(IV) .pre file.

.pre file records the prediction results of the test datasets.

The format is described below:

The 1st line: the number of samples for prediction

The 2nd line through the end: the predicted responses. For a forest, one prediction is made for each sample. For multiple trees (RTREE), one prediction is made by each of the trees and organized in one column.

Back to top

Downloads

Executables

  • Willows_Windows.exe: Self-extracting archive for installation on Windows system.
  • Willows_Linux32.tar.gz: GZIP-compressed Tar file containing executable on 32-bit Linux system. (The javax.swing package is required to be installed in your system before running Willows.)
  • Willows_Linux64.tar.gz: GZIP-compressed Tar file containing executable on 64-bit Linux system. (The javax.swing package is required to be installed in your system before running Willows.)

Sample Files

Back to top

Running Willows

a. Running Willows with user-friendly graphical interface on Windows or Linux.

1. After downloading the executables for either Windows or Linux according to your operating system, and extracting all files from the archive, double click on Willows_win.jar to start the program. 

2. To analyze a dataset, click on File menu and click on load file to add an input file. 

Please note that if you are running Willows with the GUI in Linux, please make sure there is no white space in the file names or paths. White spaces in file names and paths will cause a Wrong number of parameters error.

3. After adding the input file into the project, you can select a classification algorithm from the Algorithm Menu: Rtree for classification tree, Dforest for deterministic forest, and Rforest for random forest. To use the memory compression version for SNP data, the Compressing SNPs box should be checked. Depending on the selected classification method, different inputs are needed.
4. If the prediction box is checked, the predicting data files must also be added. Use the Load File button from File Menu in step 2 to select files, and use the Add button to add them into the list. If you want to change the list, please uncheck the prediction box and then check it again, which will clear the list.

Click Run button to perform the computation. At any time you can click the Stop button to stop the calculation. 

6. After the computation is complete, Willows displays the results in new windows. 

i. Trees structure 

Click <<, <, >, >> to go to the first tree, the previous tree, the next tree, and the last tree, respectively. Users can also enter a specific tree number and browse it. The Save button saves the current tree structure into a png file. The following window shows the first six layers of a tree, which is in general the most important part of the tree. 

Internal and terminal nodes are represented by ellipsoids and rectangles, respectively. The frequency counts of the response are displayed inside each node, and the splitting variable and the corresponding thresholds are provided for internal nodes.

ii. Prediction results. If the prediction checkbox is checked, predicted responses will be produced. Click <<, <, >, and >> to review the results for the first, the previous, the next and the last test data set, respectively.
iii. Importance scores. If the random forest algorithm is selected, the out-of-bag error and the variable importance score will be computed.
iv. Frequency count. If the deterministic forest algorithm is used, the frequency that each predictor is used for splitting in all trees is counted.
7. A user can click on View previous result to open the dialog that contains the options to display saved output files.

Back to top

b. Running Willows from the command line in Windows.

After downloading the executables for Windows and extracting all files from the archive, there are six executable files: rtree.exe, rtree_SNP.exe, rforest.exe, rforest_SNP.exe, dforest.exe and dforest_SNP.exe in the folder, representing the classification tree, random forest, deterministic forest methods and their memory compression versions, respectively. 

To run any of these programs, open the command line window, go to the folder where the programs and files are saved, and run the executable program. The details of the parameters are listed below. 

Please note that if the files are not located in the same folder as the executable, use the full path instead of just the file names. Please make sure if there is white space in the file names or paths, the names or paths should be quoted.

exe fileparameter
rtree
rtree_SNP
-t input file name
-s 1 (default, for entropy impurity) or 2 (gini impurity)
-p test file name
-r a positive number less than or equal to 1 representing the threshold to remove a variable according to the percentage of missing observations (default, all variables are included).
-minnode a positive number representing minimum size of terminal nodes (default 1).
rforest
rforest_SNP
-t input file name
-s 1 (default, for entropy impurity) or 2 (gini impurity)
-p test file name
-r a positive number less than or equal to 1 representing the threshold to remove a variable according to the percentage of missing observations (default, all variables are included).
-n  the number of trees in a forest (default 1000)
-m  the number of allowable variables to split a node (default log2(number of variables))
minnode a positive number representing minimum size of terminal nodes (default 1).
dforest
dforest_SNP
-t input file name
-s 1 (default, for entropy impurity) or 2 (gini impurity)
-p test file name
-r a positive number less than or equal to 1 representing the threshold to remove a variable according to the percentage of missing observations (default, all variables are included).
-o the number of top splits for the root node (default 20)
-n  the number of top splits for the internal nodes (the root node excluded) (default 3).
minnode a positive number representing minimum size of terminal nodes (default 1).

Back to top

c. Running Willows from the command line in Linux.

After downloading the executables for Linux and extracting all files using command: tar xvf Willows_Linux32.tar.gz or tar xvf Willows_Linux64.tar.gz, there are six executable files: rtree, rtree_SNP, rforest, rforest_SNP, dforest and dforest_SNP in the folder, representing the classification tree, random forest, deterministic forest methods and their memory compression versions, respectively. The input parameters are the same as those described above. 

Please note that if the files are not located in the same folder as the executables, use the full path instead of just the file names. Please make sure there is no white space in the file names or paths. White spaces in file names and paths will cause a Wrong number of parameters error.

Back to top