X. Chen, C.-T. Liu, M. Zhang and H.P. Zhang. A forest-based approach to identifying gene and gene-gene interactions, PNAS, 104: 19199-19203, 2007.

To Perform an Analysis:

  1. Apply the recursive classification tree program (rtree) using the individual SNPs as features and the disease status as the outcome. The description of the program and sample files are provided at our rtree page.
  2. Use Haploview to construct haplotype blocks containing the SNPs identified via rtree. Please refer to Haploview (http://www.broad.mit.edu/mpg/haploview/download.php) for more information.
  3. Use SNPHAP to estimate the haplotype frequencies in the haplotype blocks identified in the previous step. Please see SNPHAP (http://www-gene.cimr.cam.ac.uk/clayton/software/) for details.
  4. Use HapForest to identify haplotypes and haplotype-haplotype interactions in association with the disease. The program can be downloaded here.
    1. Usage and Input files

      To invoke HapForest from the command line, enter java -jar toRun.jar response_file hap_file1 hap_file2 ... in the installation folder.

      The response_file specifies the response (disease status) of each subject, in which 1 stands for affected and 0 for unaffected. A sample response_file can be found here.

      Each of hap_file1, hap_file2, and so on corresponds to the haplotype configuration of one region, as output by SNPHAP. The order of the subjects in these files should be the same as in the response_file. A sample hap_file is provided here. The number of hap_files depends on the number of haplotype blocks identified in the previous steps.
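The invocation described above can be scripted when there are many haplotype blocks. The following is a minimal sketch, not part of HapForest: the function names, the requirement of at least one hap_file, and the example file names are assumptions made for illustration. It only assembles the documented argument order (response_file first, then one hap_file per block) and runs the command from the installation folder.

```python
import subprocess
from typing import List, Sequence


def build_hapforest_command(response_file: str,
                            hap_files: Sequence[str],
                            jar: str = "toRun.jar") -> List[str]:
    """Assemble: java -jar toRun.jar response_file hap_file1 hap_file2 ..."""
    # Assumption: at least one haplotype block file must be supplied.
    if not hap_files:
        raise ValueError("at least one hap_file is required")
    # The response file comes first, followed by one hap_file per block.
    return ["java", "-jar", jar, response_file, *hap_files]


def run_hapforest(response_file: str,
                  hap_files: Sequence[str],
                  install_dir: str = ".") -> subprocess.CompletedProcess:
    """Run HapForest with the installation folder as the working directory."""
    cmd = build_hapforest_command(response_file, hap_files)
    # check=True raises CalledProcessError if java exits with a non-zero status.
    return subprocess.run(cmd, cwd=install_dir, check=True)
```

For example, build_hapforest_command("response.txt", ["block1.txt", "block2.txt"]) returns the argument list for a two-block analysis; the file names here are hypothetical.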

    2. Output files

      Two types of output files are generated from the program.

      Real_out.txt contains the haplotypes identified by HapForest. The first n rows of the file are dedicated to the haplotypes identified from the real data, where n is the number of haplotypes. For each haplotype, we list the haplotype block that the haplotype comes from, its haplotype value, its importance value and its p-value. The remainder of the file lists the maximum importance value of the haplotypes from each permuted case. A sample Real_out.txt file with 5000 permuted cases can be found here.
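To illustrate how the per-haplotype importance values and the permuted maxima relate, here is a minimal sketch of a max-statistic permutation p-value. This is a common convention and an assumption on our part: the text above does not state the exact formula HapForest uses (for example, whether ties count or whether a correction term is added). The function name and the input lists are hypothetical, and the values are assumed to have already been read from Real_out.txt.

```python
from typing import Sequence


def permutation_p_value(observed: float,
                        permuted_maxima: Sequence[float]) -> float:
    """Fraction of permuted cases whose maximum importance reaches the observed value."""
    if not permuted_maxima:
        raise ValueError("at least one permuted case is required")
    # Assumption: ties (maximum == observed) count as exceedances.
    exceed = sum(1 for m in permuted_maxima if m >= observed)
    return exceed / len(permuted_maxima)
```

Because each permuted case contributes only its maximum importance value, comparing every haplotype against the same set of maxima accounts for the number of haplotypes tested.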

      The configuration files, named hapfile 1_config.txt, hapfile 2_config.txt and so on, contain the relative orders of the haplotypes as they appear in Real_out.txt. For each case, real or permuted, the importance values of the haplotypes listed in hapfile 2_config.txt are appended after the importance values of the haplotypes in hapfile 1_config.txt, and so on. A sample configuration file is provided here. Again, the number of configuration files depends on the number of haplotype blocks.
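The ordering rule above (block 1 haplotypes first, then block 2, and so on) can be sketched as follows. This is an illustrative helper, not part of HapForest: it assumes the haplotype order of each block has already been read from its configuration file into a list, makes no assumption about the configuration file format itself, and uses a hypothetical function name and made-up haplotype strings.

```python
from typing import List, Sequence, Tuple


def concatenated_labels(blocks: Sequence[Sequence[str]]) -> List[Tuple[int, str]]:
    """Label each position of the concatenated importance values as (block number, haplotype)."""
    labels: List[Tuple[int, str]] = []
    # Blocks are numbered from 1, matching the order of the configuration files.
    for block_number, haplotypes in enumerate(blocks, start=1):
        # Within a block, keep the order given by that block's configuration file.
        for haplotype in haplotypes:
            labels.append((block_number, haplotype))
    return labels
```

With two blocks listed as ["AC", "GT"] and ["TTA"], the result is [(1, "AC"), (1, "GT"), (2, "TTA")], so the third importance value belongs to the only haplotype of the second block.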