Comparison estimating of classification error rate in decision tree: Data mining

Yousef M. T. El Gimati *

Statistics Department, Faculty of Science, University of Benghazi, Libya.
 
Research Article
Global Journal of Engineering and Technology Advances, 2021, 07(02), 067–082.
Article DOI: 10.30574/gjeta.2021.7.2.0068
Publication history: 
Received on 06 April 2021; revised on 09 May 2021; accepted on 12 May 2021
 
Abstract: 
A decision tree (DT) typically applies its splitting criteria to one variable at a time, so the final partition has boundaries that are parallel to the axes. An observation is misclassified when it falls in a region whose assigned class differs from its own. In a classification tree, the misclassification rate is defined as the proportion of observations assigned to the wrong class; in a regression tree, the corresponding error measure is the mean squared error. Because all classification procedures, including decision trees, can produce errors, this paper presents two important methods for estimating the misclassification (error) rate of decision trees.
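The two error measures named above can be written compactly. The notation is an assumption made for illustration and is not taken from the paper: n is the number of observations, y_i the observed response, \hat{y}_i the tree's prediction, and I(\cdot) the indicator function.

```latex
\[
\text{Misclassification rate} \;=\; \frac{1}{n}\sum_{i=1}^{n} I\bigl(y_i \neq \hat{y}_i\bigr),
\qquad
\text{MSE} \;=\; \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^{2}.
\]
```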
A DT model is constructed from a training dataset and tested on an independent test dataset. Several procedures exist for estimating the error rate of tree-structured classifiers, such as K-fold cross-validation and the bootstrap. The comparison aims to characterize the performance of the two methods in terms of test error rates on real datasets. The results indicate that 10-fold cross-validation and the bootstrap both yield a tree fairly close to the best available, as measured by tree size.
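The two resampling estimators can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the tree is reduced to a single axis-parallel split (a stump), the bootstrap estimate is the out-of-bag variant (the abstract does not say which variant is used), the data are synthetic, and all function names (`fit_stump`, `kfold_error`, `bootstrap_error`) are hypothetical.

```python
import random

def fit_stump(X, y):
    """Fit a one-split tree: choose the single variable and threshold
    that minimise the number of training misclassifications."""
    def majority(labels, default):
        return max(set(labels), key=labels.count) if labels else default
    overall = majority(y, None)
    best = None
    for j in range(len(X[0])):                      # one variable at a time
        for t in sorted({row[j] for row in X}):     # axis-parallel thresholds
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            cl, cr = majority(left, overall), majority(right, overall)
            errors = sum(yi != cl for yi in left) + sum(yi != cr for yi in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, cl, cr)
    _, j, t, cl, cr = best
    return lambda row: cl if row[j] <= t else cr

def error_rate(model, X, y):
    """Proportion of observations assigned to the wrong class."""
    return sum(model(row) != yi for row, yi in zip(X, y)) / len(y)

def kfold_error(X, y, k=10, seed=0):
    """K-fold cross-validation: each observation is predicted by a model
    that was fitted without the fold containing it."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    wrong = 0
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        model = fit_stump([X[i] for i in train], [y[i] for i in train])
        wrong += sum(model(X[i]) != y[i] for i in fold)
    return wrong / len(y)

def bootstrap_error(X, y, B=50, seed=0):
    """Out-of-bag bootstrap (assumed variant): fit on a resample drawn with
    replacement, test on the observations the resample left out."""
    rng = random.Random(seed)
    n = len(y)
    rates = []
    for _ in range(B):
        sample = [rng.randrange(n) for _ in range(n)]
        drawn = set(sample)
        oob = [i for i in range(n) if i not in drawn]
        if not oob:                                  # nothing left to test on
            continue
        model = fit_stump([X[i] for i in sample], [y[i] for i in sample])
        rates.append(error_rate(model, [X[i] for i in oob], [y[i] for i in oob]))
    return sum(rates) / len(rates)

# Synthetic two-variable data (an assumption): class 1 when the first variable >= 20.
X = [[i, (i * 7) % 13] for i in range(40)]
y = [int(i >= 20) for i in range(40)]

cv_err = kfold_error(X, y, k=10)
boot_err = bootstrap_error(X, y, B=50)
print(f"10-fold CV error: {cv_err:.3f}, bootstrap (OOB) error: {boot_err:.3f}")
```

Both estimators evaluate each fitted model only on observations it did not see during fitting, which is what separates them from the optimistic training error.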
 
Keywords: 
Cross-validation; Bootstrap; Misclassification; Training error; Test error; Tree size
 