Budget allocation on differentially private decision trees and random forests

Publication Type: Thesis
Issue Date: 2018
Privacy-preserving techniques are necessary to minimize the risk of identifying individuals and learning sensitive information about them from datasets that are released or shared. Datasets containing sensitive personal information are becoming increasingly public; while this supports data mining research, it also raises information privacy concerns that must be addressed. Many techniques have been developed to preserve the privacy of sensitive public data, with Differential Privacy (DP) being a leading approach. Differential privacy typically adds noise to a computation to prevent the disclosure of individuals' sensitive data; however, adding too much noise compromises the utility of the output, because the resulting predictions become too inaccurate. Differential privacy can be achieved in practice through several mechanisms, including adding Laplace or Gaussian noise to the output of a computation, or through other perturbation techniques, each suited to particular data mining processes and applications. Finding a balance between the privacy of individuals and the utility of the results is one of the biggest challenges in differential privacy.
Differential privacy has received significant attention in the machine learning and data mining communities, in part because the individual privacy guarantee does not fundamentally contradict the learning goal of generalizing well (C. Dwork & Roth, 2014). In supervised learning, where the output of the privacy-preservation mechanism is a classifier and the mechanism itself involves a learning algorithm, minimizing the classification error under privacy constraints is not trivial (Chaudhuri, Monteleoni, & Sarwate, 2011; Friedman & Schuster, 2010). Decision tree induction is a textbook case of this problem: the algorithmic decisions made in the tree under privacy constraints have a deep impact on the accuracy of the results, and an improved privacy-preserving decision tree algorithm can reach the same level of accuracy and privacy as a naive implementation with an order of magnitude fewer learning samples.
Although adding Laplace noise remains one of the main mechanisms for achieving differential privacy in decision trees, a recent line of work has shown that the exponential mechanism yields better trade-offs between accuracy and privacy when selecting the splitting points of top-down induction algorithms such as ID3, C4.5 and CART (Blum, Dwork, McSherry, & Nissim, 2005; Friedman & Schuster, 2010). These studies reveal a striking observation: while the splitting criterion of ID3/C4.5 guarantees a better rate of convergence in classification than CART, because it requires fewer interactive queries to reach a given accuracy level, the privacy cost it incurs for that accuracy level is comparatively larger. This matters because the algorithms are greedy: ID3 and C4.5 require fewer interactive queries to reach a given accuracy, but each query incurs a higher privacy cost than is the case for CART.
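As a rough illustration of how the exponential mechanism can drive a private split choice, the sketch below selects a splitting attribute with probability proportional to the exponential of its scaled utility score. The function name, the example scores, and the sensitivity value are hypothetical placeholders for illustration; they are not the specific algorithm or parameters studied in this thesis.

```python
import numpy as np

def exponential_mechanism_split(scores, epsilon, sensitivity):
    """Choose a candidate split index under the exponential mechanism.

    scores      : utility of each candidate split (e.g. information gain)
    epsilon     : privacy budget spent on this single selection
    sensitivity : maximum change in any score when one record in the
                  node's data is added or removed (depends on the
                  splitting criterion; the value used below is a placeholder)
    """
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting selection probabilities.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(scores), p=probs)

# Example: three candidate attributes scored by information gain,
# selected with a per-node budget of 0.1 and a placeholder sensitivity.
chosen = exponential_mechanism_split([0.42, 0.40, 0.05], epsilon=0.1, sensitivity=1.0)
```

The key property is that higher-utility splits are exponentially more likely to be chosen, yet no single record can change the selection probabilities by more than a factor governed by epsilon.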
Many machine learning and minimization tasks, including decision tree induction, make use of an objective function. At each iteration, a parameter set is defined, the objective function returns a score reflecting how good or bad that parameter set is, the parameter set is altered, and the process repeats. Induction stops when the changes in fit become acceptably small and the tree approaches a local or global optimum; these stopping rules determine the rate of convergence of the tree. When an algorithm converges, it has found a parameter set that satisfies its optimality requirements. From a privacy standpoint, the faster the rate of convergence of a splitting function, the higher the privacy cost it incurs. This inevitably raises the question of whether a splitting function can achieve both goals simultaneously: a fast convergence rate and a small privacy cost. It motivates the research in this thesis to seek other ways of improving how the privacy budget is spent throughout the induction of the decision tree, such as tuning the allocation of the budget across queries.
This thesis presents an adaptive mechanism for allocating the privacy budget, designed for any proper symmetric splitting criterion. In a nutshell, when the tree splits a node, the algorithm probes for the amount of privacy budget to allocate to that split: some nodes need more budget to build an accurate split, while others can be split accurately with far less (a simplified illustration of this idea appears below). This dynamic budget allocation gives finer control over how the budget is consumed and enhances the application of differential privacy to decision trees and forests.
This thesis makes three contributions. Contribution 1 proposes a differentially private budget allocation algorithm for decision tree induction. Contribution 2 extends this scheme to the random forest algorithm; in both cases, it is important to allocate the budget adaptively to the queries that need a larger or smaller privacy/utility trade-off. The algorithms are evaluated on synthetic and real datasets for decision trees and random forests, and the experiments show that their accuracy is competitive with state-of-the-art models. The budget allocation optimization technique has the potential to serve as a building block for other data mining techniques and methods. Contribution 3 introduces a new splitting criterion for decision trees and evaluates its performance against other splitting criteria; the results demonstrate that the best splitting criteria for classification under the boosting framework are not necessarily the best to learn with in a differentially private setting.
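To make the adaptive allocation idea concrete, the sketch below distributes a per-level privacy budget across sibling nodes in proportion to a simple impurity heuristic, so that "harder" splits receive a larger share. The heuristic, the function name, and the parameters are illustrative assumptions only; they are not the probing mechanism developed in the thesis.

```python
import numpy as np

def allocate_split_budgets(node_class_counts, level_budget, floor_fraction=0.05):
    """Spread a per-level privacy budget across the nodes being split.

    Assumption for illustration: nodes with nearly balanced class counts
    (high Gini impurity) need a larger share of the budget to pick a good
    split, while nearly pure nodes only receive a small guaranteed floor.
    """
    impurities = []
    for counts in node_class_counts:
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        impurities.append(1.0 - np.sum(p ** 2))    # Gini impurity of the node
    impurities = np.asarray(impurities)

    n = len(node_class_counts)
    floor = floor_fraction * level_budget          # small guaranteed share per node
    remaining = level_budget - floor * n           # remainder is split by "need"
    if impurities.sum() > 0:
        weights = impurities / impurities.sum()
    else:
        weights = np.full(n, 1.0 / n)              # all nodes pure: split evenly
    return floor + remaining * weights

# Example: three sibling nodes sharing a per-level budget of 1.0.
# The allocations sum to 1.0 and the balanced node gets the largest share.
budgets = allocate_split_budgets([[50, 50], [90, 10], [99, 1]], level_budget=1.0)
```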