c++ - How do I calculate the value of the information gain in order to reduce floating-point approximation errors?
I have a dataset containing features belonging to 2 class labels, denoted 1 and 2. The dataset is processed in order to build a decision tree: during the construction of the tree, I need to calculate the information gain to find the best partitioning of the dataset.
Let there be n1 features associated with label 1 and n2 features associated with label 2. The entropy can be calculated with the following formula:
entropy = - (n1/n)*log2(n1/n) - (n2/n)*log2(n2/n)
where n = n1 + n2.
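As a minimal sketch of how this formula can be implemented (the helper name entropy and the zero-count guard are assumptions, not part of the question), note that a term with a zero count contributes 0, since x*log2(x) -> 0 as x -> 0:

    #include <cmath>

    // Two-class entropy for n1 items of class 1 and n2 items of class 2.
    // A term with a zero count contributes 0, since x*log2(x) -> 0 as x -> 0.
    double entropy(double n1, double n2)
    {
        double n = n1 + n2;
        double result = 0.0;
        if (n1 > 0.0) result -= (n1 / n) * std::log2(n1 / n);
        if (n2 > 0.0) result -= (n2 / n) * std::log2(n2 / n);
        return result;
    }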
I need to calculate 3 values of entropy in order to obtain the information gain:
entropyBefore, the entropy before the partitioning of the current dataset;
entropyLeft, the entropy of the left split after the partitioning;
entropyRight, the entropy of the right split after the partitioning.

So, the information gain is equal to entropyBefore - (s1/n)*entropyLeft - (s2/n)*entropyRight, where s1 is the number of features belonging to split 1 and s2 is the number of features belonging to split 2.
How do I calculate the value of the information gain so as to cut down floating-point approximation errors? When I apply the above formulas in cases where the information gain must be zero, the calculated value is instead equal to a small negative value.
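As an illustration (a minimal sketch, not the asker's code; entropy is the hypothetical helper above), a split that preserves the class proportions on both sides has an exact information gain of zero, yet the floating-point evaluation may come out as a tiny nonzero, possibly negative, value:

    #include <cmath>
    #include <cstdio>

    double entropy(double n1, double n2)
    {
        double n = n1 + n2;
        double result = 0.0;
        if (n1 > 0.0) result -= (n1 / n) * std::log2(n1 / n);
        if (n2 > 0.0) result -= (n2 / n) * std::log2(n2 / n);
        return result;
    }

    int main()
    {
        // 30 features of class 1 and 70 of class 2, split into parts of
        // 40 and 60 rows, each preserving the 30:70 proportions: the
        // exact information gain is 0.
        double entropyBefore = entropy(30, 70);
        double entropyLeft   = entropy(12, 28);  // 12/40 = 0.3
        double entropyRight  = entropy(18, 42);  // 18/60 = 0.3
        double gain = entropyBefore
                    - (40.0 / 100.0) * entropyLeft
                    - (60.0 / 100.0) * entropyRight;
        std::printf("%.20g\n", gain);  // may print a tiny nonzero value
        return 0;
    }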
UPDATE (sample code):
    double n = static_cast<double>(this->rows()); // rows count of the dataset
    double entropyBefore = this->entropy();       // current entropy (before performing the split)
    bool firstCheck = true;
    double bestSplitIG;

    for each possible split // pseudocode: iterate over all candidate splits
    {
        // ...
        pair<Dataset, Dataset> splitPair = split(..., ...);
        double s1 = splitPair.first.rows();
        double s2 = splitPair.second.rows();
        double entropyLeft = splitPair.first.entropy();
        double entropyRight = splitPair.second.entropy();

        double splitIG = entropyBefore - (s1/n*entropyLeft + s2/n*entropyRight);
        if (firstCheck || splitIG > bestSplitIG)
        {
            bestSplitIG = splitIG;
            // ...
            firstCheck = false;
        }
    }
If you are using the entropy only to determine which alternative is better, so that you need the result of comparing two entropies and not their actual values, then you can eliminate some of the calculations.
You have the function: entropy(n1, n2, n) -> -n1/n*log2(n1/n) - n2/n*log2(n2/n).
Supposing that n is constant for the duration of the problem, let's multiply the expression by n:
-n1*log2(n1/n) - n2*log2(n2/n)

Next, separate the "/n" out of the logarithms:

-n1*(log2(n1) - log2(n)) - n2*(log2(n2) - log2(n))

And expand:

-n1*log2(n1) - n2*log2(n2) + (n1+n2)*log2(n)

And simplify:

-n1*log2(n1) - n2*log2(n2) + n*log2(n)

Clearly n*log2(n) is a constant and does not affect whether one entropy is greater than another, so we can discard it.
Also, multiplying by ln(2) does not alter whether one entropy is greater than another. It has the effect of changing the log2 functions into ln functions, and ln may be calculated more accurately by the math library (there is a reason it is called the "natural" logarithm):
e(n1, n2, n) -> - n1*ln(n1) - n2*ln(n2)
This function has fewer operations, so it may be computed more accurately than the entropy function, and it has the property (when calculated exactly) that e(n1, n2, n) < e(m1, m2, n) iff entropy(n1, n2, n) < entropy(m1, m2, n).
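A minimal sketch of this comparison key in C++ (the helper name xlnx and the zero-count guard are assumptions, not part of the answer):

    #include <cmath>

    // x * ln(x), taking the limit value 0 at x = 0.
    static double xlnx(double x)
    {
        return x > 0.0 ? x * std::log(x) : 0.0;
    }

    // Comparison key: for a fixed total n = n1 + n2, e(n1, n2) orders
    // candidates the same way as entropy(n1, n2, n), with fewer operations.
    double e(double n1, double n2)
    {
        return -xlnx(n1) - xlnx(n2);
    }

The zero-count guard handles n1 == 0 or n2 == 0 using the exact limit of x*ln(x), and, as the answer states, the ordering guarantee applies when the two entropies being compared share the same total n.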
c++ math floating-point floating-accuracy numerical-methods