Saturday, 15 August 2015

c++ - How do I calculate the value of information gain in order to reduce the floating-point approximation errors? -



c++ - How do I calculate the value of information gain in order to reduce the floating-point approximation errors? -

i have dataset containing features belong 2 class labels denoted 1 , 2. dataset procedded in order build decision tree: during construction of tree, need calculate info gain find best partitioning of dataset.

let there n1 features associated label 1, , n2 features associated label 2, entropy can calculated next formula:

entropy = - (n1/n)*log2(n1/n) - (n2/n)*log2(n2/n), n = n1 + n2

i need calculate 3 values of entropy in order obtain info gain:

entropybefore, entropy before partitioning of current dataset; entropyleft, entropy of left split after partitioning; entropyright, entropy of right split after partitioning.

so, info gain equal entropybefore - (s1/n)*entropyleft - (s2/n)*entropyright, s1 number of features of class 1 belonging split 1, , s2 number of features of class 2 belonging split 2.

how calculate value of info gain in order cut down floating-point approximation errors? when apply above formulas in cases in info gain must zero, calculated value equal little negative value.

update (sample code)

double n = static_cast<double>(this->rows()); // rows count of dataset double entropybefore = this->entropy(); // current entropy (before performing split) bool firstcheck = true; double bestsplitig; each possible split { // ... pair<dataset,dataset> splitpair = split(...,...); double s1 = splitpair.first.rows(); double s2 = splitpair.second.rows(); double entropyleft = splitpair.first.entropy(); double entropyright = splitpair.second.entropy(); double splitig = entropybefore - (s1/n*entropyleft + s2/n*entropyright); if (firstcheck || splitig > bestsplitig) { bestsplitig = splitig; // ... firstcheck = false; } }

if using entropy determine alternative better, need result of comparing 2 entropies , not need actual values, can eliminate calculations.

you have function: entropy(n1, n2, n) -> - n1/n*log2(n1/n) - n2/n*log2(n2/n).

supposing n constant duration of problem, let’s multiply look n:

n1*log2(n1/n)-n2*log2(n2/n)

next, separate “/n” out of logarithms:

n1*(log2(n1)-log2(n)) - n2*(log2(n2)-log2(n))

and expand:

n1*log2(n1) - n2*log2(n2) - (n1+n2)*log2(n)

and simplify:

n1*log2(n1) - n2*log2(n2) - n*log2(n)

clearly n*log2(n) constant , not impact whether 1 entropy greater can discard it.

also, multiply ln(2), not alter whether 1 entropy greater another. has effect of changing log2 functions ln functions, , ln may calculated more accurately math library (there reason “natural” logarithm):

e(n1, n2, n) -> - n1*ln(n1) - n2*ln(n2)

this function has fewer operations, may computed more accurately entropy function, , has property (when calculated exactly) e(n1, n2, n) < e(m1, m2, n) iff entropy(n1, n2, n) < entropy(m1, m2, n).

c++ math floating-point floating-accuracy numerical-methods

No comments:

Post a Comment