Tuesday, 15 July 2014

performance - Fast percentile in C++ -



performance - Fast percentile in C++ -

my programme calculates monte carlo simulation value-at-risk metric. simplify much possible, have:

1/ simulated daily cashflows 2/ sample of possible 1-year cashflow, need draw 365 random daily cashflows , sum them

hence, daily cashflows empirically given distrobution function sampled 365 times. this,

1/ sort daily cashflows array called *this->distro* 2/ calculate 365 percentiles corresponding random probabilities

i need simulation of yearly cashflow, say, 10k times population of simulated yearly cashflows work with. having distribution function of daily cashflows prepared, sampling like...

for ( unsigned int idxsim = 0; idxsim < _g.xsimulationcount; idxsim++ ) { generatedval = 0.0; ( register unsigned int idxday = 0; idxday < 365; idxday ++ ) { prob = (flt_type)fastrand(); // prob [0,1] didx = prob * dmaxdistroindex; // scale prob distro function size // index distro array _floor = ((flt_type)(long)didx); // fast version of floor _ceil = _floor + 1.0f; // 'fast' ceil:) iidx1 = (unsigned int)( _floor ); iidx2 = iidx1 + 1; // interpolation per se generatedval += this->distro[iidx1]*(_ceil - didx ); generatedval += this->distro[iidx2]*(didx - _floor); } this->yearlycashflows[idxsim] = generatedval ; }

the code within of both for cycles linear interpolation. if, usd 1000 corresponds prob=0.01, usd 10000 corresponds prob=0.1 if don't have empipirical number p=0.05 want usd 5000 interpolation.

the question: code runs correctly, though profiler says programme spends cca 60% of runtime on interpolation per se. question is, how can create task faster? sample runtimes reported vtune specific lines follows:

prob = (flt_type)fastrand(); // 0.727s didx = prob * dmaxdistroindex; // 1.435s _floor = ((flt_type)(long)didx); // 0.718s _ceil = _floor + 1.0f; // - iidx1 = (unsigned int)( _floor ); // 4.949s iidx2 = iidx1 + 1; // - // interpolation per se generatedval += this->distro[iidx1]*(_ceil - didx ); // - generatedval += this->distro[iidx2]*(didx - _floor); // 12.704s

dashes mean profiler reports no runtimes lines.

any hint appreciated. daniel

edit: both c.fogelklou , msalters have pointed out great enhancements. best code in line c.fogelklou said

converter = distrodimension / (flt_type)(rand_max + 1) ( unsigned int idxsim = 0; idxsim < _g.xsimulationcount; idxsim++ ) { generatedval = 0.0; ( register unsigned int idxday = 0; idxday < 365; idxday ++ ) { didx = (flt_type)fastrand() * converter; iidx1 = (unsigned long)didx); _floor = (flt_type)iidx1; generatedval += this->distro[iidx1] + this->diffs[iidx1] *(didx - _floor); } }

and best have along msalter's lines

normalizer = 1.0/(flt_type)(rand_max + 1); ( unsigned int idxsim = 0; idxsim < _g.xsimulationcount; idxsim++ ) { generatedval = 0.0; ( register unsigned int idxday = 0; idxday < 365; idxday ++ ) { didx = (flt_type)fastrand()* normalizer ; iidx1 = fastrand() % _g.xdaycount; generatedval += this->distro[iidx1]; generatedval += this->diffs[iidx1]*didx; } }

the sec code approx. 30 percent faster. now, of 95s of total runtime, lastly line consumes 68s. lastly 1 line consumes 3.2s hence double*double multiplication must devil. thought of sse - saving lastly 3 operands array , carry out vector multiplication of this->diffs[i]*didx[i] , add together this->distro[i] code ran 50 percent slower. hence, think nail wall.

many all. d.

this proposal little optimization, removing need ceil, 2 casts, , 1 of multiplies. if running on fixed point processor, explain why multiplies , casts between float , int taking long. in case, seek using fixed point optimizations or turning on floating point in compiler if cpu supports it!

for ( unsigned int idxsim = 0; idxsim < _g.xsimulationcount; idxsim++ ) { generatedval = 0.0; ( register unsigned int idxday = 0; idxday < 365; idxday ++ ) { prob = (flt_type)fastrand(); // prob [0,1] didx = prob * dmaxdistroindex; // scale prob distro function size // index distro array iidx1 = (long)didx; _floor = (flt_type)iidx1; // fast version of floor iidx2 = iidx1 + 1; // interpolation per se { const flt_type diff = this->distro[iidx2] - this->distro[iidx1]; const flt_type interp = this->distro[iidx1] + diff * (didx - _floor); generatedval += interp; } } this->yearlycashflows[idxsim] = generatedval ; }

c++ performance percentile

No comments:

Post a Comment