winapi - How can I speed up my code coverage tool?
I've written a small code coverage utility to log which basic blocks are hit in an x86 executable. It runs without source code or debugging symbols for the target, and just takes a list of the basic blocks it monitors.
However, it is becoming the bottleneck in my application, which involves repeated coverage snapshots of a single executable image.
It has gone through a couple of phases as I've tried to speed it up. I started off just placing an INT3 at the start of each basic block, attaching as a debugger, and logging hits. Then I tried to improve performance by patching a counter into any block bigger than 5 bytes (the size of a JMP REL32). I wrote a small stub ('mov [blah], 1 / jmp backToTheBasicBlockWeCameFrom') in the process memory space and patch in a JMP to that. This greatly speeds things up, since there's no exception and no debugger break, but I'd like to speed things up more.
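For readers unfamiliar with the patch described above, here is a minimal sketch of how the 5-byte JMP REL32 could be encoded. The function names and the example addresses are hypothetical and not taken from the tool in question; the sketch assumes 32-bit x86, where the displacement is measured from the end of the 5-byte instruction.

```c
#include <assert.h>
#include <stdint.h>

enum { JMP_REL32_SIZE = 5 };  /* opcode E9 plus a 32-bit displacement */

/* Write "E9 rel32" into out[0..4].
   from_addr: address where the jump instruction will be placed.
   to_addr:   address the jump should land on (e.g. the stub).
   The displacement is relative to the end of the jump instruction. */
void encode_jmp_rel32(uint8_t out[JMP_REL32_SIZE],
                      uint32_t from_addr, uint32_t to_addr)
{
    /* Unsigned arithmetic wraps mod 2^32, which yields the correct
       two's-complement displacement for backward jumps as well. */
    uint32_t rel = to_addr - (from_addr + JMP_REL32_SIZE);

    out[0] = 0xE9;
    out[1] = (uint8_t)(rel & 0xFF);          /* little-endian rel32 */
    out[2] = (uint8_t)((rel >> 8) & 0xFF);
    out[3] = (uint8_t)((rel >> 16) & 0xFF);
    out[4] = (uint8_t)((rel >> 24) & 0xFF);
}

/* A block can take the jump patch only if it has at least 5 bytes to
   overwrite; shorter blocks need some other mechanism such as a
   1-byte INT3 (0xCC). */
int block_fits_jmp(uint32_t block_size)
{
    return block_size >= JMP_REL32_SIZE;
}
```

A patcher would call `block_fits_jmp` for each block in the list and only write the encoded jump where it returns nonzero.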
I'm thinking of one of the following:
1) Pre-instrument the target binary with my patched counters (at the moment I do this at runtime). I could create a new section in the PE, throw my counters in it, patch in all the hooks I need, then just read the data out of the same section with my debugger after each execution. That'll gain me some speed (about 16% according to my estimation), but there are still those pesky INT3s which I need to have in the smaller blocks, and they are really going to cripple performance.
2) Instrument the binary to include its own UnhandledExceptionFilter and handle its own INT3s, in conjunction with the above. This would mean there's no process switch from the debuggee to my coverage tool on every INT3, but there'd still be the breakpoint exception raised and the subsequent kernel transition. Am I right in thinking this wouldn't actually gain me much performance?
3) Try to do something clever using Intel's hardware branch profiling instructions. This sounds pretty awesome, but I'm not clear on how I'd go about it. Is it even possible in a Windows usermode application? I might go as far as writing a kernel-mode driver if it's very straightforward, but I'm not a kernel coder (I dabble a bit) and would probably cause myself lots of headaches. Are there any other projects using this approach? I see the Linux kernel has it to monitor the kernel itself, which makes me think that monitoring a specific usermode application will be difficult.
4) Use an off-the-shelf application. It'd need to work without any source or debugging symbols, be scriptable (so I can run it in batches), and preferably be free (I'm pretty stingy). For-pay tools aren't off the table, however (if I can spend less on a tool and increase perf enough to avoid buying new hardware, that'd be good justification).
5) Something else. I'm working in VMware on Windows XP, on fairly old hardware (Pentium 4-ish). Is there anything I've missed, or any leads I should read up on? Can I get my JMP REL32 down to less than 5 bytes (and catch smaller blocks without the need for an INT3)?
Thanks.
If you insist on instrumenting binaries, pretty much your fastest coverage is the 5-byte jump-out/jump-back trick. (You're covering standard ground for binary instrumentation tools.)
The INT 3 solution will always involve a trap. Yes, you could handle the trap in your own space instead of the debugger's space, and that would speed it up, but it will never be close to competitive with the jump-out/back patch. You may need it as a backup anyway, if the function you are instrumenting happens to be shorter than 5 bytes (e.g., "inc eax/ret"), because then you don't have 5 bytes that you can patch.
What you might do to optimize things a little is examine the patched code. Without such examination, the original code:

          instrn1
          instrn2
          instrnN
    next:

patched, in general, to look like this:

          jmp patch
          xxx
    next:

generally has to have a patch like this:

    patch: pushf
           inc count
           popf
           instrn1
           instrn2
           instrnN
           jmp next

If all you want is coverage, you don't need to increment, and that means you don't need to save the flags:

    patch: mov byte ptr covered,1
           instrn1
           instrn2
           instrnN
           jmp next
You should use a byte rather than a word to keep the patch size down. You should also align the patch on a cache line so that the processor doesn't have to fetch 2 cache lines to execute the patch.
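As a rough illustration of the flag-free coverage patch and the alignment advice, the sketch below assembles such a stub into a byte buffer. Everything named here (`build_coverage_stub`, `align_to_cache_line`, the 64-byte line size) is an assumption made for the example. It also assumes 32-bit x86, where `C6 05 disp32 imm8` encodes `mov byte ptr [disp32], imm8`, and it assumes the displaced instructions are position-independent, so they can be copied verbatim; relative branches or calls would need fixing up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { CACHE_LINE = 64 };  /* assumed line size; check the target CPU */

/* Round an address up to the next cache-line boundary. */
uint32_t align_to_cache_line(uint32_t addr)
{
    return (addr + (CACHE_LINE - 1)) & ~(uint32_t)(CACHE_LINE - 1);
}

/* Store a 32-bit value in little-endian byte order. */
void put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Emit into out:
       mov byte ptr [flag_addr], 1    ; C6 05 disp32 01  (7 bytes)
       <displaced original bytes>     ; copied unchanged
       jmp resume_addr                ; E9 rel32         (5 bytes)
   stub_addr is the address the stub will occupy in the target.
   Returns the number of bytes written (displaced_len + 12). */
size_t build_coverage_stub(uint8_t *out, uint32_t stub_addr,
                           uint32_t flag_addr,
                           const uint8_t *displaced, size_t displaced_len,
                           uint32_t resume_addr)
{
    size_t n = 0;

    out[n++] = 0xC6;                    /* MOV r/m8, imm8 */
    out[n++] = 0x05;                    /* ModRM: absolute disp32 */
    put_u32(out + n, flag_addr);
    n += 4;
    out[n++] = 0x01;                    /* the byte value stored */

    memcpy(out + n, displaced, displaced_len);
    n += displaced_len;

    out[n++] = 0xE9;                    /* JMP rel32 */
    /* rel32 is measured from the end of the jump instruction. */
    put_u32(out + n, resume_addr - (stub_addr + (uint32_t)n + 4));
    n += 4;

    return n;
}
```

Placing each stub at `align_to_cache_line(next_free_addr)` trades some memory for the guarantee that a short stub never straddles two lines.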
If you insist on counting, you can analyze instrn1/2/N to see whether they care about the flags that "inc" fools with, and only pushf/popf if needed, or you can insert the increment between two instructions in the patch that don't care. You must be analyzing these to some extent anyway to handle complications such as instrnN being a ret; with that analysis you can generate a better patch (e.g., one that doesn't "jmp back").
You may find that using add count,1 is faster than inc count, because it avoids partial condition code updates and the consequent pipeline interlocks. This will affect your condition-code impact analysis a bit, since inc doesn't set the carry bit and add does.
Another possibility is PC sampling. Don't instrument the code at all; just interrupt the thread periodically and take a sample of the PC value. If you know where the basic blocks are, a PC sample anywhere in a basic block is evidence that the entire block got executed. This won't necessarily give precise coverage data (you may miss critical PC values), but the overhead will be pretty low.
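The bookkeeping side of PC sampling could look roughly like the sketch below, which maps a sampled PC to the basic block containing it. The `BasicBlock` layout and function names are hypothetical. The sketch assumes the block list is sorted by start address with no overlaps, and it leaves out how the sample itself is obtained (for example by suspending the thread and reading its context).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t start;  /* address of the first byte of the block */
    uint32_t size;   /* length of the block in bytes */
} BasicBlock;

/* Binary search over blocks sorted by start address.
   Returns the index of the block containing pc, or -1 if pc falls
   outside every known block. */
int find_block(const BasicBlock *blocks, size_t count, uint32_t pc)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < blocks[mid].start) {
            hi = mid;                       /* pc lies before this block */
        } else if (pc - blocks[mid].start >= blocks[mid].size) {
            lo = mid + 1;                   /* pc lies past this block */
        } else {
            return (int)mid;                /* start <= pc < start + size */
        }
    }
    return -1;
}

/* Mark the block containing a sampled pc as covered, if there is one. */
void record_sample(const BasicBlock *blocks, size_t count,
                   uint8_t *covered, uint32_t pc)
{
    int i = find_block(blocks, count, pc);
    if (i >= 0) {
        covered[i] = 1;
    }
}
```

Each sample costs one O(log n) lookup in the coverage tool rather than any work in the target, which is where the low overhead comes from.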
If you are willing to patch source code, you can do better: just insert "covered[i]=true;" at the beginning of the ith basic block, and let the compiler take care of all the various optimizations. No patches needed. The really cool part of this is that if you have basic blocks inside nested loops and you insert source probes like this, the compiler will notice that the probe assignments are idempotent with respect to the loop and lift the probe out of the loop. Voilà, zero probe overhead inside the loop. What more could you want?
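A small sketch of what such source probes might look like in C follows. The function, the array size, and the block numbering are invented for the example. Whether a given compiler actually hoists the loop probes depends on the compiler and optimization level, so treat that part as something to verify in the generated code.

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_BLOCKS = 4 };        /* one flag per basic block */
bool covered[NUM_BLOCKS];

/* Sum the positive elements of v, with a probe at the start of each
   basic block. */
int sum_positive(const int *v, int n)
{
    covered[0] = true;              /* block 0: function entry */
    int sum = 0;

    for (int i = 0; i < n; i++) {
        covered[1] = true;          /* block 1: loop body; same store on
                                       every iteration, so it is a
                                       candidate for hoisting */
        if (v[i] > 0) {
            covered[2] = true;      /* block 2: positive-element branch */
            sum += v[i];
        }
    }

    covered[3] = true;              /* block 3: function exit */
    return sum;
}
```

After a test run, any index still false in `covered` identifies a block that was never reached.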
winapi testing x86 code-coverage