performance - Any good examples of a Pig Accumulator interface implementation that works?
I have a requirement to read millions of records from HDFS, enrich them, and store them as XML files in batches of 10k records per XML file.
I have been experimenting with the Accumulator interface, and set pig.accumulative.batchsize to 2 for testing.
However, the method that gets invoked is "exec()" instead of the Accumulator's "accumulate()" method.
The outline of the UDF class is as follows:
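    import java.io.IOException;
    import org.apache.pig.Accumulator;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class MyAccUdf extends EvalFunc<Tuple> implements Accumulator<Tuple> {
        // Called with the whole bag when Pig does not run in accumulative mode.
        public Tuple exec(Tuple input) throws IOException {
            //..
        }
        // Called once per batch when Pig runs in accumulative mode.
        public void accumulate(Tuple b) throws IOException {
            //...
        }
        // Called after getValue() to reset state for the next key.
        public void cleanup() {
            //..
        }
        // Called after the last batch to fetch the result.
        public Tuple getValue() {
            //..
        }
    }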
The Accumulator interface is not guaranteed to be exercised every time. The book Programming Pig outlines when the Accumulator interface won't be honoured:
Whenever possible, Pig will choose to use the Algebraic implementation of a UDF over the Accumulator. This is because Accumulator helps avoid spilling records to disk, but it does not reduce network cost or help balance the reducers. If all UDFs in a foreach implement Accumulator and at least one does not implement Algebraic, Pig will use the accumulator. If at least one does not use Accumulator, Pig will not use the accumulator. This is because Pig already has to read the entire bag into memory in order to pass it to the UDF that does not implement Accumulator, so there is no longer any value in the accumulator.
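The rule in that quote can be written down as a small decision function. This is only a sketch of the quoted rule, not Pig's planner code: UdfTraits, ExecutionMode and ModeChooser are hypothetical names made up for illustration, and the real planner may check further conditions that the quote does not mention.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of one UDF used inside a FOREACH. It only records
// which interfaces the UDF implements; it is not part of the Pig API.
final class UdfTraits {
    final boolean algebraic;
    final boolean accumulator;

    UdfTraits(boolean algebraic, boolean accumulator) {
        this.algebraic = algebraic;
        this.accumulator = accumulator;
    }
}

// The three ways a UDF can end up being driven, per the quoted passage.
enum ExecutionMode { ALGEBRAIC, ACCUMULATOR, EXEC }

final class ModeChooser {
    // Simplified reading of the quoted rule for all UDFs in one FOREACH.
    static ExecutionMode choose(List<UdfTraits> udfs) {
        if (udfs.isEmpty()) {
            return ExecutionMode.EXEC; // nothing to optimise
        }
        boolean allAlgebraic = true;
        boolean allAccumulator = true;
        for (UdfTraits u : udfs) {
            allAlgebraic &= u.algebraic;
            allAccumulator &= u.accumulator;
        }
        if (allAlgebraic) {
            return ExecutionMode.ALGEBRAIC;   // preferred whenever possible
        }
        if (allAccumulator) {
            return ExecutionMode.ACCUMULATOR; // all accumulate, one is not algebraic
        }
        return ExecutionMode.EXEC;            // one UDF needs the whole bag anyway
    }

    public static void main(String[] args) {
        // An accumulator-only UDF next to a UDF that implements neither interface.
        List<UdfTraits> mixed = Arrays.asList(
                new UdfTraits(false, true),
                new UdfTraits(false, false));
        System.out.println(choose(mixed)); // prints EXEC
    }
}
```

Under this reading, a UDF that implements only Accumulator is driven through accumulate() when it is alone in the FOREACH, but a single neighbouring UDF that implements neither interface is enough to send every UDF in that FOREACH back to exec().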
Your UDF will have to implement the logic in both exec() and accumulate(). A simple example of this duplication of functionality can be found in the COUNT UDF.
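If you would rather not maintain the same logic twice, one option is to write it once in accumulate() and have exec() call accumulate(), getValue() and cleanup() in turn. The sketch below shows only that delegation pattern and runs without Pig on the classpath: BatchAccumulator is a hypothetical stand-in that mirrors the three method names from the question's outline, and a List<String> stands in for the bag of records. In a real UDF these would be Pig's Accumulator interface and a Tuple holding a bag. Pig also ships an AccumulatorEvalFunc base class that, as far as I know, wires exec() up in much the same way, so check whether your Pig version has it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in mirroring the three methods named in the question
// (accumulate, getValue, cleanup). It is NOT Pig's own interface.
interface BatchAccumulator<T> {
    void accumulate(List<String> batch);
    T getValue();
    void cleanup();
}

// Counts records. The counting logic lives only in accumulate();
// exec() reuses it so the two entry points cannot drift apart.
final class CountingUdf implements BatchAccumulator<Long> {
    private long count = 0;

    @Override
    public void accumulate(List<String> batch) {
        count += batch.size();
    }

    @Override
    public Long getValue() {
        return count;
    }

    @Override
    public void cleanup() {
        count = 0; // reset state before the next key
    }

    // Non-accumulative path: the whole bag arrives in a single call.
    Long exec(List<String> wholeBag) {
        accumulate(wholeBag);
        Long result = getValue();
        cleanup();
        return result;
    }
}

final class CountingUdfDemo {
    public static void main(String[] args) {
        CountingUdf udf = new CountingUdf();

        // Accumulative path: records arrive in batches of at most 2,
        // similar to running with pig.accumulative.batchsize set to 2.
        udf.accumulate(Arrays.asList("r1", "r2"));
        udf.accumulate(Arrays.asList("r3"));
        System.out.println(udf.getValue()); // prints 3
        udf.cleanup();

        // Non-accumulative path: same answer from a single exec() call.
        System.out.println(udf.exec(Arrays.asList("r1", "r2", "r3"))); // prints 3
    }
}
```

Whichever path Pig picks, both entry points now return the same result, which also makes it easier to notice which one is actually being called when you add logging to each.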
performance apache-pig udf