Sunday, 15 June 2014

java - Can we customize the InputSplit size for the FileInputFormat?



Let's consider a MapReduce job that spawns 1000 map tasks, with block size 128 MB, minimum split size 1 MB, and maximum split size 256 MB.

The block size seems to be the limiting value. Can we increase the split size beyond the block size?

This is the relevant function from FileInputFormat.java:

protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
}

Based on the function above, a minimum split size greater than the block size would do what we want. Can anyone shed some light on the side effects of setting the minimum split size this way?

First, you have to understand that goalSize refers to the total input size divided by JobConf.getNumMapTasks(). This computation means that:

A split will be no smaller than the remaining data in the file, or minSize. A split will be no larger than the lesser of goalSize and blockSize.
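The two rules above can be checked with a small standalone sketch that mirrors the quoted formula. The class name and the 200 GB total input size are hypothetical, chosen so that 1000 map tasks give a goalSize of about 200 MB:

```java
// Standalone illustration of the split-size rule; mirrors the
// computeSplitSize formula from FileInputFormat, not the Hadoop source itself.
public class SplitSizeDemo {
    static final long MB = 1024L * 1024L;

    // Mirrors: Math.max(minSize, Math.min(goalSize, blockSize))
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;
        // Hypothetical 200 GB of input spread over 1000 map tasks:
        long goalSize = (200L * 1024 * MB) / 1000; // ~200 MB per task

        // With the default-ish 1 MB minimum, the block size caps the split.
        System.out.println(computeSplitSize(goalSize, 1 * MB, blockSize) / MB);   // 128

        // With minSize raised to 256 MB, minSize overrides the block size.
        System.out.println(computeSplitSize(goalSize, 256 * MB, blockSize) / MB); // 256
    }
}
```

So raising minSize above the block size is indeed the lever that pushes a split past one block.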

With this in mind, you can understand that the ideal split size is exactly one block, since it allows the framework to provide data locality for the task that processes the split. (Source: Pro Hadoop)

If you want to increase the split size beyond the block size, each mapper will need remote reads for the data that is not local, which might be less efficient. Unless you're trying to create huge splits, I doubt this would have a critical impact on performance. I would still advise keeping the default split size whenever possible, unless you have a solid use case for which that won't work.
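If you do decide to experiment, the knob is the minimum split size, exposed as the `mapreduce.input.fileinputformat.split.minsize` property in the newer MapReduce API (`mapred.min.split.size` in the older one). A sketch of setting it from the command line, assuming a hypothetical job jar `myjob.jar` whose driver accepts `-D` options via ToolRunner/GenericOptionsParser:

```shell
# Raise the minimum split size to 256 MB (268435456 bytes), so that
# computeSplitSize returns 256 MB even though the block size is 128 MB.
# Jar, driver, and paths here are placeholders.
hadoop jar myjob.jar MyDriver \
    -D mapreduce.input.fileinputformat.split.minsize=268435456 \
    /input/path /output/path
```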

java hadoop mapreduce
