java - Can we customize the InputSplit size for the FileInputFormat?
Let's consider a MapReduce job that spawns 1000 map tasks, with:

block size: 128 MB
minimum split size: 1 MB
maximum split size: 256 MB

The block size seems to be the limiting value. Can we increase the split size beyond the block size?
This is the function from FileInputFormat.java:

    protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
        // A split is at least minSize, and at most the smaller of goalSize and blockSize.
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }
Based on the above function, a minimum split size greater than the block size would do what I want. Can anyone throw some light on any side effects of setting the minimum split size this way?
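For example, the change I have in mind would look something like this in the job driver (the property name is the old mapred API one that matches the computeSplitSize() above; the input path is a placeholder):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf();
    FileInputFormat.setInputPaths(conf, new Path("/input"));  // placeholder path

    // Push the minimum split size above the 128 MB block size,
    // so computeSplitSize() returns 256 MB instead of the block size.
    conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);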
For this you have to understand goalSize, which refers to the total input size divided by JobConf.getNumMapTasks(). This computation means that:

- a split will not be smaller than minSize;
- a split will not be larger than the lesser of goalSize and blockSize.

With this in light, you can understand that the ideal split size is exactly one block size, as it allows the framework to provide data locality to the task that processes the split. (Source: Pro Hadoop)
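To make the interplay concrete, here is a small self-contained sketch of the same formula using the numbers from the question (the total input size, and hence goalSize, is an assumption for illustration):

    public class SplitSizeDemo {
        // Same formula as the old-API FileInputFormat.computeSplitSize() quoted above.
        static long computeSplitSize(long goalSize, long minSize, long blockSize) {
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024;
            long blockSize = 128 * mb;
            // Assume ~200 GB of input across 1000 map tasks: goalSize is about 200 MB.
            long goalSize = 200 * mb;

            // Default 1 MB minimum: the block size wins, splits are 128 MB.
            System.out.println(computeSplitSize(goalSize, 1 * mb, blockSize) / mb);   // 128

            // Minimum raised to 256 MB: minSize overrides the block size.
            System.out.println(computeSplitSize(goalSize, 256 * mb, blockSize) / mb); // 256
        }
    }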
If you want to increase the split size beyond the block size, it means each mapper will need to do remote reads to fetch data that is not local to it, which might be less efficient. Unless you're trying to create huge splits, I doubt this will have a critical impact on performance. I would still advise keeping the default split size whenever possible, unless you have a solid use case for which it won't work.
java hadoop mapreduce