R vs Pentaho Spoon as an ETL tool
Background (sorry it's long):
I've been tasked with maintaining an ETL that collects a variety of online advertising data, around 20-30 MB a day, and appends it to tables in MySQL. Outside contractors built the ETL with Pentaho Spoon (Kitchen, Kettle?). The ETL consists of 250 jobs and transformations (.ktr, .kjb), each with 5 to 25 steps. Something commonly goes wrong in this big process. I've found that writing R scripts to transform and load is much more efficient. In fact, I think the ETL could be reduced to under 1000 lines of code besides the calls to RMySQL (i.e. plyr!). Perhaps Python could be used to extract the data from the web.
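To make the "transform and load in a few lines" idea concrete, here is a minimal sketch in Python rather than R. Everything in it is an assumption for illustration: the column names, the `daily_stats` table, and the use of the standard-library `sqlite3` module as a stand-in for MySQL. It is not the actual ETL described above.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw rows: (day, campaign, clicks, cost). Not real data.
raw_rows = [
    ("2012-05-01", "campaign_a", 10, 1.50),
    ("2012-05-01", "campaign_a", 5, 0.75),
    ("2012-05-01", "campaign_b", 7, 2.00),
]


def transform(rows):
    """Sum clicks and cost per (day, campaign), similar to a group-and-summarise."""
    totals = defaultdict(lambda: [0, 0.0])
    for day, campaign, clicks, cost in rows:
        totals[(day, campaign)][0] += clicks
        totals[(day, campaign)][1] += cost
    return [(day, campaign, t[0], t[1]) for (day, campaign), t in sorted(totals.items())]


def load(conn, rows):
    """Append the transformed rows to a table (sqlite3 stands in for MySQL here)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_stats "
        "(day TEXT, campaign TEXT, clicks INTEGER, cost REAL)"
    )
    conn.executemany("INSERT INTO daily_stats VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(raw_rows))
```

Whether a real 250-job ETL collapses this neatly depends on how much of it is plain aggregation of this kind.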
My use of R has led to some resistance. The computer programmers who designed the ETL don't know R, so I couldn't easily be replaced if I leave, and a lot of time was invested in the Spoon ETL. Also, a layman can more easily follow the steps visually in Spoon than in R scripts. For my part, I think we are getting bogged down by the ETL. However, I don't have a big say in the matter, as I don't have a background in computer science.
Please comment if you have any insights on the following. Please know that I have been researching this for months and have read many opinions, but nothing as concise or reliable as what SO provides:
R has been called not scalable at my company. I think the opposite, because of its logging capabilities. Spoon has limited logging output, whereas R scripts can be sinked to a daily log. Fixing and avoiding mistakes in .ktrs is tedious, but it is easy by setting flags and/or searching through an R log. Thoughts on this?
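For readers unfamiliar with the approach, a daily log with searchable flags might look roughly like the sketch below. It is written in Python rather than R (in R the same idea would typically go through `sink()`), and the file naming scheme, the `FLAG:` marker, and the log messages are all hypothetical.

```python
import datetime
import logging
import os
import tempfile


def open_daily_log(log_dir, day=None):
    """Return a logger that writes to one file per day, plus that file's path."""
    day = day or datetime.date.today()
    path = os.path.join(log_dir, f"etl_{day.isoformat()}.log")
    logger = logging.getLogger(f"etl.{day.isoformat()}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.FileHandler(path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger, path


def find_flagged(path, flag="FLAG:"):
    """Return every log line containing the (hypothetical) flag marker."""
    with open(path) as fh:
        return [line.rstrip("\n") for line in fh if flag in line]


log_dir = tempfile.mkdtemp()
logger, log_path = open_daily_log(log_dir, datetime.date(2012, 5, 1))
logger.info("loaded 1200 rows into daily_stats")
logger.warning("FLAG: campaign_b has no cost data")
flagged = find_flagged(log_path)
```

Here `flagged` holds only the lines that need attention, which is the "search through the log" step described above.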
This leads to the big picture question. What is the point of ETLs like Pentaho? The post "Do I need an ETL?" leads me to believe that if you use R or any other so-called OOL, there is no reason to have a tool like Pentaho. Can someone please confirm if this is so? I need a second opinion here. If so, who uses tools like Pentaho? People without a programming background, or someone else? I see a fair amount of Pentaho questions on SO.
It is true that a lot more people use R than Pentaho, right? http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html makes it seem so. To be honest, I was surprised Pentaho was 5th, which makes me doubly wonder who uses Pentaho and whether my doubts about its use in my work setting are misplaced.
Thanks for any responses. I don't mean any condescension towards Spoon or Spoon users; I'm just confused and in need of outside opinions.
R as an ETL tool? That's a new one, but whatever floats your boat.
I would say though, that if you can get 250 jobs and transformations down to under 1000 lines of R, then your ETL is poorly written.
Along with that, you have to think about supportability and scalability. Both of these, I'd imagine, are far easier with a graphical tool like Spoon than with R code.
Personally I think you are misguided and the question you ask is poorly written, but that's a different argument.
Regarding your points: PDI's logging is fine, and you can log pretty much anything you like, to one big database table if you want a consolidated log.
ETLs won't be going away. Even with the advent of the love of unstructured data storage pools like HDFS, I think data analysis can be done outside R, but if you want reporting or OLAP on top of your data, it still needs transforming regardless.
Is it true that more people use R vs Pentaho? What sort of question is that? By Pentaho I assume you mean PDI? How can they ever be compared? A data analysis tool vs an ETL tool, and you want to count users? Eh? If on the other hand you mean R vs Pentaho as a whole, then I'd guess no. You are looking at a study on R vs Weka and making it fit your ETL argument. That doesn't wash in a month of Sundays.
==EDIT== Okay, so you have around 1000 lines of R and Python code currently. Your boss's requirements expand and the code grows over time, and because you are trying to hit deadlines, the new code isn't written as cleanly or documented as well as the code you have in place. Over time it grows to 5000 lines plus a few Python scripts. One day you get hit by a bus, and a new person has to come in and manage your code... Where do they start, and how do they make changes?
Virtually anyone with a modicum of data experience could make a change to a PDI ETL should they be required to. It would take plenty of in-depth R knowledge to make changes to what you have done.
ETL tools are designed to be quick and easy to use, and they offer far more than R can provide in terms of data connectivity to different systems (non-DB or file based, for example), although I guess that is why people resort to Python etc. That said, there is room for both; there is an R plugin for PDI kicking around in the community that I've seen demonstrated.
On top of that, I've seen plenty of TSQL-to-ETL migrations over the years to know from experience that, even though maintaining your ETL in code may seem practical in the short term, in the long term it brings more pain.
On the other hand, if you can code 250 PDI transformations down to 1000 lines of R, then your ETL has been bloated through bad design by your predecessor.
If you'd like me to give an opinion on your existing PDI ETL structure, that can be arranged.
Tom
r pentaho