Thursday, 15 January 2015

Objective-C - C versus vDSP versus NEON - How could NEON be as slow as C?

How could NEON be as slow as C?

I have been trying to build a fast histogram function that buckets incoming values by range, assigning each value the range threshold it is closest to. It is applied to images, so it has to be fast (assume an image array of 640x480, about 300,000 elements). The histogram range numbers are multiples of a constant (e.g. 0, 25, 50, 75, 100). The inputs are floats and the final outputs are integers.

I tested the following versions in Xcode by opening a new empty project (no app delegate) and using only the main.m file. I removed all linked libraries with the exception of Accelerate.

Here is the C implementation. An older version had plenty of ifs, but this is the final optimized logic. It took 11 s 300 ms.

#import <Foundation/Foundation.h>
#include <math.h>

int main(int argc, char *argv[])
{
    NSLog(@"starting");
    int sizeOfArray = 300000;
    float *inputArray = (float *)malloc(sizeof(float) * sizeOfArray);
    int *outputArray = (int *)malloc(sizeof(int) * sizeOfArray);

    for (int i = 0; i < sizeOfArray; ++i) {
        inputArray[i] = 88.5;
    }

    // assume the range is [0,25,50,75,100]
    int lcd = 25;

    for (int j = 0; j < 1000; ++j) // time interval
    {
        for (int i = 0; i < sizeOfArray; ++i) {
            // a 60.5 should give 50, an 88.5 should give 100
            outputArray[i] = roundf(inputArray[i] / lcd) * lcd;
        }
    }
    NSLog(@"done");
}

Here is the vDSP implementation. Even with all of the tedious conversions from floating point to integer and back, it took 6 s! A 50% improvement!

// vDSP implementation
#import <Foundation/Foundation.h>
#import <Accelerate/Accelerate.h>

int main(int argc, char *argv[])
{
    NSLog(@"starting");
    int sizeOfArray = 300000;
    float *inputArray = (float *)malloc(sizeof(float) * sizeOfArray);
    float *outputArrayF = (float *)malloc(sizeof(float) * sizeOfArray); // vDSP requires the input and output types to match
    int *outputArray = (int *)malloc(sizeof(int) * sizeOfArray);        // value rounded to the nearest integer
    float *finalOutputArrayF = (float *)malloc(sizeof(float) * sizeOfArray);
    int *finalOutputArray = (int *)malloc(sizeof(int) * sizeOfArray);   // to compare apples to apples, the final output is int in every scenario

    for (int i = 0; i < sizeOfArray; ++i) {
        inputArray[i] = 37.0; // this should produce a final number of 25; 37.5, on the other hand, would produce 50
    }

    for (int j = 0; j < 1000; ++j) // time interval
    {
        // assume the range is [0,25,50,75,100]
        float lcd = 25.0f;
        // divide by the lcd
        vDSP_vsdiv(inputArray, 1, &lcd, outputArrayF, 1, sizeOfArray);
        // round to the nearest integer
        vDSP_vfixr32(outputArrayF, 1, outputArray, 1, sizeOfArray);
        // must convert the int back to float (cannot cast), then multiply by the scalar -
        // this step has the effect of rounding the number to the nearest lcd
        vDSP_vflt32(outputArray, 1, outputArrayF, 1, sizeOfArray);
        vDSP_vsmul(outputArrayF, 1, &lcd, finalOutputArrayF, 1, sizeOfArray);
        vDSP_vfix32(finalOutputArrayF, 1, finalOutputArray, 1, sizeOfArray);
    }
    NSLog(@"done");
}
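As an aside, the rounding step does not have to leave floating point at all: vForce (also part of Accelerate) has vvnearbyintf, which rounds a float array to integral values and could stand in for the vDSP_vfixr32 / vDSP_vflt32 pair. A sketch of that variant (untested here, buffer names illustrative):

#import <Accelerate/Accelerate.h>

// Sketch: same rounding-to-nearest-multiple, but the round stays in floating
// point via vForce's vvnearbyintf, so the float -> int conversion happens only
// once, at the very end.
static void bucketWithVForce(const float *input, float *scratch,
                             float *outputF, int *output, int count)
{
    float lcd = 25.0f;
    vDSP_vsdiv(input, 1, &lcd, scratch, 1, count);    // divide by lcd
    vvnearbyintf(outputF, scratch, &count);           // round to nearest integral value (still float)
    vDSP_vsmul(outputF, 1, &lcd, outputF, 1, count);  // multiply back by lcd
    vDSP_vfix32(outputF, 1, output, 1, count);        // single float -> int conversion at the end
}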

Here is the NEON implementation. It is my first, so play nice! It was slower than vDSP, taking 9 s 300 ms, which did not make sense to me. Either vDSP is better optimized than NEON, or I am doing something wrong.

// NEON implementation
#import <Foundation/Foundation.h>
#include <arm_neon.h>

int main(int argc, char *argv[])
{
    NSLog(@"starting");
    int sizeOfArray = 300000;
    float *inputArray = (float *)malloc(sizeof(float) * sizeOfArray);
    float *finalOutputArrayF = (float *)malloc(sizeof(float) * sizeOfArray);

    for (int i = 0; i < sizeOfArray; ++i) {
        inputArray[i] = 37.0; // this should produce a final number of 25; 37.5, on the other hand, would produce 50
    }

    for (int j = 0; j < 1000; ++j) // time interval
    {
        float32x4_t c0, c1, c2, c3;
        uint32x4_t  e0, e1, e2, e3; // comparison results are unsigned masks
        float32x4_t f0, f1, f2, f3;

        // ranges of the histogram buckets
        float32x4_t buckets0 = vdupq_n_f32(0);
        float32x4_t buckets1 = vdupq_n_f32(25);
        float32x4_t buckets2 = vdupq_n_f32(50);
        float32x4_t buckets3 = vdupq_n_f32(75);
        float32x4_t buckets4 = vdupq_n_f32(100);

        // midpoints of the ranges
        float32x4_t thresholds1 = vdupq_n_f32(12.5);
        float32x4_t thresholds2 = vdupq_n_f32(37.5);
        float32x4_t thresholds3 = vdupq_n_f32(62.5);
        float32x4_t thresholds4 = vdupq_n_f32(87.5);

        for (int i = 0; i < sizeOfArray; i += 16) {
            c0 = vld1q_f32(&inputArray[i]);      // load
            c1 = vld1q_f32(&inputArray[i + 4]);  // load
            c2 = vld1q_f32(&inputArray[i + 8]);  // load
            c3 = vld1q_f32(&inputArray[i + 12]); // load

            f0 = buckets0;
            f1 = buckets0;
            f2 = buckets0;
            f3 = buckets0;

            // register 0
            e0 = vcgtq_f32(c0, thresholds1); f0 = vbslq_f32(e0, buckets1, f0);
            e0 = vcgtq_f32(c0, thresholds2); f0 = vbslq_f32(e0, buckets2, f0);
            e0 = vcgtq_f32(c0, thresholds3); f0 = vbslq_f32(e0, buckets3, f0);
            e0 = vcgtq_f32(c0, thresholds4); f0 = vbslq_f32(e0, buckets4, f0);

            // register 1
            e1 = vcgtq_f32(c1, thresholds1); f1 = vbslq_f32(e1, buckets1, f1);
            e1 = vcgtq_f32(c1, thresholds2); f1 = vbslq_f32(e1, buckets2, f1);
            e1 = vcgtq_f32(c1, thresholds3); f1 = vbslq_f32(e1, buckets3, f1);
            e1 = vcgtq_f32(c1, thresholds4); f1 = vbslq_f32(e1, buckets4, f1);

            // register 2
            e2 = vcgtq_f32(c2, thresholds1); f2 = vbslq_f32(e2, buckets1, f2);
            e2 = vcgtq_f32(c2, thresholds2); f2 = vbslq_f32(e2, buckets2, f2);
            e2 = vcgtq_f32(c2, thresholds3); f2 = vbslq_f32(e2, buckets3, f2);
            e2 = vcgtq_f32(c2, thresholds4); f2 = vbslq_f32(e2, buckets4, f2);

            // register 3
            e3 = vcgtq_f32(c3, thresholds1); f3 = vbslq_f32(e3, buckets1, f3);
            e3 = vcgtq_f32(c3, thresholds2); f3 = vbslq_f32(e3, buckets2, f3);
            e3 = vcgtq_f32(c3, thresholds3); f3 = vbslq_f32(e3, buckets3, f3);
            e3 = vcgtq_f32(c3, thresholds4); f3 = vbslq_f32(e3, buckets4, f3);

            vst1q_f32(&finalOutputArrayF[i],      f0);
            vst1q_f32(&finalOutputArrayF[i + 4],  f1);
            vst1q_f32(&finalOutputArrayF[i + 8],  f2);
            vst1q_f32(&finalOutputArrayF[i + 12], f3);
        }
    }
    NSLog(@"done");
}
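As a point of comparison, the roundf(x / lcd) * lcd formulation from the C version can also be written directly with intrinsics, replacing the four compare/select pairs per vector with a multiply, a round, and a multiply. A sketch, assuming non-negative inputs (vcvtq_s32_f32 truncates toward zero, so 0.5 is added before converting) and a count that is a multiple of 4:

#include <arm_neon.h>

// Sketch: round each float to the nearest multiple of 25 arithmetically,
// mirroring the scalar roundf(x / lcd) * lcd. Assumes non-negative inputs
// and a count that is a multiple of 4.
static void bucketArithmetic(const float *input, float *output, int count)
{
    const float32x4_t half = vdupq_n_f32(0.5f);
    for (int i = 0; i < count; i += 4) {
        float32x4_t x      = vld1q_f32(&input[i]);
        float32x4_t scaled = vmulq_n_f32(x, 1.0f / 25.0f);            // x / lcd
        int32x4_t   r      = vcvtq_s32_f32(vaddq_f32(scaled, half));  // round (truncate after +0.5)
        float32x4_t y      = vmulq_n_f32(vcvtq_f32_s32(r), 25.0f);    // * lcd
        vst1q_f32(&output[i], y);
    }
}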

PS: This is my first benchmarking at this scale, so I tried to keep it simple (large loops, constant setup code, NSLog to print the start/end times, only the Accelerate framework linked). If any of these assumptions is impacting the outcome, please critique.
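As an aside, the timings could also be taken with explicit timestamps around just the loop under test rather than NSLog lines; a minimal sketch using CFAbsoluteTimeGetCurrent (the runHistogramLoop helper is hypothetical):

#import <Foundation/Foundation.h>

// Minimal timing sketch: wrap only the loop under test, not the setup code.
// CFAbsoluteTimeGetCurrent() returns seconds as a double.
static double timeBlock(void (^block)(void))
{
    CFAbsoluteTime start = CFAbsoluteTimeGetCurrent();
    block();
    return CFAbsoluteTimeGetCurrent() - start;
}

// usage (runHistogramLoop stands in for one of the loops above):
//   double seconds = timeBlock(^{ runHistogramLoop(); });
//   NSLog(@"took %.3f s", seconds);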

Thanks.

First, this isn't "NEON" per se. These are intrinsics. It's nearly impossible to get good NEON performance using intrinsics under clang or gcc. If you think you need intrinsics, you should hand-write the assembler.

vDSP is not "better optimized" than NEON. vDSP on iOS uses the NEON processor. What's happening is that vDSP's use of NEON is much better optimized than your use of NEON.

I haven't dug through your intrinsics code yet, but the most likely (in fact, nearly certain) cause of the problem is that you're creating wait states. Writing in assembler (and intrinsics are just assembler written with welding gloves on) is nothing like writing in C. You don't loop the same way. You don't compare the same way. You need a new way of thinking. In assembly you can do more than one thing at a time (because you have different logic units), but you absolutely have to schedule things in such a way that all of those things can run in parallel. Good assembly keeps all the pipelines full. If you can read your code and it makes perfect sense, it's probably crap assembly code. If you never repeat yourself, it's probably crap assembly code. You need to carefully consider what's going into which register and how many cycles there are until you're allowed to read it back.
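To make the wait-state point concrete, here is a rough sketch of the inner loop trimmed to just the first threshold: all four independent compares are issued before any select reads a result, so each compare has a few cycles to land before it is consumed. Whether this beats what the compiler already schedules for the intrinsics is exactly the kind of thing you measure rather than assume.

#include <arm_neon.h>

// Rough illustration only: one threshold instead of four, count assumed to be
// a multiple of 16. The compares are grouped ahead of the selects so that no
// select reads a result produced by the instruction immediately before it.
static void bucketFirstThresholdInterleaved(const float *input, float *output, int count)
{
    const float32x4_t bucket0    = vdupq_n_f32(0.0f);
    const float32x4_t bucket1    = vdupq_n_f32(25.0f);
    const float32x4_t threshold1 = vdupq_n_f32(12.5f);

    for (int i = 0; i < count; i += 16) {
        float32x4_t c0 = vld1q_f32(&input[i]);
        float32x4_t c1 = vld1q_f32(&input[i + 4]);
        float32x4_t c2 = vld1q_f32(&input[i + 8]);
        float32x4_t c3 = vld1q_f32(&input[i + 12]);

        // four independent compares, back to back
        uint32x4_t e0 = vcgtq_f32(c0, threshold1);
        uint32x4_t e1 = vcgtq_f32(c1, threshold1);
        uint32x4_t e2 = vcgtq_f32(c2, threshold1);
        uint32x4_t e3 = vcgtq_f32(c3, threshold1);

        // ...then the selects, which now read results produced several
        // instructions earlier rather than one instruction earlier
        vst1q_f32(&output[i],      vbslq_f32(e0, bucket1, bucket0));
        vst1q_f32(&output[i + 4],  vbslq_f32(e1, bucket1, bucket0));
        vst1q_f32(&output[i + 8],  vbslq_f32(e2, bucket1, bucket0));
        vst1q_f32(&output[i + 12], vbslq_f32(e3, bucket1, bucket0));
    }
}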

If it were as easy as transliterating the C, the compiler would just do that for you. The moment you say "I'm going to write this in NEON," you're saying "I think I can write better NEON than the compiler," because the compiler uses NEON too. That said, it often is possible to write better NEON than the compiler (particularly with gcc and clang).

If you're ready to go diving into this world (and it's a pretty cool world), you have some reading ahead of you. Here are the places I recommend:

- http://www.coranac.com/tonc/text/asm.htm (you will want to spend some quality time with this one)
- http://hilbert-space.de/ (the whole site; he stopped writing way too soon)
- To your specific question, he explains it here: http://hilbert-space.de/?p=22
- Some more on your specific question: "ARM NEON intrinsics vs hand assembly"
- http://wanderingcoder.net/2010/07/19/ought-arm/
- http://wanderingcoder.net/2010/06/02/intro-neon/
- You may also be interested in: http://robnapier.net/blog/fast-bezier-intro-701 and http://robnapier.net/blog/faster-bezier-722

All that said... always, always start by reconsidering your algorithm. Often the answer isn't how to make your loop calculate more quickly, it's how to not call the loop so often.

objective-c assembly arm neon vdsp
