And I heard there were significant speed improvements, is this true? Any idea when it will be out?
Originally Posted by Joe O
Improving, Yes.
Originally Posted by Matt
It goes to 2^52, page faults less, has yet to go into a loop or leak memory.
Joe O
I figured out how to resolve the thrashing issue, but I don't understand what causes it. When I pull some of the functions out and put them into a separate file, I am then able to compile with -O3 and get a 5% gain over -O2. Is gcc inlining one of those functions? If so, why is it killing performance so badly? Maybe one of you guys knows gcc much better than I do and can answer the question.
Originally Posted by rogue
Mark,
For starters, check the symbol table addresses of the functions in the external file and see whether you are getting page alignments. Also, pulling those functions out lets the functions 'above' and 'below' the removed one(s) land on the same page and/or fit in cache at the same time. Locality of reference with respect to cache is key.
You have control of inlining with GCC, as you know, and it shouldn't inline functions unless your default CFLAGS enable it, or you specify it in code or on the command line. Check the flags that '-O3' turns on for your processor; it may enable automatic inlining, as you suspect. You can override it on the command line, of course.
The x86 version behaves differently, but it is still largely 'cache centric'. Cache line size versus the memory bus interface is the other factor: the number of cycles required to get a cache line into the CPU makes a big difference, and the AMD and Intel parts have major differences right here. I suspect you are seeing something similar.
Email me if you wish and we can discuss off board.
C.
Chuck,
Since Alex (who is aware of my findings) is leading the effort on the PPC port, I'll leave it to him to find a solution. To me it isn't an issue since this workaround works perfectly well.
I am also interested in a version, but for the Intel MacBook Pros.
Thank you