Profile of Soul Tomb Setup

skyjake · 2006 May 20

This post was originally made by skyjake on the dengDevs blog. It was posted under the categories: Engine, jDoom.

I ran Apple's Shark profiling tool on <tt>soultomb-2-1.wad</tt>. It works by periodically interrupting running processes and sampling their internal state, including the function callstack. During the map setup, Shark collected 174157 samples which show the following results.

<pre>+ 92.7% Z_Malloc (Doomsday)
| - 85.6% RIT_ShadowSubsectorLinker (Doomsday)
| - 3.6% P_InitSubsectorBlockMap (Doomsday)
| - 1.8% Z_Calloc (Doomsday)
| - 1.8% R_InitPlanePoly (Doomsday)
| - 0.0% DL_InitLinks (Doomsday)
| - 0.0% Str_Set (Doomsday)
| - 0.0% LG_Init (Doomsday)
- 1.4% ml_set_interrupts_enabled (mach_kernel)
- 1.0% LG_Init (Doomsday)
- 0.6% S_CalcSectorReverbs (Doomsday)
- 0.5% Resampler2::ConvertAltivec() (AudioToolbox)
- 0.3% R_IsPointInSector2 (Doomsday)
- 0.2% P_SubsectorBoxIteratorv (Doomsday)
- 0.2% R_InitLineNeighbors (Doomsday)
</pre>

Observations:
<ul>
<li>Z_Malloc is indeed the function where the most time is being spent.
<li>The majority of Z_Malloc calls (~86%) originate from RIT_ShadowSubsectorLinker (as you suspected, Dani). This is definitely the place where optimization is needed.
<li>A minority of calls come from the subsector blockmap init (3.6%).
<li>LG_Init is quite fast, with only 1% of the setup time. Since the setup took approximately 174 seconds, this means that LG_Init only takes about 2 seconds for this map.
</ul>

<blockquote>I doubt there is a direct correlation between the number of subsectors/shadowpolys and the number of shadowlinks required(?). So I guess we can't allocate all we need in one go so it would be best to allocate in batches of a reasonable number using a pool.</blockquote> I agree. Since shadowlink_t is very small (2 pointers), we can allocate something like 1024 of them at a time. I'm thinking it should go something like this:

<ol>
<li>Allocate the first 1024 batch.
<li>In R_LinkShadow, use the next free shadowlink_t from the batch.
<li>When 1024 links have been used up, allocate the next batch and start using those in R_LinkShadow.
<li>Note that in the end we cannot realloc the last batch to save space because that might change the address of the memory block and thus invalidate any pointers to the links in the last batch.
</ol>

Dani, do you want to take care of this?

DaniJ · 2006 May 20

<blockquote>Dani, do you want to take care of this?</blockquote>
Yep I can take care of this.

I'm thinking it would be useful to have a generic mechanism for this kind of memory allocation functioning like an object. So perhaps we could do something like:
<code>
typedef zblock_s {
int numelementsperblock;
int lastusedelement;
size_t elementsize;
void** arrayptrtoblocks;
} zblock_t;</code>

zblock_t *Z_BlockMalloc(size_t sizeofelement, int batchsize, int purgelevel);
void *Z_BlockNew(zblock_t* block);
void Z_BlockDestroy(zblock_t* block); // Would free the entire thing (change purge tag), all blocks, all elements.

skyjake · 2006 May 20

That is a good idea. I'd prefer Z_BlockFree(), though, instead of Z_BlockDestroy.

Also, instead of lastusedelement, just save the number of used elements. Then the first unused is elements[count] (or rather, <tt>((byte*)elements) + (elementSize * count)</tt>).

<pre>
typedef struct zblock_s {
unsigned int max; // maximum number of elements
unsigned int count; // number of used elements
size_t elementSize; // size of a single element
void* elements;
// block of memory where elements
// are located in (cast to byte to get offset)
} zblock_t;
</pre>

DaniJ · 2006 May 20

Ok I'll get on it.

In your version above I notice we are not retaining ptrs to previously allocated blocks. I take it there is no need (given that ptrs to the objects themselves will be used elsewhere and the memory is from the zone anyway)?

It was interesting to hear the results of LG_Init with this map. I was expecting that to take much longer.

I assume you profiled using the default "rend-bias-grid-multisample 1" (5 samples per block). It would be interesting to compare the results of the same test with the multisample factor set to 7 (65 samples per block) to see how well the current algorithm scales in practice.

skyjake · 2006 May 20

Yes, rend-bias-grid-multisample was at the default of 1. I'll see what 7 will do.

BTW, see my edits of my previous post about zblock_t.

DaniJ · 2006 May 20

Heh, see my edit too

skyjake · 2006 May 20

Why would we want to store pointers to the previously allocated blocks? zblock_t effectively is a zmalloced block itself, so it becomes the responsibility of the caller to make sure it gets treated appropriately as all zmalloced blocks should. In this case, it would fall upon r_shadow.c to keep track of all the blocks it has allocated. There could be another utility structure for this bookkeeping, but it should not be merged with the zblocks themselves. (I'm sure dynamically linked lists containing a set of void* pointers would come in handy in other places as well.)

skyjake · 2006 May 20

How about adding a void* Z_BlockAllocElement(zblock_t* block) for allocated a single element from a zblock?

skyjake · 2006 May 20

Ah, I re-read your earlier comment and understood that you have multiple blocks in one zblock_t. (You are missing a counter for the number of allocated blocks.)

Yes, it would make it easier to use a zblock_t if there is no need for manual bookkeepping. However, I think a zblock_t should really be a simple block of elements. If we don't do a generic linked list system for the bookkeeping, we can do something like:

<pre>
typedef struct zblockset_s {
unsigned int count;
zblock_t* blocks;
} zblockset_t;
</pre>

Then the user would allocate elements from the blockset, and the blockset would add blocks as needed.

DaniJ · 2006 May 20

<blockquote>Why would we want to store pointers to the previously allocated blocks? zblock_t effectively is a zmalloced block itself, so it becomes the responsibility of the caller to make sure it gets treated appropriately as all zmalloced blocks should. In this case, it would fall upon r_shadow.c to keep track of all the blocks it has allocated. There could be another utility structure for this bookkeeping, but it should not be merged with the zblocks themselves. (I'm sure dynamically linked lists containing a set of void* pointers would come in handy in other places as well.)</blockquote>
Good point. I was attempting to merge both functionality into one which in retrospect doesn't seem like a good idea to me either.

The zmalloced blocks shouldn't really be public either. The ideal implementation would encapsulate them using said utility structure to handle the booking and provide the public interface(s) to the zmalloced blocks.

Complex data structures could then be built up using the interface of the utility stucture which intern uses storage in zblock form.
<blockquote>Ah, I re-read your earlier comment and understood that you have multiple blocks in one zblock_t. (You are missing a counter for the number of allocated blocks.)</blockquote>
Opps, yes I forgot to add that vital clue.

DaniJ · 2006 May 20

I think we've just about covered everything. Is there anything else we should/should not do with these?

Should zblock_t be intergrated directly with the memory zone handling? For example, when zmalloced memory pointed to by zblock_t is freed by a tag purge should zblock_t be zeroed out automatically?

skyjake · 2006 May 20

<blockquote>Should zblock_t be intergrated directly with the memory zone handling? For example, when zmalloced memory pointed to by zblock_t is freed by a tag purge should zblock_t be zeroed out automatically?</blockquote> I'm thinking K.I.S.S. is again a good principle here. The fewer changes there are to the core memory block code, the safer we are.

skyjake · 2006 May 20

Here are the results with "rend-bias-grid-multisample 7":
<pre>
- 87.9% Z_Malloc (Doomsday)
+ 4.3% R_IsPointInSector2 (Doomsday)
| - 4.3% LG_Init (Doomsday)
- 1.5% ml_set_interrupts_enabled (mach_kernel)
- 1.2% LG_Init (Doomsday)
- 0.8% R_PointInSubsector (Doomsday)
- 0.6% S_CalcSectorReverbs (Doomsday)
- 0.5% Resampler2::ConvertAltivec() (AudioToolbox)
- 0.3% FixedMul (Doomsday)
- 0.2% P_SubsectorBoxIteratorv (Doomsday)
- 0.2% R_InitLineNeighbors (Doomsday)
</pre>

The LG_Init time definitely increases, from the 1% of the first case to about 5.5% (total). Interestingly, the function that takes up the largest amount of time is R_IsPointInSector2.

DaniJ · 2006 May 20

That is much as I expected.

See my notes about a potential optimization in LG_Init. Currently when "rend-bias-grid-multisample" > 0 there is a considerable number of calls to R_IsPointInSector2 which can be completely avoided. On average 40% of the calls are unnecessary and that figure increases the greater the multisample factor * number of blocks.

We just need to be a bit cleverer and cut out the duplicated calls by doing all the R_IsPointInSector2 work before going into the intialization loop.

Yagisan · 2006 May 20

I built svn 3187 with gcc 4.0 with CFLAGS set to ""-march=i486 -mtune=pentium4 -O2 -ggdb3 -pg". Map was loaded 3 times, first a cold run, second 2 both warm runs, with basically all non-conflicting jpacks installed, with a default jdoom.cfg. To reduce statistical variations, the output from the 3 runs was merged together and analysed with gprof.

The results are too big to paste here, so I'll just link to them.
The flat profile:
http://eyagi.bpa.nu/~jamie/uf/deng-3187/flat_profile.txt
The flat profile, line by line output:
http://eyagi.bpa.nu/~jamie/uf/deng-3187/flat_profile-line_by_line.txt
Call Graph output:
http://eyagi.bpa.nu/~jamie/uf/deng-3187/call_graph.txt
My raw data:
http://eyagi.bpa.nu/~jamie/uf/deng-3187/
Annotated source files:
http://eyagi.bpa.nu/~jamie/uf/deng-3187/annotated_source/

skyjake · 2006 May 20

The gprof results are a tad more difficult to interpret, but the gist seems to be the same: Z_Malloc is taking the bulk of execution time. With Shark I was able to restrict the sampling to just the map setup phase, but I don't think gprof has this feature. Therefore the gprof results probably contain samples from the engine startup as well (?).

DaniJ · 2006 May 20

<blockquote>Therefore the gprof results probably contain samples from the engine startup as well (?).</blockquote>
It looks like it, there are functions listed there which are only called during startup.

In SVN 3189:
CHANGED: Implemented zblocks and zblocksets to be used to replace linear memory allocations when there are an unknown number of elements at the start of the process. Currently only being used for the shadow links.

I would be greatful if someone could give it the once over.

Although we can pretty much guess the outcome (soultomb-2-1.wad MAP06 loads in about 30 seconds for me now with "rend-bias-grid-multisample 1") - I think we'd better do another profile with this new code in place.

If that all works out ok we should have a look to see where else could benefit from this optimization.

skyjake · 2006 May 20

I can see a couple of things that could clarify the zblock code. I'll commit changes soon.

skyjake · 2006 May 20

Revision 3191: revised zblock code. Some notes in the commit log message.

Total setup time reduced from 174 seconds to 18 seconds, with shadow linking simply dropping off the radar:
<pre>
+ 65.4% Z_Malloc (Doomsday)
| - 32.2% P_InitSubsectorBlockMap (Doomsday)
| + 15.9% Z_Calloc (Doomsday)
| | - 15.7% R_InitPlanePoly (Doomsday)
| | - 0.1% DL_InitLinks (Doomsday)
| | - 0.0% R_SetupLevel (Doomsday)
| | - 0.0% LG_Init (Doomsday)
| | - 0.0% R_InitSectorShadows (Doomsday)
| | - 0.0% P_InitSubsectorBlockMap (Doomsday)
| - 15.8% R_InitPlanePoly (Doomsday)
| - 0.7% Z_Recalloc (Doomsday)
| - 0.7% Z_BlockNewElement (Doomsday)
| - 0.0% Z_BlockCreate (Doomsday)
| - 0.0% R_InitVertexOwners (Doomsday)
| - 0.0% DL_InitLinks (Doomsday)
| - 0.0% LG_Init (Doomsday)
- 8.6% LG_Init (Doomsday)
- 6.1% S_CalcSectorReverbs (Doomsday)
+ 3.5% R_IsPointInSector2 (Doomsday)
| - 3.5% LG_Init (Doomsday)
- 1.7% R_InitLineNeighbors (Doomsday)
</pre>

Observations:
<ul>
<li>Now that the single greatest slowdown has been eliminated, there is a wider distribution of functions taking up time.
<li>Z_Malloc still leads, though, so perhaps at least P_initSubsectorBlockMap should also receive some optimization.
<li>The other significant Z_Malloc caller is R_InitPlanePoly (directly and via Z_Calloc).
<li>LG_Init (multisample 1) has risen to 8.6%. This noticeably takes a few seconds during the setup. However, it's still quite a little when compared to Z_Malloc.
<li>Z_BlockNewElement is in the charts with 0.7%, undoubtedly due to the 1024 element max. If the maximum were raised, this would disappear completely (however, 0.7% isn't that bad...).
</ul>

All we need is lightning
With power and might
Striking down the prophets of false

DaniJ · 2006 May 21

<blockquote>Revision 3191: revised zblock code. Some notes in the commit log message.</blockquote>
Yes, your revised version is more elegant. I'm glad to see my implementation wasn't far off the mark though.

Thanks for explaining the changes too btw, it helps me to understand better what is considered good/common practice.
<blockquote>Total setup time reduced from 174 seconds to 18 seconds, with shadow linking simply dropping off the radar:</blockquote>
Excellent.
<blockquote>
<ul>
<li>The other significant Z_Malloc caller is R_InitPlanePoly (directly and via Z_Calloc).</li>
<li>LG_Init (multisample 1) has risen to 8.6%. This noticeably takes a few seconds during the setup. However, it's still quite a little when compared to Z_Malloc.</li>
</ul>
</blockquote>
Now, with multisample 7 - I would expect LG_Init to close the gap between it and Z_Malloc significantly.

Both LG_Init and P_InitSubsectorBlockMap allocate linearly too and perhaps could benefit from using zblocks. Try dropping them in and see how it affects the results.
<blockquote>All we need is lightning
With power and might
Striking down the prophets of false</blockquote>
As the moon is rising
Give us the sign
Now let us rise up in awe

Oddly, I recognised that instantly... good luck Finland

Yagisan · 2006 May 21

<blockquote>Therefore the gprof results probably contain samples from the engine startup as well (?).</blockquote> Yes they do. The entire doomsday was build with profiling instrumentation, and profiling information was captured from the moment of startup to map06 loaded, and I typed quit. Sorry for not repling sooner, I promply fell asleep after posting that (it was around 4am).

What I should do, is play through a few levels on defalt settings (excluding controls) to get a feel for what is the main bottle next. I'll do a run for doom, heretic and hexen.

skyjake · 2006 May 21

Revision 3195: rather radical optimizations to the memory zone: fast malloc mode and PU_LEVELSTATIC sequences.

Fast malloc mode: memory volumes aren't checked for purgable blocks. If the rover block isn't suitable, a new empty volume is created without further checking. This is suitable for cases where lots of blocks are being allocated in a rapid sequence, with no frees in between (e.g., map setup).

PU_LEVELSTATIC: The PU_LEVELSTATIC purge tag has a special purpose. It works like PU_LEVEL so that it is purged on a per level basis, but blocks allocated as PU_LEVELSTATIC should not be freed at any time when the map is being used. Internally, the level-static blocks are linked into sequences so that Z_Malloc knows to skip all of them efficiently. This is possible because no block inside the sequence could be purged by Z_Malloc anyway.

How did I end up with these? I started looking at Z_Malloc more closely with Shark telling me the actual source lines where the most time is being spent. I noticed that the loop in Z_Malloc, where the function is looking for old purgable blocks, is taking up most of the time. Now, this is understandable since the map setup code is allocating huge numbers of memory blocks from the volume. However: those blocks won't be freed at any point when the map is being played, so what is point of checking them for purging? Thus I added a new purgelevel for these "static" PU_LEVEL blocks, and linked them into sequences so that Z_Malloc knows to skip them.

The fast malloc mode is a futher optimization that permits Z_Malloc to simply stop looking for purgable and free blocks completely and just allocate new blocks from the rover. This is only used during map setup. Once gameplay continues, fast mode is disabled and any free blocks or purgable blocks inside the volume are used. This shouldn't increase fragmentation inside the volumes much since the blocks that are allocated in fast mode are the huge PU_LEVELSTATIC sequences.

However, I recommend observing the console output for spurious Z_Create messages. If fragmentation is occuring, the zone will gradually get larger and larger as new stuff doesn't fit in the existing volumes. If this happens, we can disable fast mode and rely on the level-static sequences for performance.

Total setup time reduced down to 5 seconds, with mallocing disappearing completely:
<pre>
- 35.4% LG_Init (Doomsday)
+ 11.8% R_IsPointInSector2 (Doomsday)
| - 11.8% LG_Init (Doomsday)
- 4.3% SB_BeginFrame (Doomsday)
- 4.2% R_InitLineNeighbors (Doomsday)
- 4.1% V2_Intercept2 (Doomsday)
</pre>

LG_Init (multisample 1) is now the slowest function.

DaniJ · 2006 May 21

Wow. This is the kind of observation one can only make with proper profiling and debug tools.

Excellent work.

Now with LG_Init being the slowest performing function due largely to the number of calls to R_IsPointInSector2, we should look at implementing the optimization I proposed in the comments there. Before looking at R_IsPointInSector2 that is.

I started that last night but it was late and I couldn't figure the equation for calculating the total number of sample points (thus calls to R_IsPointInSector2) in the optimized scheme.

However, 5 seconds for a map this large is perfectly bareable.

Yagisan · 2006 May 21

<blockquote>Wow. This is the kind of observation one can only make with proper profiling and debug tools.</blockquote>FWIW, you can use the same tools (well many of them anyway), once I redo the autotools build system to make sure it can build with MinGW on WinNT.

DaniJ · 2006 May 21

I've now figured the correct equation for calculating the optimum number of sample points required for LG_Init so I'll get working on the optimization I mentioned.

It won't have anywhere near the impact on the level setup time as your recent Z_Malloc/rover optimization but it'll help.
<blockquote>FWIW, you can use the same tools (well many of them anyway), once I redo the autotools build system to make sure it can build with MinGW on WinNT.</blockquote>
Great, I'll look forward to that.

DaniJ · 2006 May 21

<blockquote>I've now figured the correct equation for calculating the optimum number of sample points required for LG_Init so I'll get working on the optimization I mentioned.</blockquote>
Well, the optimized algorithm is in place and appears to be working correctly however it seems the equation I'm using to calculate the optimum number of sample points isn't.

How's your algebra skyjake, yagisan? Could one of you take a look at the comments regarding this problem in LG_Init.

EDIT: Duh. Scratch that request, I've just realized what the problem is.

DaniJ · 2006 May 23

In SVN 3207:

Implemented the optimization described in the comments of LG_Init. Now only the optimum number of sample points are checked with R_PointInSubsector.

There is still room for improvement with regard to memory usage though.

skyjake - Could you run another profile on the map load process to see what the cleverer code in LG_Init does to the results.

Profile of Soul Tomb Setup

Comments