Profile of Soul Tomb Setup
<i>This post was originally made by <b>skyjake</b> on the dengDevs blog. It was posted under the categories: Engine, jDoom.</i>
I ran Apple's Shark profiling tool on <tt>soultomb-2-1.wad</tt>. It works by periodically interrupting running processes and sampling their internal state, including the function callstack. During the map setup, Shark collected 174157 samples which show the following results.
<pre>+ 92.7% Z_Malloc (Doomsday)
| - 85.6% RIT_ShadowSubsectorLinker (Doomsday)
| - 3.6% P_InitSubsectorBlockMap (Doomsday)
| - 1.8% Z_Calloc (Doomsday)
| - 1.8% R_InitPlanePoly (Doomsday)
| - 0.0% DL_InitLinks (Doomsday)
| - 0.0% Str_Set (Doomsday)
| - 0.0% LG_Init (Doomsday)
- 1.4% ml_set_interrupts_enabled (mach_kernel)
- 1.0% LG_Init (Doomsday)
- 0.6% S_CalcSectorReverbs (Doomsday)
- 0.5% Resampler2::ConvertAltivec() (AudioToolbox)
- 0.3% R_IsPointInSector2 (Doomsday)
- 0.2% P_SubsectorBoxIteratorv (Doomsday)
- 0.2% R_InitLineNeighbors (Doomsday)
</pre>
Observations:
<ul>
<li>Z_Malloc is indeed the function where the most time is being spent.
<li>The majority of Z_Malloc calls (~86%) originate from RIT_ShadowSubsectorLinker (as you suspected, Dani). This is definitely the place where optimization is needed.
<li>A minority of calls come from the subsector blockmap init (3.6%).
<li>LG_Init is quite fast, with only 1% of the setup time. Since the setup took approximately 174 seconds, this means that LG_Init only takes about 2 seconds for this map.
</ul>
<blockquote>I doubt there is a direct correlation between the number of subsectors/shadowpolys and the number of shadowlinks required(?). So I guess we can't allocate all we need in one go so it would be best to allocate in batches of a reasonable number using a pool.</blockquote> I agree. Since shadowlink_t is very small (2 pointers), we can allocate something like 1024 of them at a time. I'm thinking it should go something like this:
<ol>
<li>Allocate the first 1024 batch.
<li>In R_LinkShadow, use the next free shadowlink_t from the batch.
<li>When 1024 links have been used up, allocate the next batch and start using those in R_LinkShadow.
<li>Note that in the end we cannot realloc the last batch to save space because that might change the address of the memory block and thus invalidate any pointers to the links in the last batch.
</ol>
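Roughly, the batching idea could be sketched like this (R_NewShadowLink, the static batch variables and the PU_LEVEL tag are just illustrative here, not the final implementation):
<pre>
#define SHADOWLINK_BATCH 1024

static shadowlink_t* batch = NULL;  // current batch of links
static int batchUsed = 0;           // links handed out from the current batch

// Hypothetical helper that R_LinkShadow would call to get a free link.
shadowlink_t* R_NewShadowLink(void)
{
    if(!batch || batchUsed == SHADOWLINK_BATCH)
    {
        // First call, or the previous batch is used up: allocate a new one.
        // The old batch is never realloc'ed, so pointers to its links
        // remain valid.
        batch = Z_Malloc(sizeof(shadowlink_t) * SHADOWLINK_BATCH, PU_LEVEL, 0);
        batchUsed = 0;
    }
    return &batch[batchUsed++];
}
</pre>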
Dani, do you want to take care of this?
Comments
Yep I can take care of this.
I'm thinking it would be useful to have a generic mechanism for this kind of memory allocation functioning like an object. So perhaps we could do something like:
<pre>
typedef struct zblock_s {
    int numelementsperblock;
    int lastusedelement;
    size_t elementsize;
    void** arrayptrtoblocks;
} zblock_t;

zblock_t* Z_BlockMalloc(size_t sizeofelement, int batchsize, int purgelevel);
void* Z_BlockNew(zblock_t* block);
void Z_BlockDestroy(zblock_t* block); // Frees the entire thing (change purge tag), all blocks, all elements.
</pre>
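For context, usage of the proposed interface might look roughly like this (PU_LEVEL is just an example purge level):
<pre>
// Illustrative usage of the proposed interface:
zblock_t* shadowLinks = Z_BlockMalloc(sizeof(shadowlink_t), 1024, PU_LEVEL);

shadowlink_t* link = Z_BlockNew(shadowLinks); // next free element; new batch allocated if needed

Z_BlockDestroy(shadowLinks); // at level teardown: frees all blocks and their elements
</pre>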
Also, instead of lastusedelement, just save the number of used elements. Then the first unused is elements[count] (or rather, <tt>((byte*)elements) + (elementSize * count)</tt>).
<pre>
typedef struct zblock_s {
    unsigned int max;       // maximum number of elements
    unsigned int count;     // number of used elements
    size_t elementSize;     // size of a single element
    void* elements;         // block of memory where the elements are
                            // located (cast to byte* to get an offset)
} zblock_t;
</pre>
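For illustration, the next free element would then be located with that pointer arithmetic, something like this (Z_BlockGetNext is a hypothetical helper name; byte is the usual unsigned char typedef):
<pre>
// Hypothetical helper: return the next unused element, or NULL if the block is full.
void* Z_BlockGetNext(zblock_t* block)
{
    if(block->count >= block->max)
        return NULL; // full; the caller must move on to a new block

    return (byte*) block->elements + block->elementSize * block->count++;
}
</pre>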
In your version above I notice we are not retaining ptrs to previously allocated blocks. I take it there is no need (given that ptrs to the objects themselves will be used elsewhere and the memory is from the zone anyway)?
It was interesting to hear the results of LG_Init with this map. I was expecting that to take much longer.
I assume you profiled using the default "rend-bias-grid-multisample 1" (5 samples per block). It would be interesting to compare the results of the same test with the multisample factor set to 7 (65 samples per block) to see how well the current algorithm scales in practice.
BTW, see my edits of my previous post about zblock_t.
Yes, it would make it easier to use a zblock_t if there is no need for manual bookkeeping. However, I think a zblock_t should really be a simple block of elements. If we don't do a generic linked list system for the bookkeeping, we can do something like:
<pre>
typedef struct zblockset_s {
unsigned int count;
zblock_t* blocks;
} zblockset_t;
</pre>
Then the user would allocate elements from the blockset, and the blockset would add blocks as needed.
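For illustration, allocating an element from the set might then work roughly as follows (the layout and the Z_AddBlockToSet helper are assumptions, not committed code):
<pre>
// Sketch: hand out the next element, appending a fresh block when the current one is full.
void* Z_BlockSetNew(zblockset_t* set)
{
    zblock_t* block = &set->blocks[set->count - 1]; // current (last) block

    if(block->count >= block->max)
    {
        Z_AddBlockToSet(set); // hypothetical: Z_Mallocs another zblock_t and its storage
        block = &set->blocks[set->count - 1];
    }
    return (byte*) block->elements + block->elementSize * block->count++;
}
</pre>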
Good point. I was attempting to merge both functionalities into one, which in retrospect doesn't seem like a good idea to me either.
The zmalloced blocks shouldn't really be public either. The ideal implementation would encapsulate them using said utility structure to handle the bookkeeping and provide the public interface(s) to the zmalloced blocks.
Complex data structures could then be built up using the interface of the utility structure, which in turn uses storage in zblock form.
<blockquote>Ah, I re-read your earlier comment and understood that you have multiple blocks in one zblock_t. (You are missing a counter for the number of allocated blocks.)</blockquote>
Oops, yes, I forgot to add that vital clue.
Should zblock_t be integrated directly with the memory zone handling? For example, when zmalloced memory pointed to by a zblock_t is freed by a tag purge, should the zblock_t be zeroed out automatically?
Here are the results with the multisample factor set to 7:
<pre>
- 87.9% Z_Malloc (Doomsday)
+ 4.3% R_IsPointInSector2 (Doomsday)
| - 4.3% LG_Init (Doomsday)
- 1.5% ml_set_interrupts_enabled (mach_kernel)
- 1.2% LG_Init (Doomsday)
- 0.8% R_PointInSubsector (Doomsday)
- 0.6% S_CalcSectorReverbs (Doomsday)
- 0.5% Resampler2::ConvertAltivec() (AudioToolbox)
- 0.3% FixedMul (Doomsday)
- 0.2% P_SubsectorBoxIteratorv (Doomsday)
- 0.2% R_InitLineNeighbors (Doomsday)
</pre>
The LG_Init time definitely increases, from the 1% of the first case to about 5.5% (total). Interestingly, the function that takes up the largest amount of time is R_IsPointInSector2.
See my notes about a potential optimization in LG_Init. Currently, when "rend-bias-grid-multisample" > 0, there is a considerable number of calls to R_IsPointInSector2 which can be completely avoided. On average 40% of the calls are unnecessary, and that figure increases the greater the product of the multisample factor and the number of blocks.
We just need to be a bit cleverer and cut out the duplicated calls by doing all the R_IsPointInSector2 work before going into the initialization loop.
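As a rough illustration of the idea (the names and structure here are hypothetical, not the actual LG_Init code): resolve the sector of each unique sample point once, up front, and let the per-block loop simply read the cached results:
<pre>
static sector_t** sampleSectors; // one entry per unique sample point

void PrepareSamplePoints(int width, int height)
{
    int i, count = width * height;

    sampleSectors = Z_Malloc(sizeof(sector_t*) * count, PU_LEVEL, 0);
    for(i = 0; i < count; ++i)
    {
        // FindSampleSector() stands for the R_IsPointInSector2 work done
        // for this point; it is done exactly once per point here.
        sampleSectors[i] = FindSampleSector(i % width, i / width);
    }
    // The block initialization loop then looks up sampleSectors[] for each
    // block's sample points, so points shared by neighbouring blocks are
    // never tested twice.
}
</pre>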
The results are too big to paste here, so I'll just link to them.
The flat profile:
http://eyagi.bpa.nu/~jamie/uf/deng-3187/flat_profile.txt
The flat profile, line by line output:
http://eyagi.bpa.nu/~jamie/uf/deng-3187/flat_profile-line_by_line.txt
Call Graph output:
http://eyagi.bpa.nu/~jamie/uf/deng-3187/call_graph.txt
My raw data:
http://eyagi.bpa.nu/~jamie/uf/deng-3187/
Annotated source files:
http://eyagi.bpa.nu/~jamie/uf/deng-3187/annotated_source/
It looks like it; there are functions listed there which are only called during startup.
In SVN 3189:
CHANGED: Implemented zblocks and zblocksets, to be used to replace linear memory allocations when the number of elements is unknown at the start of the process. Currently they are only being used for the shadow links.
I would be grateful if someone could give it the once-over.
Although we can pretty much guess the outcome (soultomb-2-1.wad MAP06 loads in about 30 seconds for me now with "rend-bias-grid-multisample 1"), I think we'd better do another profile with this new code in place.
If that all works out OK, we should have a look to see what else could benefit from this optimization.
Total setup time reduced from 174 seconds to 18 seconds, with shadow linking simply dropping off the radar:
<pre>
+ 65.4% Z_Malloc (Doomsday)
| - 32.2% P_InitSubsectorBlockMap (Doomsday)
| + 15.9% Z_Calloc (Doomsday)
| | - 15.7% R_InitPlanePoly (Doomsday)
| | - 0.1% DL_InitLinks (Doomsday)
| | - 0.0% R_SetupLevel (Doomsday)
| | - 0.0% LG_Init (Doomsday)
| | - 0.0% R_InitSectorShadows (Doomsday)
| | - 0.0% P_InitSubsectorBlockMap (Doomsday)
| - 15.8% R_InitPlanePoly (Doomsday)
| - 0.7% Z_Recalloc (Doomsday)
| - 0.7% Z_BlockNewElement (Doomsday)
| - 0.0% Z_BlockCreate (Doomsday)
| - 0.0% R_InitVertexOwners (Doomsday)
| - 0.0% DL_InitLinks (Doomsday)
| - 0.0% LG_Init (Doomsday)
- 8.6% LG_Init (Doomsday)
- 6.1% S_CalcSectorReverbs (Doomsday)
+ 3.5% R_IsPointInSector2 (Doomsday)
| - 3.5% LG_Init (Doomsday)
- 1.7% R_InitLineNeighbors (Doomsday)
</pre>
Observations:
<ul>
<li>Now that the single greatest slowdown has been eliminated, there is a wider distribution of functions taking up time.
<li>Z_Malloc still leads, though, so perhaps at least P_InitSubsectorBlockMap should also receive some optimization.
<li>The other significant Z_Malloc caller is R_InitPlanePoly (directly and via Z_Calloc).
<li>LG_Init (multisample 1) has risen to 8.6%. It noticeably takes a few seconds during the setup. However, it's still quite small compared to Z_Malloc.
<li>Z_BlockNewElement shows up in the charts with 0.7%, undoubtedly due to the 1024-element maximum. If the maximum were raised, this would disappear completely (however, 0.7% isn't that bad...).
</ul>
<i>All we need is lightning
With power and might
Striking down the prophets of false</i>
Yes, your revised version is more elegant. I'm glad to see my implementation wasn't far off the mark, though.
Thanks for explaining the changes too, btw; it helps me to understand better what is considered good/common practice.
<blockquote>Total setup time reduced from 174 seconds to 18 seconds, with shadow linking simply dropping off the radar:</blockquote>
Excellent.
<blockquote>
<ul>
<li>The other significant Z_Malloc caller is R_InitPlanePoly (directly and via Z_Calloc).</li>
<li>LG_Init (multisample 1) has risen to 8.6%. This noticeably takes a few seconds during the setup. However, it's still quite a little when compared to Z_Malloc.</li>
</ul>
</blockquote>
Now, with multisample 7, I would expect LG_Init to close the gap between it and Z_Malloc significantly.
Both LG_Init and P_InitSubsectorBlockMap allocate linearly too and perhaps could benefit from using zblocks. Try dropping them in and see how it affects the results.
<blockquote>All we need is lightning
With power and might
Striking down the prophets of false</blockquote>
As the moon is rising
Give us the sign
Now let us rise up in awe
Oddly, I recognised that instantly... good luck Finland
What I should do is play through a few levels on default settings (excluding controls) to get a feel for where the main bottleneck is. I'll do a run for Doom, Heretic and Hexen.
Fast malloc mode: memory volumes aren't checked for purgable blocks. If the rover block isn't suitable, a new empty volume is created without further checking. This is suitable for cases where lots of blocks are being allocated in a rapid sequence, with no frees in between (e.g., map setup).
PU_LEVELSTATIC: The PU_LEVELSTATIC purge tag has a special purpose. It works like PU_LEVEL so that it is purged on a per level basis, but blocks allocated as PU_LEVELSTATIC should not be freed at any time when the map is being used. Internally, the level-static blocks are linked into sequences so that Z_Malloc knows to skip all of them efficiently. This is possible because no block inside the sequence could be purged by Z_Malloc anyway.
How did I end up with these? I started looking at Z_Malloc more closely, with Shark telling me the actual source lines where the most time is being spent. I noticed that the loop in Z_Malloc, where the function is looking for old purgable blocks, is taking up most of the time. Now, this is understandable since the map setup code is allocating huge numbers of memory blocks from the volume. However: those blocks won't be freed at any point while the map is being played, so what is the point of checking them for purging? Thus I added a new purge level for these "static" PU_LEVEL blocks, and linked them into sequences so that Z_Malloc knows to skip them.
The fast malloc mode is a further optimization that permits Z_Malloc to simply stop looking for purgable and free blocks completely and just allocate new blocks from the rover. This is only used during map setup. Once gameplay continues, fast mode is disabled and any free or purgable blocks inside the volume are used. This shouldn't increase fragmentation inside the volumes much, since the blocks allocated in fast mode are the huge PU_LEVELSTATIC sequences.
However, I recommend observing the console output for spurious Z_Create messages. If fragmentation is occurring, the zone will gradually get larger and larger as new stuff doesn't fit in the existing volumes. If this happens, we can disable fast mode and rely on the level-static sequences for performance.
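Conceptually (this is only a sketch; the structure layout and the seqLast field are assumptions, not the real memzone code), the rover can then step over a whole level-static sequence in one jump, and skip the purge scan entirely in fast mode:
<pre>
typedef struct memblock_s {
    size_t              size;
    int                 tag;      // purge tag (PU_LEVELSTATIC, PU_LEVEL, ...)
    struct memblock_s*  next;
    struct memblock_s*  seqLast;  // last block of the static sequence, if any
    /* ... owner, id, etc. ... */
} memblock_t;

// Sketch: choose the next block to consider during Z_Malloc's purge scan.
memblock_t* NextScanBlock(memblock_t* rover, int fastMalloc)
{
    if(fastMalloc)
        return NULL; // fast mode: don't scan for purgable blocks at all

    if(rover->tag == PU_LEVELSTATIC && rover->seqLast)
        return rover->seqLast->next; // nothing in the sequence is purgable; skip past it

    return rover->next;
}
</pre>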
Total setup time reduced to 5 seconds, with mallocing disappearing completely:
<pre>
- 35.4% LG_Init (Doomsday)
+ 11.8% R_IsPointInSector2 (Doomsday)
| - 11.8% LG_Init (Doomsday)
- 4.3% SB_BeginFrame (Doomsday)
- 4.2% R_InitLineNeighbors (Doomsday)
- 4.1% V2_Intercept2 (Doomsday)
</pre>
LG_Init (multisample 1) is now the slowest function.
Excellent work.
Now, with LG_Init being the slowest-performing function due largely to the number of calls to R_IsPointInSector2, we should look at implementing the optimization I proposed in the comments there, before looking at R_IsPointInSector2 itself.
I started that last night, but it was late and I couldn't figure out the equation for calculating the total number of sample points (and thus calls to R_IsPointInSector2) in the optimized scheme.
However, 5 seconds for a map this large is perfectly bearable.
It won't have anywhere near the impact on the level setup time that your recent Z_Malloc/rover optimization had, but it'll help.
<blockquote>FWIW, you can use the same tools (well many of them anyway), once I redo the autotools build system to make sure it can build with MinGW on WinNT.</blockquote>
Great, I'll look forward to that.
Well, the optimized algorithm is in place and appears to be working correctly; however, it seems the equation I'm using to calculate the optimum number of sample points isn't.
How's your algebra, skyjake, yagisan? Could one of you take a look at the comments regarding this problem in LG_Init?
EDIT: Duh. Scratch that request, I've just realized what the problem is.
Implemented the optimization described in the comments of LG_Init. Now only the optimum number of sample points are checked with R_PointInSubsector.
There is still room for improvement with regard to memory usage though.
skyjake - Could you run another profile of the map load process to see what the cleverer code in LG_Init does to the results?