Connect Stress
<i>This post was originally made by <b>skyjake</b> on the dengDevs blog. It was posted under the categories: Blog, Engine, Mac OS X.</i>
Lately I've been doing some network debugging. This has to be one of the most complicated debugging tasks there is, together with debugging multithreaded programs.
So far I've only been using a single computer with a dual-core CPU, so while some true parallel processing occurs, the network connection itself is still working perfectly with virtually no latency. To mix things up, Doomsday's network code has a transmission randomizer, which applies a random latency to all sent packets. Also, the randomizer drops 1% of the sent packets. This mainly affects game delta packets, which constitute the majority of network traffic between the server and a client. However, the randomizer should be enhanced to also affect packets sent via the reliable TCP connection. The reliable connection is used when the server and a client communicate about connection handshaking, game state changes, and other such information that needs to be handled in the right order and must not get lost on the way.
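As a rough illustration of what such a randomizer does, here is a minimal sketch in C. It is not the actual Doomsday code: the type and function names (randomizer_t, sendplan_t, Randomizer_Plan) and the per-mille drop setting are assumptions made up for this example. It only decides, per packet, whether to drop it or how long to hold it back before sending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical settings; these names are not taken from the Doomsday source. */
typedef struct {
    unsigned maxLatencyMs;  /* upper bound for the random delay */
    unsigned dropPerMille;  /* packets dropped per 1000 sent (10 = 1%) */
} randomizer_t;

/* What should happen to a single outgoing packet. */
typedef struct {
    bool     dropped;  /* true: discard the packet instead of sending it */
    unsigned delayMs;  /* otherwise: hold the packet this long before sending */
} sendplan_t;

/* Decide the fate of one outgoing packet. */
sendplan_t Randomizer_Plan(const randomizer_t *r)
{
    sendplan_t plan = { false, 0 };

    assert(r != NULL);

    /* Roll for packet loss first; a dropped packet needs no delay. */
    if ((unsigned)(rand() % 1000) < r->dropPerMille) {
        plan.dropped = true;
        return plan;
    }

    /* Pick a delay in the range 0..maxLatencyMs, inclusive. */
    plan.delayMs = (unsigned)rand() % (r->maxLatencyMs + 1);
    return plan;
}
```

With dropPerMille set to 10, roughly one packet in a hundred is discarded. The sender would queue each surviving packet and release it once delayMs has passed, which is also what lets packets overtake each other on the way out.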
I have been trying to reproduce <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1019136&group_id=74815&atid=542099">the problem</a> where one of the participants will crash when a new player joins the game. I set up one host player and three clients, and had the clients repeatedly connect, wait a while, and then disconnect. I made <a href="http://idisk.mac.com/skyjake/Public/Connect Stress.mp4">a video of this</a> for your enjoyment.
<center><img src="/images/Connect%20Stress%20Frame.jpg"/></center>
During the first couple of runs I immediately noticed and fixed a couple of problems. The first was that prematurely disconnected TCP connections were aborting the entire Doomsday process. This would cause a client, or the server, to simply exit with no error messages whatsoever. Unix/Linux users would see an SDL parachute message. Another problem was that, under some circumstances which are still a bit unclear to me, a client would try to destroy a NULL mobj, which caused a segfault. Since the client was attempting to destroy the mobj, it shouldn't matter if the mobj has already been destroyed, so I just added a simple check to see that the mobj really exists prior to the destruction attempt. This seemed to be enough.
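The NULL mobj check amounts to something like the following sketch. The mobj_t here is a simplified stand-in and DestroyMobjIfExists is a hypothetical helper, not a function from the Doomsday source. The point is only that destroying something that is already gone becomes a harmless no-op.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for a map object; the real mobj_t has many more fields. */
typedef struct mobj_s {
    int id;
} mobj_t;

/* Hypothetical helper: destroy the mobj held in *slot, if there is one.
 * Returns 1 if a mobj was destroyed, 0 if the slot was already empty. */
int DestroyMobjIfExists(mobj_t **slot)
{
    assert(slot != NULL);

    if (*slot == NULL) {
        /* Already destroyed (or never created): nothing to do. */
        return 0;
    }

    free(*slot);
    *slot = NULL;  /* make a repeated destroy attempt harmless */
    return 1;
}
```

Clearing the pointer after the destruction means a second request for the same mobj takes the early-out path instead of touching freed memory.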
After a while I was able to run the connect/disconnect cycle for minutes at a time without encountering any problems. There might still be some issues remaining, however, since my test setup didn't cover long resource loading times or any other games besides Doom E1M1.
I did get one instance of a segfault in NetSv_NewPlayerEnters(), but could not reproduce it afterwards. This would crash the server when a new player enters the game.
The fixes are now (rev 3344) checked in, so if anyone fancies doing some netgame testing, now would be an excellent time.
I also noticed some other issues that have to be fixed:
<ul>
<li>Clientside plane movement is very choppy. Clients should move planes locally by utilizing the target height and time values.
<li>Switches were not animated on clientside.
<li>Deathmatch frag counters were missing.
<li>When running with the randomizer, psprite animations were seriously messed up. On multiple occasions, the secondary part of the psprite (muzzle flash) would remain visible after firing. This is probably caused by the psprite animation messages arriving out of order. IIRC, the order of pspr updates is never checked. Perhaps they should be sent over the TCP connection, where they cannot be lost or lose their order. On the other hand, firing animations should be done locally by the client, or otherwise there's always an annoying delay.
</ul>
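Two of the items above can be sketched in C. This is only an outline under assumptions: all the names (clplane_t, ClPlane_Tick, Seq_IsNewer, clpsprite_t, ClPsprite_Apply) are hypothetical, and the 16-bit sequence number on pspr updates is something I'm assuming for the example, not a field the current protocol has. The first part moves a plane locally toward its target height over the remaining tics. The second part rejects a psprite update that arrives after a newer one has already been applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clientside plane mover. */
typedef struct {
    float height;    /* current height */
    float target;    /* height the server says the plane is heading to */
    int   ticsLeft;  /* tics remaining until the target should be reached */
} clplane_t;

/* Advance the plane by one tic, spreading the remaining distance evenly. */
void ClPlane_Tick(clplane_t *p)
{
    assert(p != NULL);

    if (p->ticsLeft <= 1) {
        /* Last step (or no time given): snap to the target. */
        p->height = p->target;
        p->ticsLeft = 0;
        return;
    }
    p->height += (p->target - p->height) / (float)p->ticsLeft;
    p->ticsLeft--;
}

/* True if sequence number a is newer than b, allowing for 16-bit wraparound. */
int Seq_IsNewer(uint16_t a, uint16_t b)
{
    uint16_t diff = (uint16_t)(a - b);
    return diff != 0 && diff < 0x8000u;
}

/* Hypothetical clientside psprite state with the last applied sequence number. */
typedef struct {
    uint16_t lastSeq;
    int      state;
} clpsprite_t;

/* Apply an update only if it is newer than the last one applied.
 * Returns 1 if applied, 0 if the update was stale and got ignored. */
int ClPsprite_Apply(clpsprite_t *ps, uint16_t seq, int state)
{
    assert(ps != NULL);

    if (!Seq_IsNewer(seq, ps->lastSeq)) {
        return 0;
    }
    ps->lastSeq = seq;
    ps->state = state;
    return 1;
}
```

The sequence check only guards against out-of-order arrival. It does nothing for updates that are lost outright, so the muzzle flash could still stick if the message that clears it never arrives.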
PS. I'll pull the video if my iDisk bandwidth starts nearing its limit...
Comments
You can put the vid on jfiles if you want
How are they being moved at present? Surely new height deltas are not being sent for all moving planes every few milliseconds?
<blockquote>Deathmatch frag counters were missing.</blockquote>
That sounds like something I should fix, I'll look into it. I've been thinking about the deathmatch frag display and a few improvements I'd like to implement.
<blockquote>(clientside) psprite animations were seriously messed up</blockquote>
I would guess psprite animation should really happen client side only. We can always do a rough check with the client side world and player ammo info so that psprite animation firing doesn't happen when it shouldn't.
BTW - I'm getting the following on win32 when attempting to build from SVN [3344]:
<blockquote>../../engine\portable\src\sys_system.c(104) : error C2065: 'SIGPIPE' : undeclared identifier</blockquote>
BTW2 - What codec was used to compress the video?
<blockquote>That sounds like something I should fix, I'll look into it. I've been thinking about the deathmatch frag display and a few improvements I'd like to implement.</blockquote> Great. I can see the player names in the automap, but no frag counts.
<blockquote>error C2065: `SIGPIPE' : undeclared identifier</blockquote> I'll see what's going on. Probably a header is missing.
<blockquote>BTW2 - What codec was used to compress the video?</blockquote> That's MPEG-4. iMovie tells me the actual codec is called MPEG-4 Improved, whatever that means. The latest QuickTime player from Apple should play it just fine.
Ah you mean the counters next to the player names in the automap. Ok I'll fix it.
I found a bug with the CVAR hud-frags-all debug display, should be fixed in SVN [3372].