Thursday, April 14, 2016

One simple change removed heap allocations from our generic IO extensions

The more I use generics, the more I miss C++ templates (we'll just ignore the horrendous compile errors we can get with them when not using Clang).

While I was profiling our Unity project the other day I was wondering why, exactly, parts of our serialization extensions were popping up in our per-frame heap allocations report. While we don't do many serialization ops in client code, the server nodes do, so I felt it worthwhile to investigate.

I normally only profile in a non-development "release" build (although I'm not certain Unity compiles assemblies with the /optimize+ flag). So by default I'm not seeing file/line info for everything, since Unity doesn't give the option to force include the .mdb files (which our profiler tries to use when tracing allocations and other bits).

After making another build, this time with "Development Build" checked and after modifying the .pdb path in Unity's prebaked executable, I was up and running our memory profiled builds with full file/line info for managed and unmanaged code. The reports showed that line 330 of our extensions file was to blame for the curious heap allocations.

But how can I easily fix this near the end of a milestone without going and changing a bunch of callsites? You can't specialize generic code based on the constraints alone, so I couldn't have two Read<T> methods which differed only by "where T:class/struct". I really didn't want to go and make, plus reeducate others to use, a ReadValue and ReadReference method when our Write<T> extensions don't require such verbose code. A few other suggestions were thrown at me, but I wasn't biting.

But wait...what about default arguments? What if I instead added a parameter of type T that uses its default value? This would allow me to avoid having to update a huge number of callsites, but what about still having to allocate reference objects? Turns out checking for null is enough for them, while value types never compare equal to null, so we can avoid the obj = new T() statement altogether for value types and use default arguments to avoid any other code fixup!


Yeah, I hate it when people post code in picture form too, so I put up the bits on gist.github too.
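
For readers without the gist handy, here is a minimal sketch of the default-argument idea described above. It is not the code from the gist: the IStreamable interface, the Vec2 struct and the ReaderExtensions class are hypothetical stand-ins, and it assumes the extensions sit on top of System.IO.BinaryReader.

```csharp
using System.IO;

// Hypothetical interface standing in for whatever the real extensions deserialize.
public interface IStreamable
{
    void Deserialize(BinaryReader reader);
}

// Hypothetical value type used only to exercise the struct path.
public struct Vec2 : IStreamable
{
    public float X, Y;

    public void Deserialize(BinaryReader reader)
    {
        X = reader.ReadSingle();
        Y = reader.ReadSingle();
    }
}

public static class ReaderExtensions
{
    // Callsites keep writing reader.Read<Foo>(). The optional parameter
    // supplies default(T): null for classes, a zeroed value for structs.
    public static T Read<T>(this BinaryReader reader, T obj = default(T))
        where T : IStreamable, new()
    {
        // Only a reference type can be null here, so only reference types
        // reach new T(); a struct T skips the statement entirely.
        if (obj == null)
            obj = new T();

        obj.Deserialize(reader);
        return obj;
    }
}
```

A call such as reader.Read<Vec2>() still compiles unchanged, which is the whole point: the extra parameter never has to appear at a callsite.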

There can be a hidden cost here, but I consider anything GC related to be the biggest, nondeterministic, hidden cost. Specifically, the larger the T value type is, the more code can get generated to deal with copy-by-value semantics for the stack and return value. It is possible, though, that the runtime or native compiler could work some magic under the hood to use a reference to a stack value. It all depends! I'm not positive what Unity's Mono (IL2CPP would be easier to figure out) is doing under the hood for JIT'd code, but I do know we're no longer generating unexpected temporary garbage in our serialization code!

Tuesday, April 12, 2016

List.Reverse calls Array.Reverse, which may or may not box

You'd expect List<T>.Reverse to not box (i.e., allocate memory on the managed heap) when you're using structs, right? Well, you would be wrong. And you'd expect List<T>.Reverse to not box when using basic builtin types like short or byte, right? Sometimes, you may be right.

This is the Array.Reverse implementation in Unity's Mono (some branch of 2.x). It won't box values when the array (List<T> uses an array internally) is an int[] or double[], but will on everything else (see line 1261).

This is the Array.Reverse implementation in the latest version of Mono. It won't box arrays of builtin types, but your enums and custom structs will still result in boxed values.

This is the Array.Reverse implementation in MS.NET's reference source. At the time of this writing, it will try to call the externally defined TrySZReverse function before going on to just box all teh things. Apparently, according to this API issue, that function is a fast-path for reversing primitive types (but your structs, and I'd imagine your enums, don't fall into that category). That API issue is up for review, so it could be that MS.NET will have sane generic Reverse'ing in the future with no boxing!

So yes, you'd expect that your .NET runtimes wouldn't box when performing List<T>.Reverse or Array.Reverse<T> when T is a value type, but you'd be wrong most of the time.
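
If you need a reverse today that can't box, one option is to swap elements yourself through the typed indexer. This is a minimal sketch, not something taken from any of the runtimes linked above; the ReverseNoBox name and the ListExtensions class are hypothetical.

```csharp
using System.Collections.Generic;

public static class ListExtensions
{
    // Hypothetical helper: reverses the list in place. Every element is read
    // and written as a T through List<T>'s indexer, so values are copied by
    // value and never converted to object.
    public static void ReverseNoBox<T>(this List<T> list)
    {
        for (int i = 0, j = list.Count - 1; i < j; i++, j--)
        {
            T tmp = list[i];
            list[i] = list[j];
            list[j] = tmp;
        }
    }
}
```

The tradeoff is that large structs get copied three times per swap, so it is worth measuring before swapping it in everywhere.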

Leave the boxing to Mike Tyson and Rocky!

Be mindful of Dictionaries keyed with enums. Also, the power of per-frame GC analysis!

If you’re going to use a Dictionary with enums in Unity, or any value types rather*, ensure you provide a custom (hopefully static) IEqualityComparer instance for it to use, instead of letting it use the default implementation. I’m not 100% sure about IL2CPP, but in Mono the default comparer (when used with value types) causes a separate heap allocation every time you do ContainsKey or use the Dictionary indexer (you should also probably ask yourself why you’re using ContainsKey/get_Item instead of TryGetValue, too). The size of each BoehmGC heap allocation, not factoring in things like alignment, is (sizeof(void*) * 2) + sizeof(enum-underlying-type). In our case the underlying type was a byte, so on 32-bit machines each allocation was 9 bytes.
* With structs as keys, you have no choice in AOT environments like iOS

At my day job on an unannounced project, we had a particular enum to describe a set of hard coded things (let's call the enum HardCodedThing). Then a certain set of our source game data (let's call the class ProtoSomething) had two dictionaries that were keyed by this enum for various reasons. In our game's sim we have a method, we'll call it UpdateSomethings(), which does a lot of HardCodedThing indexing. The code is run when a sim is deserialized or when some other state is derived from the sim.

For the past week or so, I've been adding per-frame memory analysis to the custom memory tools that I worked with Rich Geldreich to implement. I marked two custom events (strings which the game runtime sends to the memory profiler at specific points for later reference) as the start and end frames (or rather, the frames which the events took place in). I was interested, well worried, as to why we had so many GC allocations in general when just idling in the game. It turns out our sanity checks for Animator components having parameters before trying to set them were generating garbage (detailed later). It also just so happened that the sim state deriving I mentioned earlier was taking place shortly before the marked last custom event, so heap allocations for the HardCodedThing enum were bubbling up.

The fix to avoid all these allocations was to simply add a Comparer that is then passed in to our Dictionary<HardCodedThing, TValue> instances. The implementation, using the disguised typenames used in this article, can be found here.
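
For a rough idea of the shape of such a comparer, here is a minimal sketch. It is not the linked implementation: the enum members are made up, and HardCodedThingComparer and ProtoSomethingTables are hypothetical names. It does assume the byte underlying type mentioned earlier.

```csharp
using System.Collections.Generic;

// Enum members are invented for illustration; the byte underlying type
// matches the one described in the post.
public enum HardCodedThing : byte { Alpha, Beta, Gamma }

public sealed class HardCodedThingComparer : IEqualityComparer<HardCodedThing>
{
    // One shared, stateless instance so dictionaries don't each allocate a comparer.
    public static readonly HardCodedThingComparer Instance = new HardCodedThingComparer();

    // Both arguments stay typed as the enum, so nothing is converted to object.
    public bool Equals(HardCodedThing x, HardCodedThing y)
    {
        return x == y;
    }

    public int GetHashCode(HardCodedThing obj)
    {
        return (int)obj;
    }
}

public static class ProtoSomethingTables
{
    // Pass the comparer at construction; lookups then go through it
    // instead of the default comparer.
    public static Dictionary<HardCodedThing, TValue> Create<TValue>()
    {
        return new Dictionary<HardCodedThing, TValue>(HardCodedThingComparer.Instance);
    }
}
```

Lookups then pair naturally with TryGetValue, which does the hash and compare once instead of the twice you pay for ContainsKey followed by the indexer.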

Adding the per-frame analysis utilities to our tool has been EXTREMELY helpful (I found some other GC-heavy APIs specific to Unity that I'm sure we're not the only ones ignorant of...should blog about them soon). We're not stuck scratching our heads over what's making the stock Unity Profiler show "GC Allocations per Frame: 9999 / 12 KB" or some general nonsense. We really don't even bother using Unity's Memory profiler. The reason is that it's just a snapshot, and snapshots have limited use. It's also far easier and quicker to make a Standalone build vs a device build, since their tool is IL2CPP only. We instead have a transaction log of all Boehm/Mono memory related operations from the start of the process until the very end. Coupled with our custom events setup we can create near-deterministic transaction logs over multiple runs. In theory, we could probably even set up a BVT to ensure we're not leaking managed memory or churning GC memory more than we already expect to.

And when/if we do, we have a boatload of information at our disposal to diagnose problems. Part of the per-frame analysis was to first spit out a CSV file mapping types seen across a set of frames to how many allocations and frees they were associated with, along with their average garbage (if alloc:free is near 1:1, you have garbage heavy code using that type somewhere!). You can find an excerpt of the CSV report here. I mentioned earlier that we had code sanity checking Animators for parameters we are trying to set, and here you can see a glimpse of how they were showing up in our reports.
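
The per-type tally behind a report like that can be sketched in a few lines. This is only an illustration of the counting step, not our tool's code: MemEvent, TypeTally and FrameReport are hypothetical names, and the real transaction log carries far more than a type name and an alloc/free flag.

```csharp
using System.Collections.Generic;

// Hypothetical, heavily simplified log entry for one frame range.
public struct MemEvent
{
    public string TypeName;
    public bool IsAlloc; // false means the entry records a free
}

public sealed class TypeTally
{
    public int Allocs;
    public int Frees;
}

public static class FrameReport
{
    // Counts allocations and frees per type across the given events.
    public static Dictionary<string, TypeTally> Tally(IEnumerable<MemEvent> events)
    {
        var result = new Dictionary<string, TypeTally>();
        foreach (var e in events)
        {
            TypeTally tally;
            if (!result.TryGetValue(e.TypeName, out tally))
            {
                tally = new TypeTally();
                result.Add(e.TypeName, tally);
            }

            if (e.IsAlloc)
                tally.Allocs++;
            else
                tally.Frees++;
        }
        return result;
    }
}
```

A type whose Allocs and Frees come out nearly equal over a short window of frames is the 1:1 pattern called out above.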

With this broad report, the next thing I added was a 'blame list' report. The blame list details, per type, the sets of backtraces which are spawning these allocations. We then sort and break down these backtraces by their allocations, so the biggest offenders of garbage are at the top. You can find an excerpt of the blame list report here.
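
The grouping and sorting step of a blame list might look something like the following. Again this is a sketch under assumptions rather than our implementation: AllocRecord and BlameList are hypothetical names, and a backtrace is flattened to a single string to keep the example short.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical record: one allocation and the backtrace that produced it.
public struct AllocRecord
{
    public string TypeName;
    public string Backtrace;
}

public static class BlameList
{
    // For one type, returns each distinct backtrace with its allocation
    // count, ordered so the biggest offenders come first.
    public static List<KeyValuePair<string, int>> For(
        string typeName, IEnumerable<AllocRecord> records)
    {
        return records
            .Where(r => r.TypeName == typeName)
            .GroupBy(r => r.Backtrace)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)
            .ToList();
    }
}
```

LINQ allocates plenty of its own garbage, which is fine for an offline report tool and exactly what you'd avoid in per-frame game code.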

Perf matters. Memory usage matters. Tools matter. While the studio I work at probably won't open source our memory tool's full pipeline anytime soon for various reasons (although our Mono modifications can be found here), I'm hoping to use this blog to publicly speak more about bad/good patterns we see/learn from experience and verify using the tools we've developed.