## Profiling to Reduce Memory Allocations

Updated: Thursday, December 16, 2021

Make it work, make it right, make it fast. – Kent Beck

Yesterday I decided to do some profiling of my open source Go project, Flipt, to try and find some low-hanging fruit that I could optimize to give some quick performance wins. This process allowed me to find some unexpected issues in a popular open source project that I was using for routing and middleware in my application. In the end, I was able to reduce memory allocations in my app by 100x, thus reducing the amount of GC cycles and improving overall performance. Here’s how I went about it.

## Generate

Before I could actually start profiling, I knew that I first needed to generate a large enough amount of traffic to my application to be able to see any patterns. This created an issue since I have nothing that actually uses Flipt in production that would receive any amount of traffic to be useful. So, I reached for an awesome load testing tool, Vegeta, to generate that traffic for me.

Per the Vegeta author(s):

Vegeta is a versatile HTTP load testing tool built out of a need to drill HTTP services with a constant request rate.

This was exactly what I needed as it would allow me to generate a constant stream of requests over as long of a period as required to be able to measure things like heap allocations/usage, goroutines and time spent in GC within my application.

After some experimenting, I ended up with the following command:

echo 'POST http://localhost:8080/api/v1/evaluate' | vegeta attack -rate 1000 -duration 1m -body evaluate.json


This command started vegeta in attack mode, sending HTTP POST requests to Flipt’s REST API at a rate of 1000 requests per second (which I admit is a lot) for a duration of a minute. The JSON payload that I sent to Flipt is not really important, it just needed to be a valid body that would be accepted by the Flipt server so that it would perform proper evaluation of the request.

note: I decided to bombard the /evaluate endpoint of Flipt first, as it performs the most logic and ‘complex’ calculations on the backend which I thought would give me the best idea of where performance could be improved.

Level Up With Go

No spam. I promise.

## Measure

Now that I had a way to generate a large enough amount of traffic, I needed to actually be able to measure the results that this traffic had on my application while it was running. Luckily, Go comes with a pretty great standard set of tools to allow you to measure the performance of your Go programs in the form of it’s pprof suite.

I’m not going to go into the details of enabling pprof in your application, as I don’t think I could do it any better justice than Julia Evans': Profiling Go programs with pprof post (seriously check it out if you haven’t).

Since I use go-chi/chi for my HTTP router in Flipt, I could enable pprof in my application easily by using Chi’s profile middleware.

So, with Flipt running in one window, and Vegata hammering away at it in another, I opened up a third terminal to grab and inspect a heap profile with:

pprof -http=localhost:9090 localhost:8080/debug/pprof/heap


At first I checked out inuse_objects and inuse_space to see what was currently going on in the heap, but didn’t really see much of note. However, once I switched over to view alloc_objects and alloc_space, something piqued my interest.

It seems that something called flate.NewWriter had allocated 19370 MB of memory over the course of a minute. That’s over 19 Gigabytes! Something was clearly going on, but what? If you enlarge the diagram and look closely, you can see that flate.NewWriter is called from gzip.(*Writer).Write which is called from middleware.(*compressResponseWriter).Write. I quickly realized that this had nothing to do with the code in Flipt itself, but was instead a result of my use of Chi’s compression middleware library that provides compression of responses from the API.

// the offending line
r.Use(middleware.Compress(gzip.DefaultCompression))


I commented out the above line and re-ran my profiling tests and low and behold the huge amount of memory allocs was gone!

Before I looked for a fix, I wanted to get another view of these allocations and how it was affecting performance, notably time spent in garbage collection. I remembered that Go also has a tool trace that allows you view the execution of your Go program over time, and includes important stats such as heap usage, number of goroutines running, network and syscalls, and most importantly for our purposes, time spent in GC.

In order to effectively capture a trace, I had to turn down the number of requests per second from Vegeta as I was regularly getting socket: too many open files errors returned from the server. I assumed this was because my ulimit was set too low on my local machine, but I didn’t want to fuss with that at the time.

I re-ran Vegeta with:

echo 'POST http://localhost:8080/api/v1/evaluate' | vegeta attack -rate 100 -duration 2m -body evaluate.json


This gave me 1/10th the number of requests per second but over a longer period of time so that I was able to capture an effective trace.

In another terminal I ran:

wget 'http://localhost:8080/debug/pprof/trace?seconds=60' -O profile/trace


This generated a trace file over a period of 60 seconds and saved it to my local machine. I could then inspect the trace with:

go tool trace profile/trace


This opened up the trace in my browser allowing me to inspect the output visually.

note: I highly recommend Pusher’s: go tool trace post for more info on go tool trace and what you can do with it.

As you can see in the above graph, the heap seems to grow rather quickly and frequently with sharp drop-offs representing GC. You can also see the pronounced blue bars in the GC swimlane, representing time spent in GC.

This was all the evidence I needed to go searching for a solution to my memory woes.

## Fix

To see why flate.NewWriter was causing so many allocations, I needed to look at the Chi source code. To see which version I was using I ran:

➜ go list -m all | grep chi

github.com/go-chi/chi v3.3.4+incompatible


Looking at the source for chi/middleware/compress.go @v3.3.4 I was able to see this method:

func encoderDeflate(w http.ResponseWriter, level int) io.Writer {
dw, err := flate.NewWriter(w, level)
if err != nil {
return nil
}
return dw
}


Upon further inspection I saw that flate.NewWriter was being called for each response via the middleware! This corresponded to the huge amount of allocations I saw earlier when hitting my API at 1000 rps.

Not wanting to lose compression for my API responses nor wanting to search for a new HTTP router and middleware library, I first checked to see if simply updating Chi would solve my problem.

I ran go get -u -v "github.com/go-chi/chi" which updated me to v4.0.2 of Chi however the compression middleware seemed largely unchanged. Indeed, I re-ran my profiling tests and the allocation issue still remained.

Before calling it a day, I decided to search for any issues/PRs that mentioned the compression middleware in Chi and found this one titled: Re-wrote the middleware compression library. In it, the author mentions:

Additionally, this uses a sync.Pool for encoders that have a Reset(io.Writer) method to reduce memory overhead.

Yes!!

Thankfully this PR had been merged to master, but since a new release of Chi had not been created, I needed to upgrade using:

go get -u -v "github.com/go-chi/chi@master"


Luckily for me, this was a backwards compatible upgrade and required no code changes on my part.

## Results

Running the load tests and profiling one last time, I was able to verify that this did indeed fix the issue.

Viewing a new trace, you can see that the heap grows at a more steady rate, and the overall number of GC’s and time spent in GC is reduced:

I soon merged the change to Flipt, knowing I could feel more confident in my application’s performance under load.

### Lessons Learned

1. Don’t assume that (even popular) open source libraries have been optimized or are free of low-hanging fruit.
2. An innocent issue can cause a great ripple effect, especially under load.
3. Use sync.Pool when appropriate.
4. Load testing and profiling are great tools to have in your tool belt.
5. The Go toolset and open source community is amazing, 👏 all around!