Robert Mosolgo

Optimizing GraphQL-Ruby

Soon, graphql-ruby 1.5.0 will be released. Query execution will be ~70% faster than 1.3.0!

Let’s look at how we reduced the execution time between those two versions. Thanks to @theorygeek, who optimized the middleware chain and helped me pinpoint several other bottlenecks!

The Benchmark

To track GraphQL execution overhead, I execute the introspection query on a fixture schema in graphql-ruby’s test suite.

On GraphQL 1.3.0, the benchmark ran at around 22.5 iterations per second; on master, it runs at around 38 iterations per second.

That’s almost 1.7x faster!

38.0 / 22.5
# => 1.6888888888888889

So, how’d we do it?

Looking Under the Hood with RubyProf

To find where time was spent, I turned to ruby-prof, wrapping GraphQL execution with profiling and inspecting the result.
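
The wrapping itself is only a few lines. Roughly (a minimal sketch; MySchema and query_string stand in for the benchmark’s fixture schema and introspection query):

require "ruby-prof"

result = RubyProf.profile do
  MySchema.execute(query_string)
end

# Print a flat profile, sorted by self time
RubyProf::FlatPrinter.new(result).print(STDOUT)

Here’s the flat profile on 1.3.0: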

Thread ID: 70149906635240
Fiber ID: 70149911114440
Total: 0.474618
Sort by: self_time

 %self      total      self      wait     child     calls  name
  4.60      0.074     0.022     0.000     0.052     6893  *Class#new
  3.99      0.019     0.019     0.000     0.000     8715  *GraphQL::Define::InstanceDefinable#ensure_defined
  3.13      0.015     0.015     0.000     0.000    25403   Module#===
  2.64      0.013     0.013     0.000     0.000     8813   Kernel#hash
  2.49      0.074     0.012     0.000     0.063     3496  *GraphQL::Schema::MiddlewareChain#call
  1.85      0.009     0.009     0.000     0.000     4184   GraphQL::Query::Context::FieldResolutionContext#query
  1.78      0.017     0.008     0.000     0.008     2141   #<Module:0x007f9a18de37a8>#type
  1.63      0.008     0.008     0.000     0.000     1960   GraphQL::Query::Context::FieldResolutionContext#initialize
  1.54      0.012     0.007     0.000     0.005     1748   GraphQL::Query#get_field
  1.53      0.014     0.007     0.000     0.006     1748   GraphQL::Query#arguments_for
  1.52      0.007     0.007     0.000     0.000     8356   Kernel#is_a?
  1.51      0.010     0.007     0.000     0.003     7523   Kernel#===
  1.44      0.022     0.007     0.000     0.015     1959   GraphQL::Query::Context::FieldResolutionContext#spawn
  1.32      0.012     0.006     0.000     0.006     1748   GraphQL::Execution::Lazy::LazyMethodMap#get
  1.31      0.010     0.006     0.000     0.003     1748   GraphQL::Execution::FieldResult#value=
  1.29      0.032     0.006     0.000     0.026     1748   GraphQL::Field#resolve
  1.25      0.042     0.006     0.000     0.037     1748   #<Module:0x007f9a18de37a8>#resolve
  1.16      0.015     0.006     0.000     0.010     1748   GraphQL::Execution::FieldResult#initialize
  1.06      0.010     0.005     0.000     0.005     2815   GraphQL::Schema::Warden#visible?
  1.05      0.014     0.005     0.000     0.009     1748   GraphQL::Schema::MiddlewareChain#initialize
  1.03      0.005     0.005     0.000     0.000     2815   <Module::GraphQL::Query::NullExcept>#call
  0.97      0.014     0.005     0.000     0.009      756   Hash#each_value
# ... truncated ...

A few things stood out:

  • ~5% of time was spent during ~7k calls to Class#new: this is time spent initializing new objects. I think initialization can also trigger garbage collection (if there’s not a spot on the free list), so this may include GC time.
  • ~4% of time was spent during ~9k calls to InstanceDefinable#ensure_defined, which is part of graphql-ruby’s definition API. It’s all overhead to support the definition API, 😿.
  • Several methods are called 1748 times. Turns out, this is once per field in the response.
  • With that in mind, 25,403 seems like a lot of calls to Module#===!

Reduce GC Pressure

Since Class#new was the call with the most self time, I thought I’d start there. What kind of objects are being allocated? We can filter the profile output:

~/code/graphql $ cat 130_prof.txt | grep initialize
  1.63      0.008     0.008     0.000     0.000     1960   GraphQL::Query::Context::FieldResolutionContext#initialize
  1.16      0.015     0.006     0.000     0.010     1748   GraphQL::Execution::FieldResult#initialize
  1.05      0.014     0.005     0.000     0.009     1748   GraphQL::Schema::MiddlewareChain#initialize
  0.69      0.006     0.003     0.000     0.002     1833   Kernel#initialize_dup
  0.46      0.002     0.002     0.000     0.000     1768   Array#initialize_copy
  0.30      0.001     0.001     0.000     0.000      419   GraphQL::Execution::SelectionResult#initialize
  0.28      0.001     0.001     0.000     0.000      466   Hash#initialize
  0.17      0.010     0.001     0.000     0.009       92   GraphQL::InternalRepresentation::Selection#initialize
  0.15      0.002     0.001     0.000     0.001      162   Set#initialize
  0.15      0.001     0.001     0.000     0.000       70   GraphQL::InternalRepresentation::Node#initialize
  0.07      0.001     0.000     0.000     0.001       58   GraphQL::StaticValidation::FieldsWillMerge::FieldDefinitionComparison#initialize
  0.04      0.001     0.000     0.000     0.000       64   GraphQL::Query::Arguments#initialize
  0.01      0.000     0.000     0.000     0.000       11   GraphQL::StaticValidation::FragmentsAreUsed::FragmentInstance#initialize
  0.01      0.000     0.000     0.000     0.000        1   GraphQL::Query#initialize
# ... truncated ...

Lots of GraphQL internals! That’s good news though: those are within scope for optimization.

MiddlewareChain was ripe for a refactor. In the old implementation, each field resolution created a middleware chain, then used it and discarded it. However, this was a waste of objects. Middlewares don’t change during query execution, so we should be able to reuse the same list of middlewares for each field.
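
Here’s a simplified sketch of the before-and-after shape of that change (illustrative classes, not graphql-ruby’s actual code):

# Before: each field resolution got its own chain, which consumed
# a fresh copy of the middleware list with `shift`.
class PerFieldChain
  def initialize(middlewares, &final_step)
    @middlewares = middlewares.dup # a new Array for every field
    @final_step = final_step
  end

  def call(*args)
    if (middleware = @middlewares.shift)
      middleware.call(*args) { |*next_args| call(*next_args) }
    else
      @final_step.call(*args)
    end
  end
end

# After: the list is never mutated, so one chain can be built up front
# and reused for every field; only an integer index advances.
class ReusableChain
  def initialize(middlewares, &final_step)
    @middlewares = middlewares
    @final_step = final_step
  end

  def invoke(args, index = 0)
    if index == @middlewares.length
      @final_step.call(*args)
    else
      @middlewares[index].call(*args) { |*next_args| invoke(next_args, index + 1) }
    end
  end
end

(The #invoke / #invoke_core pair you’ll see in the final profile below is the real implementation’s equivalent of this traversal.)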

This required a bit of refactoring, since the old implementation modified the array (with shift) as it worked through the middlewares. In the end, the improvement was added in 5549e0cf. As a bonus, the number of copied Arrays (shown by Array#initialize_copy) also dropped dramatically, since those copies had been part of MiddlewareChain’s internal state. Calls to Array#shift disappeared entirely, since the array is no longer modified:

~/code/graphql $ cat 130_prof.txt | grep shift
  0.61      0.003     0.003     0.000     0.000     3496   Array#shift
~/code/graphql $ cat 150_prof.txt | grep shift
~/code/graphql $

🎉 !

The number of FieldResult objects was also reduced. FieldResult is used for execution bookkeeping in some edge cases, but is often unneeded. So, we could optimize by removing the FieldResult object when we had a plain value (and therefore no bookkeeping was needed): 07cbfa89
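
The shape of that change is roughly this (purely illustrative; see 07cbfa89 for the real thing):

# Only allocate the bookkeeping wrapper when it's actually needed;
# plain values pass through untouched.
FieldResult = Struct.new(:value) # stand-in for the real bookkeeping object

def finish_field(value, needs_bookkeeping)
  needs_bookkeeping ? FieldResult.new(value) : value
end

finish_field("plain value", false) # => "plain value", no extra object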

A very modest optimization was also applied to GraphQL::Arguments, reusing the same object for empty argument lists (4b07c9b4) and reusing the argument default values on a field-level basis (4956149d).
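
The empty-list case can be sketched like this (illustrative names, not graphql-ruby’s actual internals):

class Arguments
  def initialize(values)
    @values = values
  end

  # One shared, frozen instance for the very common "no arguments" case
  NO_ARGS = new({}).freeze

  def self.from(values)
    values.empty? ? NO_ARGS : new(values)
  end
end

Arguments.from({}).equal?(Arguments.from({})) # => true: same object, no new allocation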

Avoid Duplicate Calculations

Some elements of a GraphQL schema don’t change during execution. As long as this holds true, we can cache the results of some calculations and avoid recalculating them.

A simple caching approach is to use a hash whose keys are the inputs and whose values are the cached outputs:

# Read-through cache for summing two numbers
#
# The first layer of the cache is the left-hand number:
read_through_sum = Hash.new do |hash1, left_num|
  # The second layer of the cache is the right-hand number:
  hash1[left_num] = Hash.new do |hash2, right_num|

    # And finally, the result is stored as a value in the second hash:
    puts "Adding #{left_num} + #{right_num}"
    hash2[right_num] = left_num + right_num
  end
end

read_through_sum[1][2]
# "Adding 1 + 2"
# => 3

read_through_sum[1][2]
# => 3

The first lookup printed a message and returned a value, but the second lookup returned the value without printing anything. That’s because the block wasn’t called the second time; the cached value was returned immediately.

This approach was applied aggressively to GraphQL::Schema::Warden, an object which manages schema visibility on a query-by-query basis. Since the visibility of a schema member would remain constant during the query, we could cache the results of visibility checks: first 1a28b104, then 27b36e89.
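
Applied to visibility checks, the pattern looks roughly like this (an illustrative sketch, not the real Warden code):

class Warden
  def initialize(filter)
    @filter = filter
    # Read-through cache, just like the sum example above: the answer
    # for a given schema member can't change during the query.
    @visibility_cache = Hash.new { |cache, member| cache[member] = @filter.call(member) }
  end

  def visible?(schema_member)
    @visibility_cache[schema_member]
  end
end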

This was also applied to field lookup in 133ed1b1e and to lazy_resolve handler lookup in 283fc19d.

Use yield Instead of &block

Due to the implementation of Ruby’s VM, yielding to a block with yield is much faster than capturing it with &block and calling block.call. @theorygeek migrated MiddlewareChain to the yield approach in 517cec34.
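
If you want to see the difference on your own machine, a benchmark-ips comparison like this one works (the exact numbers vary by Ruby version):

require "benchmark/ips"

def call_block(&block)
  block.call # capturing the block materializes a Proc object
end

def yield_block
  yield # no Proc is materialized
end

Benchmark.ips do |x|
  x.report("block.call") { call_block { :ok } }
  x.report("yield")      { yield_block { :ok } }
  x.compare!
end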

Remove Overhead from Lazy Definition API (warning: terrible hack)

In order to handle circular definitions, graphql-ruby’s .define { ... } blocks aren’t executed immediately. Instead, they’re stored and evaluated only when a definition-dependent value is required. To achieve this, every definition-dependent method was preceded by a call to ensure_defined.

Maybe you remember that method from the very top of the profiler output above:

 %self      total      self      wait     child     calls  name
  4.60      0.074     0.022     0.000     0.052     6893  *Class#new
  3.99      0.019     0.019     0.000     0.000     8715  *GraphQL::Define::InstanceDefinable#ensure_defined

By the time a GraphQL::Schema is defined, all of its lazy definitions have been executed. This means that during query execution, calling ensure_defined is always a waste!

I found a way to remove the overhead, but it was a huge hack. It works like this:

When a definition is added (with .define):

  • store the definition block for later
  • find each definition-dependent method definition on the defined object and gather them into an array:

    @pending_methods = method_names.map { |n| self.class.instance_method(n) }
    
  • replace those methods with dummy methods which:
    • call ensure_defined
    • re-apply all @pending_methods, overriding the dummy methods
    • call the real method (which was just re-applied)

This way, subsequent calls to definition-dependent methods don’t call ensure_defined. ensure_defined removed itself from the class definition after its work was done!
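
Here’s a tiny standalone sketch of the pattern (not graphql-ruby’s actual code, just the self-replacing-method trick on its own):

class LazilyDefined
  def self.lazy_method(name, &real_implementation)
    @pending_methods ||= {}
    @pending_methods[name] = real_implementation

    # Dummy method: pays the ensure_defined cost once, then disappears
    define_method(name) do |*args|
      self.class.ensure_defined
      send(name, *args) # re-dispatch to the real method, just installed
    end
  end

  def self.ensure_defined
    # (Run any stored definition blocks here, then...)
    # overwrite every dummy method with its real implementation:
    @pending_methods.each { |method_name, impl| define_method(method_name, &impl) }
    @pending_methods.clear
  end

  lazy_method(:name) { "Expensive to define!" }
end

thing = LazilyDefined.new
thing.name # first call goes through ensure_defined and replaces the dummy
thing.name # later calls hit the real method directly, no ensure_defined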

You can see the whole hack in 18d73a58. For all my teasing, this is something that makes Ruby so powerful: if you can imagine it, you can code it!

The Final Product

Two minor releases later, the profile output is looking better! Here’s the output on master:

Thread ID: 70178713115080
Fiber ID: 70178720382840
Total: 0.310395
Sort by: self_time

 %self      total      self      wait     child     calls  name
  4.06      0.013     0.013     0.000     0.000     7644   Kernel#hash
  2.93      0.021     0.009     0.000     0.012     2917  *Class#new
  2.89      0.009     0.009     0.000     0.000     4184   GraphQL::Query::Context::FieldResolutionContext#query
  2.74      0.009     0.009     0.000     0.000    13542   Module#===
  2.60      0.008     0.008     0.000     0.000     1960   GraphQL::Query::Context::FieldResolutionContext#initialize
  2.27      0.013     0.007     0.000     0.006     1748   GraphQL::Query#arguments_for
  2.25      0.010     0.007     0.000     0.003     7523   Kernel#===
  2.14      0.022     0.007     0.000     0.015     1959   GraphQL::Query::Context::FieldResolutionContext#spawn
  2.09      0.007     0.007     0.000     0.000     8260   Kernel#is_a?
  1.87      0.039     0.006     0.000     0.033     1748   GraphQL::Schema::RescueMiddleware#call
  1.75      0.013     0.005     0.000     0.008     1748   GraphQL::Execution::Lazy::LazyMethodMap#get
  1.69      0.005     0.005     0.000     0.000     2259   Kernel#class
  1.68      0.044     0.005     0.000     0.039     3496  *GraphQL::Schema::MiddlewareChain#invoke_core
  1.33      0.004     0.004     0.000     0.000     1747   GraphQL::Query::Context::FieldResolutionContext#schema
  1.31      0.029     0.004     0.000     0.025     1748   <Module::GraphQL::Execution::Execute::FieldResolveStep>#call
  1.20      0.004     0.004     0.000     0.000     1748   GraphQL::Execution::SelectionResult#set
  1.15      0.048     0.004     0.000     0.044     1748   GraphQL::Schema::MiddlewareChain#invoke
  1.14      0.017     0.004     0.000     0.013     1748   GraphQL::Schema#lazy_method_name
  1.07      0.004     0.003     0.000     0.001     1044   Kernel#public_send
  1.05      0.020     0.003     0.000     0.017     1748   GraphQL::Schema#lazy?
  1.02      0.004     0.003     0.000     0.000     1806   GraphQL::InternalRepresentation::Node#definition

Here are the wins:

  • Object allocations reduced by 58% (quick math below)
  • Method calls to gem code and Ruby built-ins reduced by … a lot!
  • Calls to ensure_defined reduced by 100% 😆
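
(The 58% comes from the Class#new call counts in the two profiles: 6,893 on 1.3.0 versus 2,917 on master.)

1 - (2917.0 / 6893)
# => 0.5768...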

And, as shown in the benchmark above, 1.7x faster query execution!

There’s one caveat: these optimizations apply to the GraphQL runtime only. Real-world GraphQL performance depends on more than that: application-specific details like database access, remote API calls, and application code performance.