Robert Mosolgo

GraphQL Dataloader

GraphQL-Ruby 1.12 ships with a new GraphQL::Dataloader feature for batch-loading data.

It uses Ruby’s Fiber API to manage data dependencies without any intermediary proxy or promise objects. You can enable it in your schema with:

class MySchema < GraphQL::Schema
  # enable the new Fiber-backed batch loader
  use GraphQL::Dataloader
end

This blog post doesn’t cover how to use GraphQL::Dataloader, but you can find those docs on GraphQL-Ruby’s website.

Below, we’ll investigate how it works, why I chose Fibers, and some caveats.

Many thanks to Matt Bessey, whose proof-of-concept began my work on this feature.

How It Works

At a high level, GraphQL::Dataloader works in three steps:

  1. Run GraphQL execution, but queue up jobs to actually resolve fields or load arguments
  2. Spin up Fibers to pull jobs that resolve fields (or load arguments). Those jobs re-enter Step 1 after the user code (field execution) has been called.

    Steps 1 and 2 happen concurrently: GraphQL execution queues up jobs, then the job runner starts running those jobs. The jobs themselves do two things:

    • Run application-defined code,
    • Then, re-enter GraphQL execution, which queues up more jobs

    However, the Fibers that run those jobs may pause at any time. They may call Fiber.yield, which returns control to the parent Fiber. When this happens, the parent Fiber checks if the job queue has any remaining jobs; if it does, it creates a new Fiber to run those jobs until that Fiber pauses or no jobs remain on the queue.

    Jobs are drawn from the queue until no jobs remain. Any paused Fibers are stored for resuming later.

  3. After the job queue is exhausted, GraphQL::Dataloader initiates batched calls to external sources. After each source returns, it updates its own cache of results.

    When all external data sources have made their calls and populated their caches, the Fibers created by Step 2 are each resumed (once), which begins the back-and-forth between Step 1 and Step 2 above.

Let’s look at each of those steps more closely.

Step 1: GraphQL Execution

There are some parts of GraphQL that never require external data. For example:

  • Merging selection sets from various parts of the query
  • Enumerating the selections on an object and finding the corresponding field definitions
  • Checking if the application’s returned value for a field was nil
  • Adding AST location info to an execution error

These operations can run without regard for batch loading. However, execution continues to operations which may require external data:

  • Loading objects by ID (eg argument :post_id, ... loads: Types::Post)
  • Resolving fields (eg field :author)

In order to support batch loading, these operations are captured in jobs (which are Procs) and pushed onto a queue (an Array) to be run later.

Before long, GraphQL-Ruby runs out of operations of the first kind (no data requirement) and queues up some jobs of the second kind (may require data).

At this point, GraphQL::Dataloader “runs” for the first time.

Step 2: Run Jobs

A “job” is a Proc which may call dataloader.yield. By calling dataloader.yield, the job tells GraphQL::Dataloader “I am waiting for some batch-loaded data.” Under the hood, dataloader.yield calls Fiber.yield, which causes the job to pause in-place – no further Ruby code will run until that Fiber is manually resumed.
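The pause-and-resume mechanics are plain Fiber behavior. Here’s a minimal sketch (not GraphQL-Ruby’s actual code) of a job pausing mid-way and being resumed by its parent:

```ruby
# A job that pauses itself, like a field waiting for batched data:
steps = []

job = Fiber.new do
  steps << "started"
  Fiber.yield            # "I am waiting for some batch-loaded data"
  steps << "resumed"
end

job.resume               # runs the job until Fiber.yield returns control
steps << "batch loading" # the parent Fiber is free to do other work here
job.resume               # picks up right after the yield
steps # => ["started", "batch loading", "resumed"]
```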

A field might pause by calling .load on a source, like this:

field :author, Types::Author, null: true

def author
  Sources::Author.load(object.author_id)
end

Source#load is implemented to register a request for data, yield, then return the loaded value:

class GraphQL::Dataloader::Source
  def load(key)
    if results.key?(key)
      results[key]
    else
      pending_keys << key
      dataloader.yield
      results[key]
    end
  end
end

In that way, .load assumes that, after calling dataloader.yield, its cache will have been populated for any pending_keys.
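The whole load-yield-read-back pattern can be sketched in plain Ruby. Here, a hypothetical MiniSource stands in for GraphQL::Dataloader::Source, and an upcasing stub stands in for a real batched fetch:

```ruby
class MiniSource
  def initialize
    @results = {}
    @pending_keys = []
  end

  def load(key)
    return @results[key] if @results.key?(key)
    @pending_keys << key
    Fiber.yield                 # pause until the batch has run
    @results[key]               # assume the cache is populated now
  end

  def run_batch
    # Pretend fetch: in real life, one query for all pending keys
    @pending_keys.each { |k| @results[k] = k.to_s.upcase }
    @pending_keys.clear
  end
end

source = MiniSource.new
result = nil
f = Fiber.new { result = source.load("alice") }
f.resume          # pauses inside `load`
source.run_batch  # batch-populate the cache
f.resume          # `load` returns the cached value
result # => "ALICE"
```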

Jobs that don’t pause will re-enter Step 1. That’s because the job contains a call to continue GraphQL execution. For example:

dataloader.append_job {
  # Call user code
  result = graphql_field.call(object, arguments, context)
  # Store the return value
  update_response(path, result)
  # Continue evaluating the query
  continue_executing(graphql_field.return_type, result, context)
}

However, that continued GraphQL execution will eventually run out of no-data-requirement work to do, and may enqueue new jobs along the way. So, running jobs may cause more jobs to be added to the queue.
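That queue-feeding behavior is easy to see in miniature. In this sketch, running one job enqueues a follow-up job, standing in for continue_executing queueing up child fields:

```ruby
pending_jobs = []
order = []

pending_jobs << lambda do
  order << "resolve field"
  # "Continue executing" by queueing more work:
  pending_jobs << -> { order << "resolve child field" }
end

# Drain the queue, including jobs added along the way:
while (job = pending_jobs.shift)
  job.call
end

order # => ["resolve field", "resolve child field"]
```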

Jobs are run by a collection of Fibers, basically like this:

while pending_jobs.any?
  f = Fiber.new {
    while (job = pending_jobs.shift)
      job.call
    end
  }
  f.resume
  if f.alive?
    paused_fibers << f
  end
end

In the end, we’re left with:

  • An empty job queue. (If there was more work that wasn’t waiting for batched data, we’d want to run it.)
  • A set of Fibers that yielded, and don’t want to be resumed until their data is ready
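Here’s a runnable version of that loop, with two stand-in jobs: one that pauses (as if waiting for data) and one that finishes immediately:

```ruby
pending_jobs = []
paused_fibers = []

pending_jobs << -> { Fiber.yield } # a job that pauses for data
pending_jobs << -> { }             # a job that finishes right away

while pending_jobs.any?
  f = Fiber.new do
    while (job = pending_jobs.shift)
      job.call
    end
  end
  f.resume
  if f.alive?
    paused_fibers << f
  end
end

pending_jobs  # => [] -- the queue is empty
paused_fibers # => one paused Fiber, stored for resuming later
```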

Step 3: Batch-load external data

Once we reach the end of the job queue, we’re left with a set of paused jobs, waiting for data to be ready. GraphQL::Dataloader responds by triggering batch loads for each “source” that received a request for data. (A GraphQL::Dataloader::Source is a “kind” of data that can be batch-loaded.)

It looks kind of like this:

pending_sources.each do |source|
  source.load_requested_data
end

However, two factors complicate this:

  • A source may call out to another source. In this case, the dependent source calls dataloader.yield.
  • When that happens, we should take care to run the dependency before resuming the dependent source.

So, those factors are addressed by:

  • Batch-loading data inside Fibers, so that we can control their flow with dataloader.yield and fiber.resume.
  • Using a stack instead of a queue, so that the most “urgent” sources are run next.

It ends up looking more like this:

def create_source_fiber
  # This fiber will trigger batch loads until it runs out of
  # pending sources or `yield`s
  Fiber.new {
    while (source = pending_sources.shift)
      source.load_requested_data
    end
  }
end

source_fiber_stack = [create_source_fiber]
while (f = source_fiber_stack.pop)
  f.resume
  # If the fiber paused _in the middle_ of resolving data,
  # put the fiber back on the stack.
  if f.alive?
    source_fiber_stack << f
  end
  # But if there are any _more_ sources to run, make a new fiber
  # to run those sources _before_ resuming the paused one
  if pending_sources.any?
    source_fiber_stack << create_source_fiber
  end
end
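To see the stack behavior, here’s a self-contained sketch where a hypothetical source B requests data from a source A and yields; a new Fiber runs A before B is resumed:

```ruby
pending_sources = []
log = []

# Source B depends on source A:
pending_sources << lambda do
  log << "B: start"
  pending_sources << -> { log << "A: load" } # request a dependency
  Fiber.yield                                # pause until A has run
  log << "B: finish"
end

create_source_fiber = lambda do
  Fiber.new do
    while (source = pending_sources.shift)
      source.call
    end
  end
end

source_fiber_stack = [create_source_fiber.call]
while (f = source_fiber_stack.pop)
  f.resume
  # A paused fiber goes back on the stack...
  source_fiber_stack << f if f.alive?
  # ...but newly-requested sources run first:
  source_fiber_stack << create_source_fiber.call if pending_sources.any?
end

log # => ["B: start", "A: load", "B: finish"]
```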

When that concludes, we’ll know that no more batch loading sources are pending.

And Around Again

At that point, any Fibers that called dataloader.yield in Step 2 can be resumed. load_requested_data will have populated internal caches, so values can be returned to the Fibers that requested them.

As described in Step 2, those Fibers will interleave GraphQL execution and application code until they all finish, or they all pause to wait for data.

And so on!

What About Promises?

In fact, there’s a third kind of operation which can depend on external data loading: .syncing Promises (as returned by the venerable batch-loading library GraphQL::Batch). For this reason, the cycle described above also includes the Promise-syncing part of GraphQL-Ruby’s runtime.

Why Fibers?

My interest in Fibers was motivated by two things:

As proven in the prototype of a Fiber-backed dataloader, it’s possible to build batch loading into GraphQL-Ruby without any proxy objects or promises. In my experience, those objects add a lot of cognitive overhead to the source code and, because they’re unfamiliar, lead to subtle bugs or misbehaviors that we aren’t great at identifying.

Additionally, the Fiber Scheduler API could give us parallel I/O “for free.” In that setup, Fibers that make I/O or system calls automatically call Fiber.yield. GraphQL::Dataloader could use that signal to start off with another Fiber, executing some other part of the GraphQL query. Beyond that, it looks like a Fiber scheduler should also be able to resume Fibers in an evented manner. In my implementation described above, a long-running I/O call would cause any Fibers “behind” it to wait. But with an evented scheduler, you could resume Fibers with short-running external calls even while the long-running ones are still waiting to return.

Caveats

Although I find these approaches really promising, I also see some possible trouble down the line:

  • Fibers are not well-adopted in Ruby/Rails. Thread.current[...] values are not assigned inside new Fibers, and lots of libraries and applications use that for “global” context. (GraphQL-Ruby did, before this feature was added!)
  • Fibers are hard to debug and profile. The backtrace of a Fiber begins with Fiber.new, so it loses some context. Ruby profilers (at least ruby-prof) might not play nice with Fibers.

I haven’t adopted this dataloader in my day job yet, but I hope I can make time to try it out soon and sort some of these out for myself.

Conclusion

GraphQL-Ruby’s new Fiber-backed dataloader offers a slick API and might bring parallel-by-default I/O to GraphQL execution.