Backends define the implementation of your business logic or models that will handle incoming requests.
To support seamless scalability, backends can have many replicas, which are individual processes running in the Ray cluster that handle requests.
To define a backend, you must first define the "handler" or the business logic you'd like to respond with.
The handler should take as input a `Starlette Request object <https://www.starlette.io/requests/>`_ and return any JSON-serializable object as output. For a more customizable response type, the handler may return a
`Starlette Response object <https://www.starlette.io/responses/>`_.
A backend is defined using :mod:`create_backend <ray.serve.api.create_backend>`, and the implementation can be defined as either a function or a class.
A backend consists of a number of *replicas*, which are individual copies of the function or class that are started in separate Ray Workers (processes).
.. code-block:: python

    def handle_request(starlette_request):
        return "hello world"

    class RequestHandler:
        # Take the message to return as an argument to the constructor.
        def __init__(self, msg):
            self.msg = msg

        def __call__(self, starlette_request):
            return self.msg
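Once the handler is defined, the backend can be created. Here is a minimal sketch, assuming Serve has already been started; the backend names and constructor argument are illustrative:

.. code-block:: python

    from ray import serve

    # Register the function as a backend.
    serve.create_backend("hello_backend", handle_request)

    # Register the class as a backend; additional arguments are
    # forwarded to the class constructor.
    serve.create_backend("handler_backend", RequestHandler, "Hello from a class!")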
While backends define the implementation of your request handling logic, endpoints allow you to expose them via HTTP.
Endpoints are "logical" and can have one or multiple backends that serve requests to them.
To create an endpoint, we simply need to specify a name for the endpoint, the name of a backend to handle requests to the endpoint, and the route and methods where it will be accessible.
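For illustration, a hedged sketch using the backend registered above (the endpoint name and route are made up):

.. code-block:: python

    serve.create_endpoint(
        "hello_endpoint",
        backend="hello_backend",
        route="/hello",
        methods=["GET"])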
By default endpoints are serviced only by the backend provided to :mod:`create_endpoint <ray.serve.api.create_endpoint>`, but in some cases you may want to specify multiple backends for an endpoint, e.g., for A/B testing or incremental rollout.
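In that case, traffic can be split across backends. The sketch below assumes a traffic-splitting call such as ``set_traffic`` is available; the backend names and weights are illustrative:

.. code-block:: python

    # Send 90% of traffic to the existing backend and 10% to a candidate.
    serve.set_traffic("hello_endpoint",
                      {"hello_backend": 0.9, "candidate_backend": 0.1})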
Deep learning frameworks such as PyTorch and TensorFlow often use multithreading when performing inference.
The number of CPUs they use is controlled by the ``OMP_NUM_THREADS`` environment variable.
To :ref:`avoid contention<omp-num-thread-note>`, Ray sets ``OMP_NUM_THREADS=1`` by default because Ray workers and actors use a single CPU by default.
If you *do* want to enable this parallelism in your Serve backend, just set ``OMP_NUM_THREADS`` to the desired value either when starting Ray or in your function/class definition:
.. code-block:: bash

    # Set the variable when starting the head node:
    OMP_NUM_THREADS=12 ray start --head

    # Or when starting a worker node that connects to an existing cluster:
    OMP_NUM_THREADS=12 ray start --address=$HEAD_NODE_ADDRESS
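Alternatively, here is a sketch of setting the variable inside the backend definition itself; the class name and thread count are illustrative:

.. code-block:: python

    import os

    class MyBackend:
        def __init__(self):
            # Set OMP_NUM_THREADS before the framework initializes its
            # thread pool so the desired parallelism takes effect.
            os.environ["OMP_NUM_THREADS"] = "12"
            self.model = None  # Load the model here.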
Some other libraries may not respect ``OMP_NUM_THREADS`` and have their own way to configure parallelism.
For example, if you're using OpenCV, you'll need to manually set the number of threads using ``cv2.setNumThreads(num_threads)`` (set to 0 to disable multi-threading).
You can check the configuration using ``cv2.getNumThreads()`` and ``cv2.getNumberOfCPUs()``.
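For example, a minimal sketch:

.. code-block:: python

    import cv2

    # Disable OpenCV's internal multithreading to avoid contention.
    cv2.setNumThreads(0)

    # Inspect the current configuration.
    print(cv2.getNumThreads(), cv2.getNumberOfCPUs())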
User Configuration (Experimental)
---------------------------------
Suppose you want to update a parameter in your model without creating a whole
new backend. You can do this by writing a ``reconfigure`` method for the class
underlying your backend. At runtime, you can then pass in your new parameters
by setting the ``user_config`` field of :mod:`BackendConfig <ray.serve.BackendConfig>`.
The following simple example will make the usage clear:
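Below is a minimal sketch; the ``Threshold`` class and its parameter are illustrative, and ``reconfigure`` is called both when the backend starts and whenever ``user_config`` is updated:

.. code-block:: python

    from ray import serve
    from ray.serve import BackendConfig

    class Threshold:
        def __init__(self):
            # Set via reconfigure rather than the constructor so it can
            # be updated without restarting the replicas.
            self.threshold = None

        def reconfigure(self, config):
            # Called when the backend starts and on user_config updates.
            self.threshold = config["threshold"]

        def __call__(self, starlette_request):
            return self.threshold

    # Pass the initial user_config when creating the backend.
    backend_config = BackendConfig(user_config={"threshold": 0.5})
    serve.create_backend("threshold_backend", Threshold, config=backend_config)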