hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Stephanie Wang 55a0f7bb2d [core] ray.init defaults to an existing Ray instance if there is one (#26678)	2022-07-23 11:27:22 -07:00

ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one.

This makes two changes to the ray.init() resolution order:

1. When `ray start` is called, the started cluster address was already written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, autodetect any running cluster (legacy behavior) if address="auto", or we will start a new local Ray instance if address=None.
2. When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for `ray.client` use cases.

This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init:

```
(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>)
```

Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started.

Co-authored-by: Eric Liang <ekhliang@gmail.com>

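
The resolution order described in the commit message above can be summarized as a small decision function. This is only a sketch of the described behavior, not Ray's implementation: ``resolve_init_target`` and its return strings are hypothetical, the contents of ``/tmp/ray/ray_current_cluster`` are passed in as a plain argument instead of being read from disk, and the last branch (an explicit address is used as given) is an assumption that the commit message implies but does not spell out.

.. code-block:: python

    from typing import Optional


    def resolve_init_target(address: Optional[str],
                            cluster_file_address: Optional[str]) -> str:
        """Hypothetical helper: describe what ray.init(address=...) would do.

        `cluster_file_address` stands for the address recorded by `ray start`
        (None or "" if the file is missing or empty).
        """
        # address="local" always starts a fresh local instance.
        if address == "local":
            return "start new local instance"
        # No address or "auto": prefer the cluster recorded by `ray start`.
        if address is None or address == "auto":
            if cluster_file_address:
                return f"connect to {cluster_file_address}"
            if address == "auto":
                return "autodetect running cluster (legacy behavior)"
            return "start new local instance"
        # Assumption: any other explicit address is used as given.
        return f"connect to {address}"


    print(resolve_init_target(None, None))
    print(resolve_init_target(None, "10.0.0.1:6379"))
    print(resolve_init_target("local", "10.0.0.1:6379"))

The address ``10.0.0.1:6379`` is a made-up placeholder used only to exercise the branches.
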
.buildkite	Allow grpcio >= 1.48 (#26765 )	2022-07-21 10:03:41 -07:00
.github	Remove @mwtian from code owners #26769	2022-07-19 22:39:29 -07:00
.gitpod	[CI] Check test files for `if __name__...` snippet (#25322 )	2022-06-02 10:30:00 +01:00
bazel	Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" (#26525 )	2022-07-18 21:21:19 -07:00
binder	run code in browser (#22727 )	2022-03-02 10:27:00 +01:00
ci	Revert "[ci] fix determine_tests_to_run.py by finding merge base (#26790 )" (#26799 )	2022-07-20 22:12:13 +01:00
cpp	[Core][C++ worker] Add GetNamespace api (#26509 )	2022-07-20 11:17:14 +08:00
dashboard	[core] ray.init defaults to an existing Ray instance if there is one (#26678 )	2022-07-23 11:27:22 -07:00
deploy	Exposed upscaling_speed and idle_timeout_minutes to values.yaml, #25312 (#25495 )	2022-06-06 13:26:06 -04:00
doc	[core] ray.init defaults to an existing Ray instance if there is one (#26678 )	2022-07-23 11:27:22 -07:00
docker	Fix redis dependency (#26459 )	2022-07-12 16:07:09 -07:00
java	Cleanup ActorContext due to multi actor instances got removed. (#26497 )	2022-07-15 23:30:09 +08:00
python	[core] ray.init defaults to an existing Ray instance if there is one (#26678 )	2022-07-23 11:27:22 -07:00
release	Revert "[RLlib] Fix apex breakout release test performance. (#26867 )" (#26927 )	2022-07-23 17:27:50 +02:00
rllib	[RLlib]: Raise deprecation warning in MARWIL OPE methods. (#26893 )	2022-07-23 13:55:40 +02:00
scripts	[CI] Add bazel py_test checking for Serve (#25509 )	2022-06-07 10:54:10 -07:00
src	[Core] Fix typo in logging during lineage reconstruction #26681	2022-07-20 11:18:12 -07:00
thirdparty	Revert "Revert "[grpc] Upgrade grpc to 1.45.2"" (#24201 )	2022-04-26 10:49:54 -07:00
.bazelrc	[Core] Set c++ terminate handler to print stack trace (#26444 )	2022-07-12 13:54:20 -07:00
.clang-format	[Lint] One parameter/argument per line for C++ code (#22725 )	2022-03-13 17:05:44 +08:00
.clang-tidy	[Lint] Disable `modernize-use-override` (#19368 )	2021-10-13 20:20:08 -07:00
.editorconfig	Improve .editorconfig entries (#7344 )	2020-02-26 19:05:36 -08:00
.flake8	[Streaming]Farewell : remove all of streaming related from ray repo. (#21770 )	2022-01-23 17:53:41 +08:00
.git-blame-ignore-revs	Create `.git-blame-ignore-revs` for black formatting (#25118 )	2022-05-23 21:55:57 -07:00
.gitignore	[docs] Improve AIR table of contents titles (#26858 )	2022-07-22 17:17:49 -07:00
.gitpod.yml	[dev] Enable gitpod (#15420 )	2021-04-21 13:26:46 -07:00
.isort.cfg	Update import sorting blacklist, enable sorting for experimental dir (#26101 )	2022-07-12 21:25:58 -07:00
build-docker.sh	Bump Ray Version from 2.0.0.dev0 to 3.0.0.dev0 (#24894 )	2022-05-17 19:31:05 -07:00
BUILD.bazel	[runtime env] plugin refactor[4/n]: remove runtime env protobuf (#26522 )	2022-07-15 13:56:12 +08:00
build.sh	Get rid of build shell scripts and move them to Python (#6082 )	2020-07-16 11:26:47 -05:00
CONTRIBUTING.rst	Link to the documentation on contributing from CONTRIBUTING.rst (#19396 )	2021-11-15 15:34:18 -08:00
LICENSE	[State Observability] Use a table format by default (#26159 )	2022-07-19 00:54:16 -07:00
pylintrc	RLLIB and pylintrc (#8995 )	2020-06-17 18:14:25 +02:00
README.rst	[air] update documentation to use `session.report` (#26051 )	2022-06-30 10:37:31 -07:00
SECURITY.md	Create SECURITY.md (#21521 )	2022-01-11 08:54:51 -08:00
setup_hooks.sh	[ci] Clean up ci/ directory (refactor ci/travis) (#23866 )	2022-04-13 18:11:30 +01:00
WORKSPACE	[CI] Bump Bazel version to 4.2.2 (#24242 )	2022-05-26 17:09:40 -07:00

README.rst

.. image:: https://github.com/ray-project/ray/raw/master/doc/source/images/ray_header_logo.png

.. image:: https://readthedocs.org/projects/ray/badge/?version=master
    :target: http://docs.ray.io/en/master/?badge=master

.. image:: https://img.shields.io/badge/Ray-Join%20Slack-blue
    :target: https://forms.gle/9TSdDYUgxYs8SA9e8

.. image:: https://img.shields.io/badge/Discuss-Ask%20Questions-blue
    :target: https://discuss.ray.io/

.. image:: https://img.shields.io/twitter/follow/raydistributed.svg?style=social&logo=twitter
    :target: https://twitter.com/raydistributed

|


**Ray provides a simple, universal API for building distributed applications.**

Ray is packaged with the following libraries for accelerating machine learning workloads:

- `Tune`_: Scalable Hyperparameter Tuning
- `RLlib`_: Scalable Reinforcement Learning
- `Train`_: Distributed Deep Learning (beta)
- `Datasets`_: Distributed Data Loading and Compute

As well as libraries for taking ML and distributed apps to production:

- `Serve`_: Scalable and Programmable Serving
- `Workflows`_: Fast, Durable Application Flows (alpha)

There are also many `community integrations <https://docs.ray.io/en/master/ray-libraries.html>`_ with Ray, including `Dask`_, `MARS`_, `Modin`_, `Horovod`_, `Hugging Face`_, `Scikit-learn`_, and others. Check out the `full list of Ray distributed libraries here <https://docs.ray.io/en/master/ray-libraries.html>`_.

Install Ray with: ``pip install ray``. For nightly wheels, see the
`Installation page <https://docs.ray.io/en/master/installation.html>`__.

.. _`Modin`: https://github.com/modin-project/modin
.. _`Hugging Face`: https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer.hyperparameter_search
.. _`MARS`: https://docs.ray.io/en/latest/data/mars-on-ray.html
.. _`Dask`: https://docs.ray.io/en/latest/data/dask-on-ray.html
.. _`Horovod`: https://horovod.readthedocs.io/en/stable/ray_include.html
.. _`Scikit-learn`: https://docs.ray.io/en/master/joblib.html
.. _`Serve`: https://docs.ray.io/en/master/serve/index.html
.. _`Datasets`: https://docs.ray.io/en/master/data/dataset.html
.. _`Workflows`: https://docs.ray.io/en/master/workflows/concepts.html
.. _`Train`: https://docs.ray.io/en/master/train/train.html


Quick Start
-----------

Execute Python functions in parallel.

.. code-block:: python

    import ray
    ray.init()

    @ray.remote
    def f(x):
        return x * x

    futures = [f.remote(i) for i in range(4)]
    print(ray.get(futures))

To use Ray's actor model:

.. code-block:: python


    import ray
    ray.init()

    @ray.remote
    class Counter(object):
        def __init__(self):
            self.n = 0

        def increment(self):
            self.n += 1

        def read(self):
            return self.n

    counters = [Counter.remote() for i in range(4)]
    [c.increment.remote() for c in counters]
    futures = [c.read.remote() for c in counters]
    print(ray.get(futures))


Ray programs can run on a single machine, and can also seamlessly scale to large clusters. To execute the above Ray script in the cloud, just download `this configuration file <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml>`__, and run:

``ray submit [CLUSTER.YAML] example.py --start``

Read more about `launching clusters <https://docs.ray.io/en/master/cluster/index.html>`_.

Tune Quick Start
----------------

.. image:: https://github.com/ray-project/ray/raw/master/doc/source/images/tune-wide.png

`Tune`_ is a library for hyperparameter tuning at any scale.

- Launch a multi-node distributed hyperparameter sweep in less than 10 lines of code.
- Supports any deep learning framework, including PyTorch, `PyTorch Lightning <https://github.com/williamFalcon/pytorch-lightning>`_, TensorFlow, and Keras.
- Visualize results with `TensorBoard <https://www.tensorflow.org/tensorboard>`__.
- Choose among scalable SOTA algorithms such as `Population Based Training (PBT)`_, `Vizier's Median Stopping Rule`_, and `HyperBand/ASHA`_.
- Tune integrates with many optimization libraries such as `Facebook Ax <http://ax.dev>`_, `HyperOpt <https://github.com/hyperopt/hyperopt>`_, and `Bayesian Optimization <https://github.com/fmfn/BayesianOptimization>`_ and enables you to scale them transparently.

To run this example, you will need to install the following:

.. code-block:: bash

    $ pip install "ray[tune]"


This example runs a parallel grid search to optimize an example objective function.

.. code-block:: python

    from ray import tune
    from ray.air import session


    def objective(step, alpha, beta):
        return (0.1 + alpha * step / 100)**(-1) + beta * 0.1


    def training_function(config):
        # Hyperparameters
        alpha, beta = config["alpha"], config["beta"]
        for step in range(10):
            # Iterative training function - can be any arbitrary training procedure.
            intermediate_score = objective(step, alpha, beta)
            # Feed the score back to Tune.
            session.report({"mean_loss": intermediate_score})


    analysis = tune.run(
        training_function,
        config={
            "alpha": tune.grid_search([0.001, 0.01, 0.1]),
            "beta": tune.choice([1, 2, 3])
        })

    print("Best config: ", analysis.get_best_config(metric="mean_loss", mode="min"))

    # Get a dataframe for analyzing trial results.
    df = analysis.results_df

If TensorBoard is installed, automatically visualize all trial results:

.. code-block:: bash

    tensorboard --logdir ~/ray_results
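
To see what the search is optimizing, the objective can also be evaluated by hand. The sketch below is plain Python and does not use Tune: it scores every ``alpha``/``beta`` combination at the last training step (``step=9``) and picks the lowest value. It assumes the final reported ``mean_loss`` is the value being compared, and it enumerates all three ``beta`` values, whereas ``tune.choice`` samples ``beta`` at random for each trial, so an actual run may not cover every combination.

.. code-block:: python

    import itertools


    def objective(step, alpha, beta):
        # Same example objective as in the Tune snippet above.
        return (0.1 + alpha * step / 100) ** (-1) + beta * 0.1


    alphas = [0.001, 0.01, 0.1]
    betas = [1, 2, 3]
    final_step = 9  # last value produced by range(10)

    # Score every combination at the final step.
    scores = {
        (alpha, beta): objective(final_step, alpha, beta)
        for alpha, beta in itertools.product(alphas, betas)
    }

    # Lower is better, matching mode="min".
    best_alpha, best_beta = min(scores, key=scores.get)
    print("Best (alpha, beta):", (best_alpha, best_beta))

For ``step > 0`` the first term shrinks as ``alpha`` grows and the second term grows with ``beta``, so the largest ``alpha`` and the smallest ``beta`` give the lowest score.
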

.. _`Tune`: https://docs.ray.io/en/master/tune.html
.. _`Population Based Training (PBT)`: https://docs.ray.io/en/master/tune/api_docs/schedulers.html#population-based-training-tune-schedulers-populationbasedtraining
.. _`Vizier's Median Stopping Rule`: https://docs.ray.io/en/master/tune/api_docs/schedulers.html#median-stopping-rule-tune-schedulers-medianstoppingrule
.. _`HyperBand/ASHA`: https://docs.ray.io/en/master/tune/api_docs/schedulers.html#asha-tune-schedulers-ashascheduler

RLlib Quick Start
-----------------

.. image:: https://github.com/ray-project/ray/raw/master/doc/source/rllib/images/rllib-logo.png

`RLlib`_ is an industry-grade library for reinforcement learning (RL), built on top of Ray.
It offers high scalability and unified APIs for a
`variety of industry and research applications <https://www.anyscale.com/event-category/ray-summit>`_.

.. code-block:: bash

    $ pip install "ray[rllib]" tensorflow  # or torch


.. Do NOT edit the following code directly in this README! Instead, edit
    the ray/rllib/examples/documentation/rllib_on_ray_readme.py script and then
    copy the new code in here:

.. code-block:: python

    import gym
    from ray.rllib.algorithms.ppo import PPO


    # Define your problem using Python and OpenAI's Gym API:
    class SimpleCorridor(gym.Env):
        """Corridor in which an agent must learn to move right to reach the exit.

        ---------------------
        | S | 1 | 2 | 3 | G |   S=start; G=goal; corridor_length=5
        ---------------------

        Possible actions to choose from are: 0=left; 1=right
        Observations are floats indicating the current field index, e.g. 0.0 for
        starting position, 1.0 for the field next to the starting position, etc.
        Rewards are -0.1 for all steps, except when reaching the goal (+1.0).
        """

        def __init__(self, config):
            # Goal index: a corridor with N fields has positions 0..N-1.
            self.end_pos = config["corridor_length"] - 1
            self.cur_pos = 0
            self.action_space = gym.spaces.Discrete(2)  # left and right
            self.observation_space = gym.spaces.Box(0.0, self.end_pos, shape=(1,))

        def reset(self):
            """Resets the episode and returns the initial observation of the new one.
            """
            self.cur_pos = 0
            # Return initial observation.
            return [self.cur_pos]

        def step(self, action):
            """Takes a single step in the episode given `action`

            Returns:
                New observation, reward, done-flag, info-dict (empty).
            """
            # Walk left.
            if action == 0 and self.cur_pos > 0:
                self.cur_pos -= 1
            # Walk right.
            elif action == 1:
                self.cur_pos += 1
            # Set `done` flag when end of corridor (goal) reached.
            done = self.cur_pos >= self.end_pos
            # +1.0 when the goal is reached, otherwise -0.1.
            reward = 1.0 if done else -0.1
            return [self.cur_pos], reward, done, {}


    # Create an RLlib Trainer instance.
    trainer = PPO(
        config={
            # Env class to use (here: our gym.Env sub-class from above).
            "env": SimpleCorridor,
            # Config dict to be passed to our custom env's constructor.
            "env_config": {
                # Use corridor with 20 fields (including S and G).
                "corridor_length": 20
            },
            # Parallelize environment rollouts.
            "num_workers": 3,
        })

    # Train for n iterations and report results (mean episode rewards).
    # Since we have to move at least 19 times in the env to reach the goal and
    # each move gives us -0.1 reward (except the last move at the end: +1.0),
    # we can expect to reach an optimal episode reward of -0.1*18 + 1.0 = -0.8
    for i in range(5):
        results = trainer.train()
        print(f"Iter: {i}; avg. reward={results['episode_reward_mean']}")
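
The expected optimal reward can be checked without RLlib or gym. The sketch below is a hypothetical helper, not part of RLlib: it replays the "always walk right" policy with the same reward rule as ``SimpleCorridor`` (-0.1 per step, +1.0 on the step that reaches the goal) and assumes the goal is 19 moves from the start, as stated in the comment above.

.. code-block:: python

    def walk_right_episode(goal_pos):
        """Always step right from position 0 until `goal_pos` is reached."""
        cur_pos, steps, total_reward = 0, 0, 0.0
        done = False
        while not done:
            cur_pos += 1
            steps += 1
            done = cur_pos >= goal_pos
            # Same reward rule as the environment: +1.0 at the goal, else -0.1.
            total_reward += 1.0 if done else -0.1
        return steps, total_reward


    steps, total_reward = walk_right_episode(goal_pos=19)
    print(f"steps={steps}; total-reward={total_reward:.1f}")
    # prints: steps=19; total-reward=-0.8

A trained agent's ``episode_reward_mean`` should approach this value from below, since every detour to the left costs an extra -0.1 per step.
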


After training, you may want to perform action computations (inference) in your environment.
Here is a minimal example of how to do this. Also
`check out our more detailed examples here <https://github.com/ray-project/ray/tree/master/rllib/examples/inference_and_serving>`_
(in particular for `normal models <https://github.com/ray-project/ray/blob/master/rllib/examples/inference_and_serving/policy_inference_after_training.py>`_,
`LSTMs <https://github.com/ray-project/ray/blob/master/rllib/examples/inference_and_serving/policy_inference_after_training_with_lstm.py>`_,
and `attention nets <https://github.com/ray-project/ray/blob/master/rllib/examples/inference_and_serving/policy_inference_after_training_with_attention.py>`_).

.. code-block:: python

    # Perform inference (action computations) based on given env observations.
    # Note that we are using a slightly different env here (len 10 instead of 20),
    # however, this should still work as the agent has (hopefully) learned
    # to "just always walk right!"
    env = SimpleCorridor({"corridor_length": 10})
    # Get the initial observation (should be: [0.0] for the starting position).
    obs = env.reset()
    done = False
    total_reward = 0.0
    # Play one episode.
    while not done:
        # Compute a single action, given the current observation
        # from the environment.
        action = trainer.compute_single_action(obs)
        # Apply the computed action in the environment.
        obs, reward, done, info = env.step(action)
        # Sum up rewards for reporting purposes.
        total_reward += reward
    # Report results.
    print(f"Played 1 episode; total-reward={total_reward}")


.. _`RLlib`: https://docs.ray.io/en/master/rllib/index.html


Ray Serve Quick Start
---------------------

.. image:: https://raw.githubusercontent.com/ray-project/ray/master/doc/source/serve/logo.svg
  :width: 400

`Ray Serve`_ is a scalable model-serving library built on Ray. It is:

- Framework Agnostic: Use the same toolkit to serve everything from deep
  learning models built with frameworks like PyTorch or TensorFlow & Keras
  to Scikit-Learn models or arbitrary business logic.
- Python First: Configure your model serving declaratively in pure Python,
  without needing YAMLs or JSON configs.
- Performance Oriented: Turn on batching, pipelining, and GPU acceleration to
  increase the throughput of your model.
- Composition Native: Create "model pipelines" by composing multiple
  models together to drive a single prediction.
- Horizontally Scalable: Serve can linearly scale as you add more machines. Enable
  your ML-powered service to handle growing traffic.

To run this example, you will need to install the following:

.. code-block:: bash

    $ pip install scikit-learn
    $ pip install "ray[serve]"

This example serves a scikit-learn gradient boosting classifier.

.. code-block:: python

    import pickle
    import requests

    from sklearn.datasets import load_iris
    from sklearn.ensemble import GradientBoostingClassifier

    from ray import serve

    serve.start()

    # Train model.
    iris_dataset = load_iris()
    model = GradientBoostingClassifier()
    model.fit(iris_dataset["data"], iris_dataset["target"])

    @serve.deployment(route_prefix="/iris")
    class BoostingModel:
        def __init__(self, model):
            self.model = model
            self.label_list = iris_dataset["target_names"].tolist()

        async def __call__(self, request):
            payload = (await request.json())["vector"]
            print(f"Received request with data {payload}")

            prediction = self.model.predict([payload])[0]
            human_name = self.label_list[prediction]
            return {"result": human_name}


    # Deploy model.
    BoostingModel.deploy(model)

    # Query it!
    sample_request_input = {"vector": [1.2, 1.0, 1.1, 0.9]}
    response = requests.get("http://localhost:8000/iris", json=sample_request_input)
    print(response.text)
    # Result:
    # {
    #  "result": "versicolor"
    # }
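
The request handling inside ``__call__`` can be exercised without starting Serve or training a model. The sketch below reuses the same logic with stand-ins: ``StubRequest``, ``StubModel``, and ``BoostingModelLogic`` are hypothetical names introduced here for illustration, and the stub model always predicts class index 1 instead of running a real classifier. The label list matches the iris ``target_names``.

.. code-block:: python

    import asyncio


    class StubRequest:
        """Hypothetical stand-in for the HTTP request object passed to __call__."""

        def __init__(self, body):
            self._body = body

        async def json(self):
            return self._body


    class StubModel:
        """Hypothetical stand-in for the fitted classifier; always predicts class 1."""

        def predict(self, rows):
            return [1 for _ in rows]


    class BoostingModelLogic:
        """Same handling as the deployment above, without the Serve decorator."""

        def __init__(self, model, label_list):
            self.model = model
            self.label_list = label_list

        async def __call__(self, request):
            # Extract the feature vector from the JSON body.
            payload = (await request.json())["vector"]
            # Predict a class index and map it to a human-readable label.
            prediction = self.model.predict([payload])[0]
            return {"result": self.label_list[prediction]}


    handler = BoostingModelLogic(StubModel(), ["setosa", "versicolor", "virginica"])
    response = asyncio.run(handler(StubRequest({"vector": [1.2, 1.0, 1.1, 0.9]})))
    print(response)

Keeping the handler logic callable on its own like this makes it straightforward to unit-test before deploying.
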


.. _`Ray Serve`: https://docs.ray.io/en/master/serve/index.html

More Information
----------------

- `Documentation`_
- `Tutorial`_
- `Blog`_
- `Ray 1.0 Architecture whitepaper`_ **(new)**
- `Exoshuffle: large-scale data shuffle in Ray`_ **(new)**
- `RLlib paper`_
- `RLlib flow paper`_
- `Tune paper`_

*Older documents:*

- `Ray paper`_
- `Ray HotOS paper`_

.. _`Documentation`: http://docs.ray.io/en/master/index.html
.. _`Tutorial`: https://github.com/ray-project/tutorial
.. _`Blog`: https://medium.com/distributed-computing-with-ray
.. _`Ray 1.0 Architecture whitepaper`: https://docs.google.com/document/d/1lAy0Owi-vPz2jEqBSaHNQcy2IBSDEHyXNOQZlGuj93c/preview
.. _`Exoshuffle: large-scale data shuffle in Ray`: https://arxiv.org/abs/2203.05072
.. _`Ray paper`: https://arxiv.org/abs/1712.05889
.. _`Ray HotOS paper`: https://arxiv.org/abs/1703.03924
.. _`RLlib paper`: https://arxiv.org/abs/1712.09381
.. _`RLlib flow paper`: https://arxiv.org/abs/2011.12719
.. _`Tune paper`: https://arxiv.org/abs/1807.05118

Getting Involved
----------------

.. list-table::
   :widths: 25 50 25 25
   :header-rows: 1

   * - Platform
     - Purpose
     - Estimated Response Time
     - Support Level
   * - `Discourse Forum`_
     - For discussions about development and questions about usage.
     - < 1 day
     - Community
   * - `GitHub Issues`_
     - For reporting bugs and filing feature requests.
     - < 2 days
     - Ray OSS Team
   * - `Slack`_
     - For collaborating with other Ray users.
     - < 2 days
     - Community
   * - `StackOverflow`_
     - For asking questions about how to use Ray.
     - 3-5 days
     - Community
   * - `Meetup Group`_
     - For learning about Ray projects and best practices.
     - Monthly
     - Ray DevRel
   * - `Twitter`_
     - For staying up-to-date on new features.
     - Daily
     - Ray DevRel

.. _`Discourse Forum`: https://discuss.ray.io/
.. _`GitHub Issues`: https://github.com/ray-project/ray/issues
.. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray
.. _`Meetup Group`: https://www.meetup.com/Bay-Area-Ray-Meetup/
.. _`Twitter`: https://twitter.com/raydistributed
.. _`Slack`: https://forms.gle/9TSdDYUgxYs8SA9e8