hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 10:01:43 -05:00


.. image:: https://github.com/ray-project/ray/raw/master/doc/source/images/ray_header_logo.png

.. image:: https://readthedocs.org/projects/ray/badge/?version=master
    :target: http://docs.ray.io/en/master/?badge=master

.. image:: https://img.shields.io/badge/Ray-Join%20Slack-blue
    :target: https://forms.gle/9TSdDYUgxYs8SA9e8

.. image:: https://img.shields.io/badge/Discuss-Ask%20Questions-blue
    :target: https://discuss.ray.io/

.. image:: https://img.shields.io/twitter/follow/raydistributed.svg?style=social&logo=twitter
    :target: https://twitter.com/raydistributed

|

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for simplifying ML compute:

.. image:: https://github.com/ray-project/ray/raw/master/doc/source/images/what-is-ray-padded.svg

..
  https://docs.google.com/drawings/d/1Pl8aCYOsZCo61cmp57c7Sja6HhIygGCvSZLi_AuBuqo/edit

Learn more about `Ray AIR`_ and its libraries:

- `Datasets`_: Distributed Data Preprocessing
- `Train`_: Distributed Training
- `Tune`_: Scalable Hyperparameter Tuning
- `RLlib`_: Scalable Reinforcement Learning
- `Serve`_: Scalable and Programmable Serving

Or learn more about `Ray Core`_ and its key abstractions:

- `Tasks`_: Stateless functions executed in the cluster.
- `Actors`_: Stateful worker processes created in the cluster.
- `Objects`_: Immutable values accessible across the cluster.

Ray runs on any machine, cluster, cloud provider, and Kubernetes, and features a growing
`ecosystem of community integrations`_.

Install Ray with: ``pip install ray``. For nightly wheels, see the
`Installation page <https://docs.ray.io/en/latest/installation.html>`__.

.. _`Serve`: https://docs.ray.io/en/latest/serve/index.html
.. _`Datasets`: https://docs.ray.io/en/latest/data/dataset.html
.. _`Workflow`: https://docs.ray.io/en/latest/workflows/concepts.html
.. _`Train`: https://docs.ray.io/en/latest/train/train.html
.. _`Tune`: https://docs.ray.io/en/latest/tune/index.html
.. _`RLlib`: https://docs.ray.io/en/latest/rllib/index.html
.. _`ecosystem of community integrations`: https://docs.ray.io/en/latest/ray-overview/ray-libraries.html


Why Ray?
--------

Today's ML workloads are increasingly compute-intensive. As convenient as they are, single-node development environments such as your laptop cannot scale to meet these demands.

Ray is a unified way to scale Python and AI applications from a laptop to a cluster.

With Ray, the same code runs unchanged on a laptop and on a cluster. Ray is designed to be general-purpose, so it can run any kind of workload efficiently. If your application is written in Python, you can scale it with Ray; no other infrastructure is required.

More Information
----------------

- `Documentation`_
- `Ray Architecture whitepaper`_
- `Ray AIR Technical whitepaper`_
- `Exoshuffle: large-scale data shuffle in Ray`_
- `Ownership: a distributed futures system for fine-grained tasks`_
- `RLlib paper`_
- `Tune paper`_

*Older documents:*

- `Ray paper`_
- `Ray HotOS paper`_

.. _`Ray AIR`: https://docs.ray.io/en/latest/ray-air/getting-started.html
.. _`Ray Core`: https://docs.ray.io/en/latest/ray-core/walkthrough.html
.. _`Tasks`: https://docs.ray.io/en/latest/ray-core/tasks.html
.. _`Actors`: https://docs.ray.io/en/latest/ray-core/actors.html
.. _`Objects`: https://docs.ray.io/en/latest/ray-core/objects.html
.. _`Documentation`: http://docs.ray.io/en/latest/index.html
.. _`Ray Architecture whitepaper`: https://docs.google.com/document/d/1lAy0Owi-vPz2jEqBSaHNQcy2IBSDEHyXNOQZlGuj93c/preview
.. _`Ray AIR Technical whitepaper`: https://docs.google.com/document/d/1bYL-638GN6EeJ45dPuLiPImA8msojEDDKiBx3YzB4_s/preview
.. _`Exoshuffle: large-scale data shuffle in Ray`: https://arxiv.org/abs/2203.05072
.. _`Ownership: a distributed futures system for fine-grained tasks`: https://www.usenix.org/system/files/nsdi21-wang.pdf
.. _`Ray paper`: https://arxiv.org/abs/1712.05889
.. _`Ray HotOS paper`: https://arxiv.org/abs/1703.03924
.. _`RLlib paper`: https://arxiv.org/abs/1712.09381
.. _`Tune paper`: https://arxiv.org/abs/1807.05118

Getting Involved
----------------

.. list-table::
   :widths: 25 50 25 25
   :header-rows: 1

   * - Platform
     - Purpose
     - Estimated Response Time
     - Support Level
   * - `Discourse Forum`_
     - For discussions about development and questions about usage.
     - < 1 day
     - Community
   * - `GitHub Issues`_
     - For reporting bugs and filing feature requests.
     - < 2 days
     - Ray OSS Team
   * - `Slack`_
     - For collaborating with other Ray users.
     - < 2 days
     - Community
   * - `StackOverflow`_
     - For asking questions about how to use Ray.
     - 3-5 days
     - Community
   * - `Meetup Group`_
     - For learning about Ray projects and best practices.
     - Monthly
     - Ray DevRel
   * - `Twitter`_
     - For staying up-to-date on new features.
     - Daily
     - Ray DevRel

.. _`Discourse Forum`: https://discuss.ray.io/
.. _`GitHub Issues`: https://github.com/ray-project/ray/issues
.. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray
.. _`Meetup Group`: https://www.meetup.com/Bay-Area-Ray-Meetup/
.. _`Twitter`: https://twitter.com/raydistributed
.. _`Slack`: https://forms.gle/9TSdDYUgxYs8SA9e8