No description
Find a file
Kai Fricke d0678b80ed
[rfc] [air/tune/train] Improve trial/training failure error printing (#27946)
When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run.

The main problems here are:

1. Tracebacks are printed multiple times: In the remote worker and on the driver
2. Tracebacks include many internal wrappers

The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured).

The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code.

### Deduplicating traceback printing

The solution here is to use `logger.error` instead of `logger.exception` in the `function_trainable.py` to avoid printing a traceback in the trainable. 

Additionally, we introduce an environment variable `TUNE_PRINT_ALL_TRIAL_ERRORS` which defaults to 1. If set to 0, trial errors will not be printed at all in the console (only the error.txt files will exist).

To be discussed: We could also default this to 0, but I think the expectation is to see at least some failure output in the console logs per default.

### Removing internal wrappers from tracebacks

The solution here is to introcude a magic local variable `_ray_start_tb`. In two places, we use this magic local variable to reduce the stacktrace. A utility `shorten_tb` looks for the last occurence of `_ray_start_tb` in the stacktrace and starts the traceback from there. This takes only linear time. If the magic variable is not present, the full traceback is returned - this means that if the error does not come up in user code, the full traceback is returned, giving visibility in possible internal bugs. Additionally there is an env variable `RAY_AIR_FULL_TRACEBACKS` which disables traceback shortening.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-26 15:02:38 -07:00
.buildkite Add support for Python 3.10 (#21221) 2022-08-26 11:01:12 -07:00
.github [Datasets] Add Cheng as code owner of data (#27912) 2022-08-18 10:01:21 -07:00
.gitpod [CI] Check test files for if __name__... snippet (#25322) 2022-06-02 10:30:00 +01:00
bazel Revert "update grpc to 1.48.0 (#23246)" (#28101) 2022-08-25 09:37:10 -07:00
binder run code in browser (#22727) 2022-03-02 10:27:00 +01:00
ci Add support for Python 3.10 (#21221) 2022-08-26 11:01:12 -07:00
cpp [C++ worker]Support ActorHandle type return value (#28077) 2022-08-25 10:05:05 +08:00
dashboard [hotfix] Fix pytest dependency in test_utils (#27956) 2022-08-17 12:16:08 -07:00
deploy expose imagePullSecret to values.yaml (#27537) 2022-08-20 06:53:55 -07:00
doc [rfc] [air/tune/train] Improve trial/training failure error printing (#27946) 2022-08-26 15:02:38 -07:00
docker Add support for Python 3.10 (#21221) 2022-08-26 11:01:12 -07:00
java [runtime env][java] Support runtime env config in Java (#28083) 2022-08-26 08:37:39 +08:00
python [rfc] [air/tune/train] Improve trial/training failure error printing (#27946) 2022-08-26 15:02:38 -07:00
release [Core][State Observability] Release test app configs to bypass default limit (#27969) 2022-08-24 18:41:54 -07:00
rllib [RLlib] Tolerate nan metrics in LearnerInfoBuilder. (#27981) 2022-08-23 10:07:32 -07:00
scripts [CI] Add bazel py_test checking for Serve (#25509) 2022-06-07 10:54:10 -07:00
src [runtime env][java] Support runtime env config in Java (#28083) 2022-08-26 08:37:39 +08:00
thirdparty Revert "Revert "[grpc] Upgrade grpc to 1.45.2"" (#24201) 2022-04-26 10:49:54 -07:00
.bazelrc [runtime env] plugin refactor[6/n]: java api refactor (#26783) 2022-07-26 09:00:57 +08:00
.clang-format [Lint] One parameter/argument per line for C++ code (#22725) 2022-03-13 17:05:44 +08:00
.clang-tidy [Lint] Disable modernize-use-override (#19368) 2021-10-13 20:20:08 -07:00
.editorconfig Improve .editorconfig entries (#7344) 2020-02-26 19:05:36 -08:00
.flake8 [Streaming]Farewell : remove all of streaming related from ray repo. (#21770) 2022-01-23 17:53:41 +08:00
.git-blame-ignore-revs Create .git-blame-ignore-revs for black formatting (#25118) 2022-05-23 21:55:57 -07:00
.gitignore [Core] Unrevert "Add retry exception allowlist for user-defined filtering of retryable application-level errors." (#26449) 2022-08-05 16:07:13 -07:00
.gitpod.yml [dev] Enable gitpod (#15420) 2021-04-21 13:26:46 -07:00
.isort.cfg Update import sorting blacklist, enable sorting for experimental dir (#26101) 2022-07-12 21:25:58 -07:00
build-docker.sh Bump Ray Version from 2.0.0.dev0 to 3.0.0.dev0 (#24894) 2022-05-17 19:31:05 -07:00
BUILD.bazel Replace boost::filesystem with std::filesystem (#27522) 2022-08-04 21:33:51 -07:00
build.sh Get rid of build shell scripts and move them to Python (#6082) 2020-07-16 11:26:47 -05:00
CONTRIBUTING.rst Link to the documentation on contributing from CONTRIBUTING.rst (#19396) 2021-11-15 15:34:18 -08:00
LICENSE [State Observability] Use a table format by default (#26159) 2022-07-19 00:54:16 -07:00
pylintrc RLLIB and pylintrc (#8995) 2020-06-17 18:14:25 +02:00
README.rst [docs] Add the AIR technical whitepaper to our docs (#28053) 2022-08-22 16:41:51 -07:00
SECURITY.md Create SECURITY.md (#21521) 2022-01-11 08:54:51 -08:00
setup_hooks.sh [ci] Clean up ci/ directory (refactor ci/travis) (#23866) 2022-04-13 18:11:30 +01:00
WORKSPACE [CI] Bump Bazel version to 4.2.2 (#24242) 2022-05-26 17:09:40 -07:00

.. image:: https://github.com/ray-project/ray/raw/master/doc/source/images/ray_header_logo.png

.. image:: https://readthedocs.org/projects/ray/badge/?version=master
    :target: http://docs.ray.io/en/master/?badge=master

.. image:: https://img.shields.io/badge/Ray-Join%20Slack-blue
    :target: https://forms.gle/9TSdDYUgxYs8SA9e8

.. image:: https://img.shields.io/badge/Discuss-Ask%20Questions-blue
    :target: https://discuss.ray.io/

.. image:: https://img.shields.io/twitter/follow/raydistributed.svg?style=social&logo=twitter
    :target: https://twitter.com/raydistributed

|

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for simplifying ML compute:

.. image:: https://github.com/ray-project/ray/raw/master/doc/source/images/what-is-ray-padded.svg

..
  https://docs.google.com/drawings/d/1Pl8aCYOsZCo61cmp57c7Sja6HhIygGCvSZLi_AuBuqo/edit

Learn more about `Ray AIR`_ and its libraries:

- `Datasets`_: Distributed Data Preprocessing
- `Train`_: Distributed Training
- `Tune`_: Scalable Hyperparameter Tuning
- `RLlib`_: Scalable Reinforcement Learning
- `Serve`_: Scalable and Programmable Serving

Or more about `Ray Core`_ and its key abstractions:

- `Tasks`_: Stateless functions executed in the cluster.
- `Actors`_: Stateful worker processes created in the cluster.
- `Objects`_: Immutable values accessible across the cluster.

Ray runs on any machine, cluster, cloud provider, and Kubernetes, and features a growing
`ecosystem of community integrations`_.

Install Ray with: ``pip install ray``. For nightly wheels, see the
`Installation page <https://docs.ray.io/en/latest/installation.html>`__.

.. _`Serve`: https://docs.ray.io/en/latest/serve/index.html
.. _`Datasets`: https://docs.ray.io/en/latest/data/dataset.html
.. _`Workflow`: https://docs.ray.io/en/latest/workflows/concepts.html
.. _`Train`: https://docs.ray.io/en/latest/train/train.html
.. _`Tune`: https://docs.ray.io/en/latest/tune/index.html
.. _`RLlib`: https://docs.ray.io/en/latest/rllib/index.html
.. _`ecosystem of community integrations`: https://docs.ray.io/en/latest/ray-overview/ray-libraries.html


Why Ray?
--------

Today's ML workloads are increasingly compute-intensive. As convenient as they are, single-node development environments such as your laptop cannot scale to meet these demands.

Ray is a unified way to scale Python and AI applications from a laptop to a cluster.

With Ray, you can seamlessly scale the same code from a laptop to a cluster. Ray is designed to be general-purpose, meaning that it can performantly run any kind of workload. If your application is written in Python, you can scale it with Ray, no other infrastructure required.

More Information
----------------

- `Documentation`_
- `Ray Architecture whitepaper`_
- `Ray AIR Technical whitepaper`_
- `Exoshuffle: large-scale data shuffle in Ray`_
- `Ownership: a distributed futures system for fine-grained tasks`_
- `RLlib paper`_
- `Tune paper`_

*Older documents:*

- `Ray paper`_
- `Ray HotOS paper`_

.. _`Ray AIR`: https://docs.ray.io/en/latest/ray-air/getting-started.html
.. _`Ray Core`: https://docs.ray.io/en/latest/ray-core/walkthrough.html
.. _`Tasks`: https://docs.ray.io/en/latest/ray-core/tasks.html
.. _`Actors`: https://docs.ray.io/en/latest/ray-core/actors.html
.. _`Objects`: https://docs.ray.io/en/latest/ray-core/objects.html
.. _`Documentation`: http://docs.ray.io/en/latest/index.html
.. _`Ray Architecture whitepaper`: https://docs.google.com/document/d/1lAy0Owi-vPz2jEqBSaHNQcy2IBSDEHyXNOQZlGuj93c/preview
.. _`Ray AIR Technical whitepaper`: https://docs.google.com/document/d/1bYL-638GN6EeJ45dPuLiPImA8msojEDDKiBx3YzB4_s/preview
.. _`Exoshuffle: large-scale data shuffle in Ray`: https://arxiv.org/abs/2203.05072
.. _`Ownership: a distributed futures system for fine-grained tasks`: https://www.usenix.org/system/files/nsdi21-wang.pdf
.. _`Ray paper`: https://arxiv.org/abs/1712.05889
.. _`Ray HotOS paper`: https://arxiv.org/abs/1703.03924
.. _`RLlib paper`: https://arxiv.org/abs/1712.09381
.. _`Tune paper`: https://arxiv.org/abs/1807.05118

Getting Involved
----------------

.. list-table::
   :widths: 25 50 25 25
   :header-rows: 1

   * - Platform
     - Purpose
     - Estimated Response Time
     - Support Level
   * - `Discourse Forum`_
     - For discussions about development and questions about usage.
     - < 1 day
     - Community
   * - `GitHub Issues`_
     - For reporting bugs and filing feature requests.
     - < 2 days
     - Ray OSS Team
   * - `Slack`_
     - For collaborating with other Ray users.
     - < 2 days
     - Community
   * - `StackOverflow`_
     - For asking questions about how to use Ray.
     - 3-5 days
     - Community
   * - `Meetup Group`_
     - For learning about Ray projects and best practices.
     - Monthly
     - Ray DevRel
   * - `Twitter`_
     - For staying up-to-date on new features.
     - Daily
     - Ray DevRel

.. _`Discourse Forum`: https://discuss.ray.io/
.. _`GitHub Issues`: https://github.com/ray-project/ray/issues
.. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray
.. _`Meetup Group`: https://www.meetup.com/Bay-Area-Ray-Meetup/
.. _`Twitter`: https://twitter.com/raydistributed
.. _`Slack`: https://forms.gle/9TSdDYUgxYs8SA9e8