
hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	d0678b80ed	[rfc] [air/tune/train] Improve trial/training failure error printing (#27946 ) When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run. The main problems here are: 1. Tracebacks are printed multiple times: In the remote worker and on the driver 2. Tracebacks include many internal wrappers The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured). The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code. ### Deduplicating traceback printing The solution here is to use `logger.error` instead of `logger.exception` in the `function_trainable.py` to avoid printing a traceback in the trainable. Additionally, we introduce an environment variable `TUNE_PRINT_ALL_TRIAL_ERRORS` which defaults to 1. If set to 0, trial errors will not be printed at all in the console (only the error.txt files will exist). To be discussed: We could also default this to 0, but I think the expectation is to see at least some failure output in the console logs by default. ### Removing internal wrappers from tracebacks The solution here is to introduce a magic local variable `_ray_start_tb`. In two places, we use this magic local variable to reduce the stacktrace. A utility `shorten_tb` looks for the last occurrence of `_ray_start_tb` in the stacktrace and starts the traceback from there. This takes only linear time. If the magic variable is not present, the full traceback is returned - this means that if the error does not come up in user code, the full traceback is returned, giving visibility into possible internal bugs. Additionally there is an env variable `RAY_AIR_FULL_TRACEBACKS` which disables traceback shortening. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-26 15:02:38 -07:00
Antoni Baum	ea483ecf7a	[AIR][Docs] Clarify how LGBM/XGB trainers work (#28122 )	2022-08-26 14:51:22 -07:00
Kai Fricke	3b3aa80ba3	[tune/ci] Fix link to SigOpt experiment API (#28127 ) Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-26 14:10:53 -07:00
Jiajun Yao	b41ee37c3a	Don't include script directory in sys.path if it's started via python -m (#28043 ) According to https://peps.python.org/pep-0338/ > The -m switch provides a benefit here, as it inserts the current directory into sys.path, instead of the directory contain the main module. We should follow this and not add the driver script directory to the worker's sys.path. I couldn't find a way to detect that the driver is run via `python -m`, so instead we don't add the script directory to the worker's sys.path if it doesn't exist in the driver's sys.path.	2022-08-26 13:27:08 -07:00
Dmitri Gekhtman	ce99cf1b71	[Docs][Kubernetes] Fix link, add a bit of content (#28017 ) Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com> Fixes the "legacy operator" link to point to master, rather than the 2.0.0 branch. The migration README exists in master but not in the 2.0.0 branch. Adds a sentence explaining that the Ray container has to go first in the container list. Adds a sentence to the config guide mentioning min/max replicas and linking to autoscaling. Documents a bug related to GPU auto-detection in KubeRay 0.3.0.	2022-08-26 12:02:18 -07:00
Akash Patel	96d579a4fe	Add support for Python 3.10 (#21221 ) Signed-off-by: acxz <17132214+acxz@users.noreply.github.com>	2022-08-26 11:01:12 -07:00
Amin Allahyar	455fa664e5	Minor update on the key concept explanation (#28032 )	2022-08-26 10:57:58 -07:00
Jiajie Li	6c69ee9a97	Add actor_id in RayActorError (#27802 ) For people who want better control over node failures and want to handle errors such as RayActorError themselves, I think it's necessary to expose things like actor_id as an attribute of the error. Signed-off-by: Jiajie Li <ljjsalt@gmail.com>	2022-08-26 10:46:08 -07:00
Dmitri Gekhtman	e98fdef93e	Move cloudwatch. (#28041 ) Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com> For a more balanced table of contents, makes CloudWatch instructions a subsection of AWS instructions.	2022-08-26 08:55:38 -07:00
Jiajun Yao	5139a5c722	Fix broken gym library link (#28111 ) gymlibrary.ml becomes gymlibrary.dev Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-25 19:52:43 -07:00
Guyang Song	cf2cb66d29	[runtime env][java] Support runtime env config in Java (#28083 ) Support job level and task/actor level runtime env config eg. `setupTimeoutSeconds` and `eagerInstall`.	2022-08-26 08:37:39 +08:00
Max Pumperla	50cb51387e	fixes #25860 (#28097 )	2022-08-25 10:45:35 -07:00
Kai Fricke	be7ba70be3	Revert "update grpc to 1.48.0 (#23246 )" (#28101 ) This reverts commit `8f9b4cf69b`. This broke windows test "test_queue": https://buildkite.com/ray-project/ray-builders-branch/builds/9604#0182c78d-254b-4877-a658-1b25cafcad04	2022-08-25 09:37:10 -07:00
Kai Fricke	cf94a31e7a	[CI] Pin moto to < 4.0.0. (#28098 )	2022-08-25 07:55:25 -07:00
Kai Fricke	e0725d1f1d	[docs/ci] Fix (some) broken linkchecks (#28087 ) Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-25 04:41:35 -07:00
Max Pumperla	ec3c7f855e	[docs] add algolia crawler verification (#28094 )	2022-08-25 01:36:26 -07:00
Tao Wang	b6fe6156f5	[C++ worker]Support ActorHandle type return value (#28077 ) Before we support `ActorHandle` type as parameter, this PR adds support for `ActorHandle` type as return type.	2022-08-25 10:05:05 +08:00
Ricky Xu	7e560ad92c	[Core][State Observability] Release test app configs to bypass default limit (#27969 ) This is needed since we are stress-testing the State APIs in the release test, and we need a larger max limit than the system default; otherwise, the APIs would return an error.	2022-08-24 18:41:54 -07:00
Ian Rodney	8d04afd72b	[Java] Update GSON package (#28072 ) Fixes CVE: https://nvd.nist.gov/vuln/detail/CVE-2022-25647	2022-08-24 13:45:29 -07:00
Steven Morad	ad2bf69548	[AIR; RLlib] Log histograms in wandb. (#28081 )	2022-08-24 08:21:14 -07:00
Artur Niederfahrenhorst	56e7800e0b	[RLlib] Tolerate nan metrics in LearnerInfoBuilder. (#27981 )	2022-08-23 10:07:32 -07:00
Cade Daniel	5fb36d4a7d	Small fixes to job submission cluster docs (#28056 ) I walked through the new job submission cluster docs and sanded down a few rough edges. Signed-off-by: Cade Daniel <cade@anyscale.com>	2022-08-23 09:41:45 -07:00
Cheng Su	debe0cc91f	[Datasets] Re-enable Parquet sampling and add progress bar (#28021 )	2022-08-22 16:59:26 -07:00
Eric Liang	ad40e19ca0	[docs] Add the AIR technical whitepaper to our docs (#28053 )	2022-08-22 16:41:51 -07:00
Akash Patel	8f9b4cf69b	update grpc to 1.48.0 (#23246 ) Updating grpc to 1.48.0. 1.47.0 added support for mac m1.	2022-08-22 14:53:26 -07:00
Artur Niederfahrenhorst	7ddd14b5db	[RLlib] Fix PPOTorchPolicy producing float metrics when not using critic. (#27980 )	2022-08-22 09:41:36 -07:00
Dmitri Gekhtman	227aef381a	Update Kuberay version in CI. (#27967 ) Updates KubeRay version used in CI to v0.3.0-rc.2 (which we expect to be identical to the final v0.3.0). Also removes a couple of old files. Will open a corresponding cherry pick in the Ray 2.0.0 branch. The key thing to verify is that the CI autoscaling test passes here and in the PR and in the PR against the 2.0.0 branch.	2022-08-20 14:50:52 -07:00
Amr Farid	11c9b1779d	expose imagePullSecret to values.yaml (#27537 )	2022-08-20 06:53:55 -07:00
shrekris-anyscale	ded324d6a4	[Docs] Remove topbar overlap on left table of contents (#28031 )	2022-08-20 02:02:16 -07:00
Alex Wu	f886d9737c	[autoscaler][observability] Provide more detailed events when autoscaler fails to launch a node. (#27891 ) This PR makes the autoscaler event system for node launches more detailed. In particular, it does 4 related things: (1) Less verbose logging for node provider exceptions (printed to logs only, not driver). (2) Don't print to driver "adding 1 node(s) of type ..." when nodes don't launch (still print it if the node launch is successful). (3) Print to driver "Failed to launch ...". (4) Don't log a full exception to the driver. The full driver event looks like this ``` Failed to launch 1 node(s) of type quota. (InsufficientInstanceCapacity): We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c. ``` Co-authored-by: Alex <alex@anyscale.com>	2022-08-19 16:27:02 -07:00
Jun Gong	62b91cbec0	[docs][rllib] Documentation for connectors. (#27528 ) Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-08-19 14:35:07 -07:00
Jun Gong	ec38b96eba	[RLlib] quick fix for learning rate schedule for APPO algorithm. (#28013 )	2022-08-19 14:34:34 -07:00
Chen Shen	da79015be3	[2.0] update 2.0.0 benchmarks #27810 update 2.0.0 benchmarks	2022-08-19 10:34:33 -07:00
Richard Liaw	71efee04f6	[air, clusters/docs] add images to air docs and reformat clusters panels (#28011 )	2022-08-19 08:52:47 -07:00
xiaofeng	af488e1cc2	Revert "Revert "[serve][xlang]Support deploying Python deployment from Java. …" (#27945 )	2022-08-18 17:57:37 -07:00
Cade Daniel	a6b7189ab3	Fixing formatting around TODO that found its way into compiled docs. (#28001 ) Signed-off-by: Cade Daniel <cade@anyscale.com> Fixing formatting around TODO that found its way into compiled docs.	2022-08-18 17:46:39 -07:00
shrekris-anyscale	4395f8792f	[Serve] [Docs] Fix link in Serve Config Files documentation (#27993 )	2022-08-18 14:50:23 -07:00
SangBin Cho	9950e9c1f4	[Doc] CLI Reference Documentation Revamp (#27862 ) Take out the CLI reference from the core API subsection. It follows the same CLI reference pattern as other libraries (e.g., Serve has Serve CLI under the Serve API section).	2022-08-18 14:29:31 -07:00
Dmitri Gekhtman	c2ead88aca	[kuberay][docs] Experimental features (#27898 )	2022-08-18 11:37:06 -07:00
Dmitri Gekhtman	98c90b8488	[clusters][docs] Provide urls to content, fix typos (#27936 )	2022-08-18 11:33:04 -07:00
Dmitri Gekhtman	6cf263838f	[docs][touch-up] Add ephemeral storage to Ray-on-K8s example. (#27916 )	2022-08-18 11:29:55 -07:00
Sihan Wang	112f104fb6	[Serve][Doc] Fix user guide tables (#27991 )	2022-08-18 10:55:31 -07:00
Eric Liang	47f3d83379	[docs] Minor AIR figure updates (#27965 )	2022-08-18 10:30:24 -07:00
Jian Xiao	440ae620eb	Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded (#27964 ) There is a risk of using too much memory in StatsActor, because its lifetime is the same as the cluster lifetime. This puts a cap on how many stats to keep, and purges the stats in FIFO order if this cap is exceeded.	2022-08-18 10:25:31 -07:00
Cheng Su	24aeea8332	[Datasets] Add Cheng as code owner of data (#27912 ) Checked with team in Slack channel, did not see objection to add me as code owner. Signed-off-by: Cheng Su <scnju13@gmail.com>	2022-08-18 10:01:21 -07:00
Cheng Su	45e5e8c6ea	[Datasets] Customized serializer for Arrow JSON ParseOptions in read_json (#27911 ) This PR adds a customized serializer for Arrow JSON ParseOptions in read_json. We found that a user wanted to read a JSON file with ParseOptions, but this currently does not work due to a pickle issue (detail of post). So here we add a customized serializer for ParseOptions as a workaround for now, similar to #25821. Signed-off-by: Cheng Su <scnju13@gmail.com>	2022-08-18 10:00:56 -07:00
Simon Mo	6659971f95	[Serve][Java] Add Serve to Jar Building Process (#27976 ) So that they are available to be downloaded and installed on nightly	2022-08-17 23:06:14 -05:00
Jiajun Yao	7d981d6ced	Mark dataset_shuffle_push_based_random_shuffle_100tb as stable (#27963 ) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-17 15:05:15 -07:00
Jiajun Yao	0a3a5e68a4	Revamp ray core design patterns doc [2/n]: too fine grained tasks (#27919 ) Move the code to doc_code. Fix the code example to make batching faster than serial run. Related issue number #27048 Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-17 13:52:50 -07:00
Charles Sun	edde905741	[RLlib] Add Decision Transformer (DT) (#27890 )	2022-08-17 13:49:13 -07:00
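
Illustration for d0678b80ed: the traceback shortening described in that commit can be sketched roughly as follows. This is a minimal sketch based only on the commit message, not Ray's actual implementation. The names `_ray_start_tb` and `shorten_tb` come from the message; the signature, `frame_names`, `user_code`, `internal_wrapper`, and `run_demo` are hypothetical.

```python
import traceback
from types import TracebackType
from typing import List, Optional

# Name of the magic local variable mentioned in the commit message.
MAGIC_LOCAL = "_ray_start_tb"


def shorten_tb(
    tb: Optional[TracebackType], marker: str = MAGIC_LOCAL
) -> Optional[TracebackType]:
    """Return the traceback starting at the last frame that defines `marker`.

    One linear pass over the traceback chain. If no frame defines the
    marker, the full traceback is returned unchanged.
    """
    start = None
    cur = tb
    while cur is not None:
        if marker in cur.tb_frame.f_locals:
            start = cur  # remember the most recent (deepest) match
        cur = cur.tb_next
    return start if start is not None else tb


def frame_names(tb: Optional[TracebackType]) -> List[str]:
    """Helper for the demo: function names of the frames in a traceback."""
    return [frame.name for frame in traceback.extract_tb(tb)]


def user_code() -> None:
    raise ValueError("boom")


def internal_wrapper() -> None:
    _ray_start_tb = True  # marker: shortened tracebacks start at this frame
    user_code()


def run_demo():
    """Return (full frame names, shortened frame names) for a failing call."""
    try:
        internal_wrapper()
    except ValueError as exc:
        return frame_names(exc.__traceback__), frame_names(
            shorten_tb(exc.__traceback__)
        )
```

In this sketch, frames above the marker (here `run_demo`) are dropped, while an error raised without any marker frame keeps its full traceback, which matches the fallback behavior the commit message describes.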

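Illustration for 440ae620eb: a capped store that purges in FIFO order could look roughly like the sketch below. It assumes a plain in-process class; `CappedStats`, `max_stats`, `record`, and `get` are hypothetical names, and this is not the code of Ray's StatsActor.

```python
from collections import OrderedDict
from typing import Any, Hashable, List


class CappedStats:
    """Keep at most `max_stats` entries; evict the oldest inserted first."""

    def __init__(self, max_stats: int = 1000) -> None:
        if max_stats < 1:
            raise ValueError("max_stats must be at least 1")
        self.max_stats = max_stats
        self._stats: "OrderedDict[Hashable, Any]" = OrderedDict()

    def record(self, key: Hashable, value: Any) -> None:
        # Only a new key can grow the store, so only then do we evict.
        if key not in self._stats and len(self._stats) >= self.max_stats:
            self._stats.popitem(last=False)  # drop the oldest entry (FIFO)
        # Updating an existing key keeps its original insertion position.
        self._stats[key] = value

    def get(self, key: Hashable, default: Any = None) -> Any:
        return self._stats.get(key, default)

    def keys(self) -> List[Hashable]:
        return list(self._stats)
```

Eviction here is by insertion order rather than by last access, so a frequently read entry is still purged once enough newer entries arrive.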

14070 commits