These changes are part of a series intended to improve integration with notebooks. This PR modifies the tune progress status shown to the user if tuning is run from a notebook.
Previously, only part of the trial progress was reported in an HTML table; now, all progress is displayed in an organized HTML template.
Signed-off-by: pdmurray <peynmurray@gmail.com>
PyTorch recommends saving state dictionaries instead of modules, but we don't support any way to do this.
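For reference, the pattern PyTorch recommends looks like this (plain PyTorch, independent of Ray):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Recommended: persist only the state dict...
torch.save(model.state_dict(), "model_state.pt")

# ...and restore it into a freshly constructed module.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load("model_state.pt"))

# Saving the module object itself pickles the whole class, which is
# brittle across code refactors and library versions.
torch.save(model, "model.pt")
```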
Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
Improves the HF notebook by making use of preprocessors and adding a section on tuning. Brings it in line with the Ray Summit 2022 demo.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
This starts breaking the Mac Java build with new errors; I think it is the same issue that made us revert this PR before.
…ment from Java. …" (#27945)"
This reverts commit af488e1.
If this passes, it should be preferred over #28098.
Adjust moto setup to use new API.
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
If we schedule multiple workers on the head node with HuggingFaceTrainer, a race condition can occur where they will begin moving the checkpoint files from their respective rank folders to one checkpoint folder, causing an exception. This PR fixes that and adds a test that would fail without this change.
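One way such a race can be tolerated is sketched below; this is an illustration, not necessarily the exact fix in this PR, and `move_rank_files` is a hypothetical helper name:

```python
import os
import shutil

def move_rank_files(rank_dir: str, checkpoint_dir: str) -> None:
    """Move all files from a worker's rank folder into the shared
    checkpoint folder, tolerating files that another worker on the
    same node already moved."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    for name in os.listdir(rank_dir):
        src = os.path.join(rank_dir, name)
        dst = os.path.join(checkpoint_dir, name)
        try:
            shutil.move(src, dst)
        except (FileExistsError, shutil.Error):
            # Another worker on this node won the race for this file.
            pass
```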
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
The minimum size is 300 GB.
Signed-off-by: Alex Wu <alex@anyscale.io>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
This PR adds a Serve HA test. The flow of the test is:
1. Check the KubeRay build.
2. Start the Ray service.
3. Warm up the cluster.
4. Start killing nodes.
5. Get the stats and make sure they look good.
Automatically enable GPU prediction for Predictors if num_gpus is set for the PredictorDeployment.
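A sketch of what this enables (the import path and `checkpoint` are assumptions for illustration; `checkpoint` would come from a prior training run):

```python
from ray import serve
from ray.serve import PredictorDeployment  # import path assumed for Ray 2.x
from ray.train.torch import TorchPredictor

# `checkpoint` is assumed to be e.g. `result.checkpoint` from a TorchTrainer.
serve.run(
    PredictorDeployment.options(
        name="my_predictor",
        # With this change, setting num_gpus here should be enough to make
        # the wrapped Predictor run on GPU; no separate flag is needed.
        ray_actor_options={"num_gpus": 1},
    ).bind(TorchPredictor, checkpoint)
)
```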
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
This is the first split PR of #25075, which tried to enable the GCS scheduler by default.
This split PR mainly includes:
In GcsPlacementGroupScheduler::CommitBundleResources() and ReturnBundleResources(), we have to trigger pending actors (in GCS) because resources have been updated.
Still in the above two functions, we have to update PG wildcard resources in a special way. A PG's wildcard resources (on a certain node) have to be the sum of all related bundle resources; see the sketch after this list. Even though CommitBundleResources() uses ToNodeBundleResourcesMap() to sum up bundle resources, it does not handle the scenario where a single bundle (or a subset of bundles) is rescheduled, in which case that single bundle's wildcard resources would wrongly override the existing ones (see test_placement_group_reschedule_when_node_dead for such a scenario).
Fix the remaining issues from "[Core][Enable gcs scheduler 3/n] integrate placement group with gcs scheduler" (#24842 (comment)).
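To make the wildcard invariant concrete, here is a small Python sketch; the real logic lives in the C++ GCS scheduler, and the resource-name convention is simplified to CPU only:

```python
def update_wildcard_resources(node_resources: dict, pg_id: str) -> None:
    """Recompute a node's wildcard resources for one placement group as
    the SUM of all committed bundle resources of that group on the node.

    Bundle keys look like "CPU_group_<bundle_index>_<pg_id>", while the
    wildcard key is "CPU_group_<pg_id>". Overriding the wildcard with a
    single rescheduled bundle's resources, instead of summing, is the
    bug described above."""
    wildcard_key = f"CPU_group_{pg_id}"
    node_resources[wildcard_key] = sum(
        amount
        for name, amount in node_resources.items()
        if name.startswith("CPU_group_")
        and name.endswith(pg_id)
        and name != wildcard_key
    )

# Example: bundle 0 (1 CPU) is already committed on the node; bundle 1
# (2 CPUs) is rescheduled alone. The wildcard must become 3, not 2.
resources = {"CPU_group_0_abc": 1.0, "CPU_group_1_abc": 2.0}
update_wildcard_resources(resources, "abc")
assert resources["CPU_group_abc"] == 3.0
```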
Improve docstring for ResultGrid and show API reference and docstring in Tune API section.
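A minimal usage example of the documented class, using the Ray 2.x Tuner API:

```python
from ray import tune

def trainable(config):
    # Report a single metric so the ResultGrid has something to rank.
    tune.report(loss=config["x"] ** 2)

tuner = tune.Tuner(trainable, param_space={"x": tune.grid_search([-2, 0, 2])})
result_grid = tuner.fit()  # returns a ResultGrid

best = result_grid.get_best_result(metric="loss", mode="min")
print(best.config, best.metrics["loss"])
```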
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
The API cleanup in #27060 introduced a regression when merging latest master: changes from #26967 were effectively disabled, retaining cluttered output in RLlib with verbose=2.
Signed-off-by: Kai Fricke <kai@anyscale.com>
When training fails, the console output is currently cluttered with tracebacks which are hard to digest. This problem is exacerbated when running multiple trials in a tuning run.
The main problems here are:
1. Tracebacks are printed multiple times: in the remote worker and on the driver
2. Tracebacks include many internal wrappers
The proposed solution for 1 is to only print tracebacks once (on the driver) or never (if configured).
The proposed solution for 2 is to shorten the tracebacks to include mostly user-provided code.
### Deduplicating traceback printing
The solution here is to use `logger.error` instead of `logger.exception` in `function_trainable.py` to avoid printing a traceback in the trainable.
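The difference in a nutshell, using the standard library `logging` module:

```python
import logging

logger = logging.getLogger(__name__)

try:
    raise RuntimeError("trial failed")
except RuntimeError as exc:
    # logger.exception() would attach the full traceback to the record,
    # so the worker would print it in addition to the driver.
    # logger.error() logs the message only, leaving the driver as the
    # single place that prints the traceback.
    logger.error(f"Trial raised an exception: {exc}")
```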
Additionally, we introduce an environment variable `TUNE_PRINT_ALL_TRIAL_ERRORS` which defaults to 1. If set to 0, trial errors will not be printed at all in the console (only the error.txt files will exist).
To be discussed: we could also default this to 0, but I think the expectation is to see at least some failure output in the console logs by default.
### Removing internal wrappers from tracebacks
The solution here is to introduce a magic local variable `_ray_start_tb`. In two places, we use this magic local variable to reduce the stack trace. A utility `shorten_tb` looks for the last occurrence of `_ray_start_tb` in the stack trace and starts the traceback from there. This takes only linear time. If the magic variable is not present, the full traceback is returned; this means that if the error does not come up in user code, the full traceback is returned, giving visibility into possible internal bugs. Additionally, there is an env variable `RAY_AIR_FULL_TRACEBACKS` which disables traceback shortening.
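A minimal sketch of how such a utility could look; the names `shorten_tb` and `_ray_start_tb` come from the description above, but the exact implementation in the PR may differ:

```python
import os
import types
from typing import Optional

def shorten_tb(
    tb: Optional[types.TracebackType],
) -> Optional[types.TracebackType]:
    # Shortening can be disabled entirely via the env variable.
    if os.environ.get("RAY_AIR_FULL_TRACEBACKS", "0") == "1":
        return tb
    # Single linear pass over the linked list of traceback frames:
    # remember the last frame that defines the magic local variable.
    shortened = None
    current = tb
    while current is not None:
        if "_ray_start_tb" in current.tb_frame.f_locals:
            shortened = current
        current = current.tb_next
    # If the magic variable never appears (e.g. an internal error),
    # return the full traceback for visibility into possible internal bugs.
    return shortened if shortened is not None else tb
```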
Signed-off-by: Kai Fricke <kai@anyscale.com>
According to https://peps.python.org/pep-0338/
> The -m switch provides a benefit here, as it inserts the current directory into sys.path, instead of the directory containing the main module.
We should follow this and not add the driver script's directory to the workers' sys.path. I couldn't find a way to detect that the driver is run via `python -m`, so instead we don't add the script directory to a worker's sys.path if it doesn't exist in the driver's sys.path.
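A sketch of the check; the helper name is hypothetical:

```python
import os
import sys

def should_add_script_dir_to_workers(driver_script_path: str) -> bool:
    """Only propagate the driver script's directory to the workers'
    sys.path if the driver itself has it on sys.path. When the driver
    is launched via `python -m pkg.module`, Python inserts the current
    working directory instead of the script directory, so this check
    returns False in that case."""
    script_dir = os.path.dirname(os.path.abspath(driver_script_path))
    return script_dir in sys.path
```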
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Fixes the "legacy operator" link to point to master, rather than the 2.0.0 branch. The migration README exists in master but not in the 2.0.0 branch.
Adds a sentence explaining that the Ray container has to go first in the container list.
Adds a sentence to the config guide mentioning min/max replicas and linking to autoscaling.
Documents a bug related to GPU auto-detection in KubeRay 0.3.0.
For people who want better control over node failures and want to handle errors such as RayActorError themselves, it's necessary to make things like the actor_id an attribute of the error.
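For example (the `actor_id` attribute is the kind of addition proposed here; treat the attribute name as illustrative):

```python
import ray
from ray.exceptions import RayActorError

@ray.remote
class Worker:
    def ping(self):
        return "pong"

worker = Worker.remote()
ray.kill(worker, no_restart=True)

try:
    ray.get(worker.ping.remote())
except RayActorError as err:
    # With this change, metadata such as the failed actor's ID would be
    # available directly on the error object.
    print(err.actor_id)
```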
Signed-off-by: Jiajie Li <ljjsalt@gmail.com>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
For a more balanced table of contents, makes CloudWatch instructions a subsection of AWS instructions.
This is needed since we are stress-testing the State APIs in a release test, and we will need a larger max limit than the system default max limit; otherwise, the APIs would return an error.
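For context, the Python state API takes an explicit `limit` per call (module path as of Ray 2.0; the raised server-side cap this change configures is separate from this argument):

```python
from ray.experimental.state.api import list_tasks

# Ask for more entries than the default per-call cap; without a matching
# server-side max limit, requests like this would error out.
tasks = list_tasks(limit=10_000)
```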