hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Clarence Ng	d22f62640c	Merge branch 'master' into oomreleaset Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>	2022-09-01 13:04:59 -07:00
Simon Mo	2b732dd1be	[CI] Skip windows://python/ray/serve:test_air_integrations_gpu (#28243 ) No GPU on Windows. Signed-off-by: simon-mo <simon.mo@hey.com>	2022-09-01 12:08:04 -07:00
Ricky Xu	5e0cf74377	remove env (#28218 ) Try not to set special flags for nightly test. Signed-off-by: rickyyx <rickyx@anyscale.com>	2022-09-01 11:58:13 -07:00
Clarence Ng	2ad32a9ef9	format Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>	2022-09-01 11:16:14 -07:00
zcin	4c970cc882	[serve] Visualize Deployment Graph with Gradio (#27897 )	2022-09-01 10:46:15 -07:00
Antoni Baum	48898aa03d	[AIR][CI] Speed up HF CI by ~20% (#28208 ) Speeds up HuggingFaceTrainer/Predictor tests in CI by around ~20% by switching to a different GPT model. This is the same model Hugging Face team uses for their own CI. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>	2022-09-01 18:18:10 +01:00
Clarence Ng	27683e1901	Merge branch 'master' into oomreleaset Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>	2022-09-01 09:59:01 -07:00
Clarence Ng	3cff48e213	test Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>	2022-09-01 09:58:02 -07:00
Clarence Ng	f40a168a6a	tests Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>	2022-08-31 22:49:01 -07:00
Clarence Ng	73a37e8f9a	release test Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>	2022-08-31 22:42:34 -07:00
clarng	ac6d63e397	on (#28014 ) Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>	2022-08-31 22:30:41 -07:00
Philipp Moritz	1bba65705a	[doc] Convert custom datetime column when reading a CSV file (#27854 ) Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>	2022-08-31 21:25:28 -07:00
Yi Cheng	d0b879cdb1	[workflow] Change name in step to task_id (#28151 ) We've deprecated the name options and use task_id. This is the cleanup to fix everything left.	2022-08-31 20:27:32 -07:00
shrekris-anyscale	f747415d80	[Serve] [Doc] Restore documentation about host and port in Serve config (#28219 )	2022-08-31 20:27:00 -07:00
Clarence Ng	670c7da148	oom release test Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>	2022-08-31 19:05:42 -07:00
Alan Guo	91cacd6214	Don't unfold first node in dashboard unless there is only one node in the cluster (#28108 ) fixes #28107 Also moves the Host / Cmd Line column to be the first column so nodes and workers can be more easily distinguished.	2022-08-31 19:05:24 -07:00
Stephanie Wang	213e24cafd	[tests] Remove unnecessary sleep time from pipelined ingest tests #28182 Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>	2022-08-31 17:43:58 -07:00
Justin Yu	5cec2492bb	Fix tune resources example code (#28210 ) The tune resources user guide contained broken code snippets. This PR fixes those, adds some extra clarifying comments, and improves the code style for readability. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>	2022-08-31 14:48:41 -07:00
Ricky Xu	ed2929185c	[Core][State Observability] Wait for all nodes in release test (#28190 ) Release tests are failing in Buildkite runs but succeed reliably on manual retry. The suspected cause is that not all nodes are available when running with a large number of actors.	2022-08-31 13:52:19 -07:00
clarng	65fdd720f9	[core] memory monitor observability improvements: add metrics and log message (#27716 ) Add more observability and record events when the raylet kills a task or actor due to memory usage going above threshold.	2022-08-31 13:50:40 -07:00
Artur Niederfahrenhorst	f420407b0d	[ML] Pin Pydantic <= 1.9.2 (#28205 ) CI is red because of a dependency issue around dataclass_transform . Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-08-31 13:35:18 -07:00
xwjiang2010	958c22a0b0	[tune] Update GPU warning message in tune. (#28167 ) Mention scaling config / with resources Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-08-31 12:29:09 -07:00
Alex Wu	dc08ce55ee	Add autoscaler code owners (#28213 ) We already had these on docs; it was a bit of an oversight not to add them to the autoscaler itself too. Signed-off-by: Alex Wu <alex@anyscale.io>	2022-08-31 12:02:09 -07:00
Jiajun Yao	5e2437923d	[Core] Remove unused args for default_worker.py (#28177 ) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-30 21:43:02 -07:00
Yi Cheng	4bff702e7b	[deflakey] Deflakey gcs_heartbeat_manager_test (#28142 ) The heartbeat check runs every second, so it could happen in < 1s, which means it could happen very soon. This PR decreases the check period.	2022-08-30 15:26:57 -07:00
Peyton Murray	ffe12a5f10	[Tune] Add rich output for ray tune progress updates in notebooks (#26263 ) These changes are part of a series intended to improve integration with notebooks. This PR modifies the tune progress status shown to the user if tuning is run from a notebook. Previously, part of the trial progress was reported in an HTML table; now, all progress is displayed in an organized HTML template. Signed-off-by: pdmurray <peynmurray@gmail.com>	2022-08-30 15:09:40 -07:00
Balaji Veeramani	dad98dcabd	[AIR] Add `TorchCheckpoint.from_state_dict` (#27970 ) PyTorch recommends saving state dictionaries instead of modules, but we don't support any way to do this. Signed-off-by: Balaji Veeramani balaji@anyscale.com	2022-08-30 13:05:30 -07:00
Antoni Baum	8a30606308	[AIR][Docs] Improve Hugging Face notebook example (#28121 ) Improves the HF notebook by making use of preprocessors and adding a section on tuning. Brings it in line with the Ray Summit 2022 demo. Signed-off-by: Antoni Baum antoni.baum@protonmail.com	2022-08-30 12:36:41 -07:00
Antoni Baum	d7f712d202	[AIR] Split train dataset in `HuggingFaceTrainer` (#28170 ) https://github.com/ray-project/ray/pull/25428 inadvertently turned off train dataset splitting for the `HuggingFaceTrainer`, which meant it wasn't actually running in a data parallel fashion. This PR fixes that. Signed-off-by: Antoni Baum antoni.baum@protonmail.com	2022-08-30 12:35:44 -07:00
SangBin Cho	f74f155af4	Revert "Revert "Revert "[serve][xlang]Support deploying Python deployment from Java. …" (#27945)" (#28153 ) This starts breaking the Mac Java build with new errors; I think it is the same issue that led us to revert this PR before. This reverts commit `af488e1`.	2022-08-30 12:00:29 -07:00
Kai Fricke	42dc034503	[ci] Pin moto to >= 4.0.0, adjust API (#28099 ) If this passes, it should be preferred over #28098. Adjust moto setup to use new API. Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Sven Mika <svenmika1977@gmail.com>	2022-08-30 11:39:32 -07:00
Antoni Baum	13457dab03	[AIR] Fix HF checkpointing with same-node workers (#28154 ) If we schedule multiple workers on the head node with HuggingFaceTrainer, a race condition can occur where they will begin moving the checkpoint files from their respective rank folders to one checkpoint folder, causing an exception. This PR fixes that and adds a test that would fail without this change. Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>	2022-08-30 11:24:13 -07:00
Alex Wu	e643b75129	[release][ci] Update disk size on release tests (#28156 ) The minimum size is 300GB. Signed-off-by: Alex Wu <alex@anyscale.io> Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-08-30 09:29:11 -07:00
Ian Rodney	adf875b4ce	[Cleanup] Update Put error message (#28050 ) We allow tasks to return ObjectRefs. I'm not sure when this support was added, but I think for quite a while.	2022-08-30 08:35:20 -07:00
Jiajun Yao	2c6a960733	Don't include script directory in sys.path if it's started via python -m (#28140 ) Redo #28043 Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-30 08:33:27 -07:00
Yi Cheng	4d91f516ca	[nightly] Add serve ha chaos test into nightly test. (#27413 ) This PR adds a serve ha test. The flow of the tests is: 1. check the kube ray build 2. start ray service 3. warm up the cluster 4. start killing nodes 5. get the stats and make sure it's good	2022-08-29 16:55:36 -07:00
Ian Rodney	8934a8d32b	[Raylet][Cleanup] Remove Extra Indent & Fix Typo (#28073 ) * Rename `is_existing` to `is_exiting` * Redundant `if statement`. This is covered by: `6bedaa5c87/python/ray/_raylet.pyx (L581)`	2022-08-29 15:32:36 -07:00
shrekris-anyscale	a15442a510	[docs] Omit bash prompt (#28028 ) Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>	2022-08-29 14:10:02 -07:00
Amog Kamsetty	acc4903db1	[AIR/Serve] Auto-enable GPU Prediction (#26549 ) Automatically enable GPU prediction for Predictors if num_gpus is set for the PredictorDeployment. Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2022-08-29 13:47:56 -07:00
Chong-Li	88a4114ac9	[Core][Enable gcs scheduler 4/n] Fix PG scheduling by gcs scheduler (#27084 ) This is the first split PR of #25075, which tried to enable gcs scheduler by default. This split PR mainly includes: In GcsPlacementGroupScheduler::CommitBundleResources() and ReturnBundleResources(), we have to trigger pending actors (in gcs) because resources have been updated. Still in the above two functions, we have to update PG wildcard resources in a special way. A PG's wildcard resources (on a certain node) have to be the sum of all related bundle resources. Even though CommitBundleResources() uses ToNodeBundleResourcesMap() to sum up bundle resources, it does not handle the scenario in which a single bundle (or a subset) is rescheduled, where this single bundle's wildcard resources would wrongly override the existing ones (see test_placement_group_reschedule_when_node_dead for such a scenario). Fix the remaining issues from ([Core][Enable gcs scheduler 3/n] integrate placement group with gcs scheduler #24842 (comment)).	2022-08-29 09:39:24 -07:00
ZhuSenlin	c7a3bcc232	[Core] fix resource leak when cancel actor in phase of creating (#27742 ) It is quite easy to cause a resource/process leak when cancelling an actor whose constructor is time-consuming.	2022-08-29 09:38:16 -07:00
Artur Niederfahrenhorst	250a73a756	[RLlib] Fix adding policies to RolloutWorkers with complex and discrete observation spaces. (#28133 )	2022-08-29 17:44:48 +02:00
Artur Niederfahrenhorst	51d16b8ff9	[RLlib] Test against failure of nodes, for example for practical use of spot instances. (#26676 )	2022-08-29 14:37:56 +02:00
Artur Niederfahrenhorst	2ce80d8163	[RLlib] Rename connector's from/to config methods to better reflect that they include state. (#27806 )	2022-08-29 14:37:21 +02:00
Kilian Lieret	328e6ac2f4	Slurm: Set load_env to empty string if not specified (#28132 )	2022-08-27 20:00:35 -07:00
Kai Fricke	bbd13ddc33	[air/docs] Add example to fetch results dataframe for trainer/tuner (#28067 )	2022-08-27 02:01:57 -07:00
Jiajun Yao	e6b0d5f95d	Revert "Don't include script directory in sys.path if it's started via python -m (#28043 )" (#28139 ) This reverts commit `b41ee37c3a`.	2022-08-26 21:37:21 -07:00
Jiajun Yao	c8617b9ebf	[Doc] Revamp ray core design patterns doc [3/n]: ray get in a loop (#28113 ) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>	2022-08-26 20:41:04 -07:00
Kilian Lieret	67a7481972	[docs/tune] Fix loguniform range in tune tutorial (#28131 )	2022-08-26 17:08:00 -07:00
Amog Kamsetty	00f6273775	[Docs] [Tune] `ResultGrid` Docs and API reference (#28068 ) Improve docstring for ResultGrid and show API reference and docstring in Tune API section. Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-08-26 16:50:35 -07:00

14122 commits