Commit graph

14122 commits

Author SHA1 Message Date
Clarence Ng
d22f62640c Merge branch 'master' into oomreleaset
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
2022-09-01 13:04:59 -07:00
Simon Mo
2b732dd1be
[CI] Skip windows://python/ray/serve:test_air_integrations_gpu (#28243)
No GPU on Windows.

Signed-off-by: simon-mo <simon.mo@hey.com>
2022-09-01 12:08:04 -07:00
Ricky Xu
5e0cf74377
remove env (#28218)
Try not to set special flags for nightly test.

Signed-off-by: rickyyx <rickyx@anyscale.com>
2022-09-01 11:58:13 -07:00
Clarence Ng
2ad32a9ef9 format
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
2022-09-01 11:16:14 -07:00
zcin
4c970cc882
[serve] Visualize Deployment Graph with Gradio (#27897) 2022-09-01 10:46:15 -07:00
Antoni Baum
48898aa03d
[AIR][CI] Speed up HF CI by ~20% (#28208)
Speeds up HuggingFaceTrainer/Predictor tests in CI by around 20% by switching to a different GPT model. This is the same model the Hugging Face team uses for their own CI.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-09-01 18:18:10 +01:00
Clarence Ng
27683e1901 Merge branch 'master' into oomreleaset
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
2022-09-01 09:59:01 -07:00
Clarence Ng
3cff48e213 test
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
2022-09-01 09:58:02 -07:00
Clarence Ng
f40a168a6a tests
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
2022-08-31 22:49:01 -07:00
Clarence Ng
73a37e8f9a release test
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
2022-08-31 22:42:34 -07:00
clarng
ac6d63e397
on (#28014)
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
2022-08-31 22:30:41 -07:00
Philipp Moritz
1bba65705a
[doc] Convert custom datetime column when reading a CSV file (#27854)
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
2022-08-31 21:25:28 -07:00
Yi Cheng
d0b879cdb1
[workflow] Change name in step to task_id (#28151)
We've deprecated the name option in favor of task_id. This cleanup fixes everything that was left.
2022-08-31 20:27:32 -07:00
shrekris-anyscale
f747415d80
[Serve] [Doc] Restore documentation about host and port in Serve config (#28219) 2022-08-31 20:27:00 -07:00
Clarence Ng
670c7da148 oom release test
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
2022-08-31 19:05:42 -07:00
Alan Guo
91cacd6214
Don't unfold first node in dashboard unless there is only one node in the cluster (#28108)
fixes #28107

Also moves the Host / Cmd Line column to be the first column so nodes and workers can be more easily distinguished.
2022-08-31 19:05:24 -07:00
Stephanie Wang
213e24cafd
[tests] Remove unnecessary sleep time from pipelined ingest tests #28182
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
2022-08-31 17:43:58 -07:00
Justin Yu
5cec2492bb
Fix tune resources example code (#28210)
The tune resources user guide contained broken code snippets. This PR fixes those, adds some extra clarifying comments, and improves the code style for readability.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
2022-08-31 14:48:41 -07:00
Ricky Xu
ed2929185c
[Core][State Observability] Wait for all nodes in release test (#28190)
Release tests are failing in Buildkite runs but succeed reliably on manual retry.
The suspected cause is that not all nodes are available when running with a large number of actors.
2022-08-31 13:52:19 -07:00
clarng
65fdd720f9
[core] memory monitor observability improvements: add metrics and log message (#27716)
Add more observability and record events when the raylet kills a task or actor due to memory usage going above threshold.
2022-08-31 13:50:40 -07:00
Artur Niederfahrenhorst
f420407b0d
[ML] Pin Pydantic <= 1.9.2 (#28205)
CI is red because of a dependency issue around dataclass_transform.

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-08-31 13:35:18 -07:00
xwjiang2010
958c22a0b0
[tune] Update GPU warning message in tune. (#28167)
Mention scaling config / with resources

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-08-31 12:29:09 -07:00
Alex Wu
dc08ce55ee
Add autoscaler code owners (#28213)
We already had these on docs, a bit of an oversight not adding this to the autoscaler itself too.

Signed-off-by: Alex Wu <alex@anyscale.io>
2022-08-31 12:02:09 -07:00
Jiajun Yao
5e2437923d
[Core] Remove unused args for default_worker.py (#28177)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-30 21:43:02 -07:00
Yi Cheng
4bff702e7b
[deflakey] Deflakey gcs_heartbeat_manager_test (#28142)
The heartbeat check runs every second, so a check could happen in less than 1s, which means it could happen very soon. This PR decreases the check period.
2022-08-30 15:26:57 -07:00
Peyton Murray
ffe12a5f10
[Tune] Add rich output for ray tune progress updates in notebooks (#26263)
These changes are part of a series intended to improve integration with notebooks. This PR modifies the tune progress status shown to the user if tuning is run from a notebook.

Previously, part of the trial progress was reported in an HTML table; now, all progress is displayed in an organized HTML template.

Signed-off-by: pdmurray <peynmurray@gmail.com>
2022-08-30 15:09:40 -07:00
Balaji Veeramani
dad98dcabd
[AIR] Add TorchCheckpoint.from_state_dict (#27970)
PyTorch recommends saving state dictionaries instead of modules, but we don't support any way to do this.

Signed-off-by: Balaji Veeramani <balaji@anyscale.com>
2022-08-30 13:05:30 -07:00
Antoni Baum
8a30606308
[AIR][Docs] Improve Hugging Face notebook example (#28121)
Improves the HF notebook by making use of preprocessors and adding a section on tuning. Brings it in line with the Ray Summit 2022 demo.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-08-30 12:36:41 -07:00
Antoni Baum
d7f712d202
[AIR] Split train dataset in HuggingFaceTrainer (#28170)
https://github.com/ray-project/ray/pull/25428 inadvertently turned off train dataset splitting for the `HuggingFaceTrainer`, which meant it wasn't actually running in a data parallel fashion. This PR fixes that.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-08-30 12:35:44 -07:00
SangBin Cho
f74f155af4
Revert "Revert "Revert "[serve][xlang]Support deploying Python deploy… (#28153)
This starts breaking the Mac Java build with new errors; I think it is the same issue that made us revert this PR before.

…ment from Java. …" (#27945)"

This reverts commit af488e1.
2022-08-30 12:00:29 -07:00
Kai Fricke
42dc034503
[ci] Pin moto to >= 4.0.0, adjust API (#28099)
If this passes, it should be preferred over #28098.

Adjust moto setup to use new API.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Sven Mika <svenmika1977@gmail.com>
2022-08-30 11:39:32 -07:00
Antoni Baum
13457dab03
[AIR] Fix HF checkpointing with same-node workers (#28154)
If we schedule multiple workers on the head node with HuggingFaceTrainer, a race condition can occur where they will begin moving the checkpoint files from their respective rank folders to one checkpoint folder, causing an exception. This PR fixes that and adds a test that would fail without this change.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-08-30 11:24:13 -07:00
Alex Wu
e643b75129
[release][ci] Update disk size on release tests (#28156)
The minimum size is 300GB

Signed-off-by: Alex Wu <alex@anyscale.io>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-08-30 09:29:11 -07:00
Ian Rodney
adf875b4ce
[Cleanup] Update Put error message (#28050)
We allow tasks to return ObjectRefs. I'm not sure when this support was added, but I think for quite a while.
2022-08-30 08:35:20 -07:00
Jiajun Yao
2c6a960733
Don't include script directory in sys.path if it's started via python -m (#28140)
Redo #28043

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-30 08:33:27 -07:00
Yi Cheng
4d91f516ca
[nightly] Add serve ha chaos test into nightly test. (#27413)
This PR adds a serve ha test. The flow of the tests is:

1. check the kube ray build
2. start ray service
3. warm up the cluster
4. start killing nodes
5. get the stats and make sure it's good
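
The five stages above could be driven in order roughly as follows. This is only a minimal sketch: `run_flow` and the step names are hypothetical placeholders for illustration, not the functions used by the actual nightly test, and each step here just records that it ran.

```python
from typing import Callable, List, Tuple


def run_flow(steps: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run named steps in order; an exception in any step aborts the flow."""
    completed = []
    for name, step in steps:
        step()  # a failing step raises, so later steps never run
        completed.append(name)
    return completed


# Hypothetical stand-ins for the stages listed in the commit message.
log: List[str] = []
steps = [
    ("check_kuberay_build", lambda: log.append("build checked")),
    ("start_ray_service", lambda: log.append("service started")),
    ("warm_up_cluster", lambda: log.append("cluster warmed up")),
    ("kill_nodes", lambda: log.append("nodes killed")),
    ("check_stats", lambda: log.append("stats verified")),
]

completed = run_flow(steps)
```

Because each step raises on failure, a broken build check stops the run before any node is killed.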
2022-08-29 16:55:36 -07:00
Ian Rodney
8934a8d32b
[Raylet][Cleanup] Remove Extra Indent & Fix Typo (#28073)
* Rename `is_existing` to `is_exiting`
* Redundant `if statement`. This is covered by: 

6bedaa5c87/python/ray/_raylet.pyx (L581)
2022-08-29 15:32:36 -07:00
shrekris-anyscale
a15442a510
[docs] Omit bash prompt (#28028)
Signed-off-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
2022-08-29 14:10:02 -07:00
Amog Kamsetty
acc4903db1
[AIR/Serve] Auto-enable GPU Prediction (#26549)
Automatically enable GPU prediction for Predictors if num_gpus is set for the PredictorDeployment.

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2022-08-29 13:47:56 -07:00
Chong-Li
88a4114ac9
[Core][Enable gcs scheduler 4/n] Fix PG scheduling by gcs scheduler (#27084)
This is the first split PR of #25075, which tried to enable gcs scheduler by default.

This split PR mainly includes:

In GcsPlacementGroupScheduler::CommitBundleResources() and ReturnBundleResources(), we have to trigger pending actors (in gcs) because resources have been updated.

Still in the above two functions, we have to update PG wildcard resources in a special way. A PG's wildcard resources (on a given node) have to be the sum of all related bundle resources. Even though CommitBundleResources() uses ToNodeBundleResourcesMap() to sum up bundle resources, it does not handle the scenario in which a single bundle (or a subset of bundles) is rescheduled; in that case the rescheduled bundle's wildcard resources would wrongly override the existing ones (see test_placement_group_reschedule_when_node_dead for such a scenario).

Fix the remaining issues from ([Core][Enable gcs scheduler 3/n] integrate placement group with gcs scheduler #24842 (comment)).
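
The wildcard-resource rule in this message can be illustrated with a small Python model. This is a sketch under stated assumptions, not the GCS scheduler's C++ code: `wildcard_totals`, the bundle dictionaries, and the node name are all hypothetical and exist only to show why summing just the rescheduled bundle gives the wrong total.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def wildcard_totals(bundles: List[dict]) -> Dict[Tuple[str, str], float]:
    """Sum per-bundle resources into one total per (node, resource) pair."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for bundle in bundles:
        for name, amount in bundle["resources"].items():
            totals[(bundle["node"], name)] += amount
    return dict(totals)


# Two bundles of one placement group, both committed on the same node.
bundles = [
    {"index": 0, "node": "node-a", "resources": {"CPU": 2.0}},
    {"index": 1, "node": "node-a", "resources": {"CPU": 1.0}},
]

# Problem case: only bundle 1 is rescheduled, and the sum over that subset
# replaces the node's wildcard entry, dropping bundle 0's contribution.
overwritten = wildcard_totals([bundles[1]])

# Intended result: the wildcard total covers every bundle of the group
# that lives on the node.
correct = wildcard_totals(bundles)
```

In this model the overwritten entry for `("node-a", "CPU")` only reflects bundle 1, while the correct entry is the sum over both bundles.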
2022-08-29 09:39:24 -07:00
ZhuSenlin
c7a3bcc232
[Core] fix resource leak when cancel actor in phase of creating (#27742)
It is quite easy to cause a resource/process leak when cancelling an actor whose constructor is time-consuming.
2022-08-29 09:38:16 -07:00
Artur Niederfahrenhorst
250a73a756
[RLlib] Fix adding policies to RolloutWorkers with complex and discrete observation spaces. (#28133) 2022-08-29 17:44:48 +02:00
Artur Niederfahrenhorst
51d16b8ff9
[RLlib] Test against failure of nodes, for example for practical use of spot instances. (#26676) 2022-08-29 14:37:56 +02:00
Artur Niederfahrenhorst
2ce80d8163
[RLlib] Rename connector's from/to config methods to better reflect that they include state. (#27806) 2022-08-29 14:37:21 +02:00
Kilian Lieret
328e6ac2f4
Slurm: Set load_env to empty string if not specified (#28132) 2022-08-27 20:00:35 -07:00
Kai Fricke
bbd13ddc33
[air/docs] Add example to fetch results dataframe for trainer/tuner (#28067) 2022-08-27 02:01:57 -07:00
Jiajun Yao
e6b0d5f95d
Revert "Don't include script directory in sys.path if it's started via python -m (#28043)" (#28139)
This reverts commit b41ee37c3a.
2022-08-26 21:37:21 -07:00
Jiajun Yao
c8617b9ebf
[Doc] Revamp ray core design patterns doc [3/n]: ray get in a loop (#28113)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-26 20:41:04 -07:00
Kilian Lieret
67a7481972
[docs/tune] Fix loguniform range in tune tutorial (#28131) 2022-08-26 17:08:00 -07:00
Amog Kamsetty
00f6273775
[Docs] [Tune] ResultGrid Docs and API reference (#28068)
Improve docstring for ResultGrid and show API reference and docstring in Tune API section.

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-26 16:50:35 -07:00