7433 commits

Author SHA1 Message Date
Qing Wang
07e619f404
Remove unsed script. (#14462) 2021-03-04 11:24:00 +08:00
Ian Rodney
759892740a
[Autoscaler] chown Ray_bootstrap Files in DockerCommandRunner (#14380) 2021-03-03 19:13:20 -08:00
Antoine Galataud
460c2757a3
Allow assigning weight to var with close name (#14109)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-03-03 19:11:34 -08:00
Eric Liang
99a63b3dd1
Remove old scheduler and friends (#14184) 2021-03-03 18:29:15 -08:00
Dmitri Gekhtman
3f6c23e3cc
[doc][autoscaler][minor] Fix quickstart guide: ray.init(address='auto') (#14459) 2021-03-03 17:58:52 -08:00
Richard Liaw
dba533dd84
Disable more torch (#14480)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-03-03 15:46:32 -08:00
tchordia
e40dc3a3e9
[serve] Better validation for arguments to client.start() (#14327) 2021-03-03 14:33:36 -08:00
Richard Liaw
60a8b67488
Disable mnist tests (#14474) 2021-03-03 13:25:01 -08:00
Hao Zhang
4135b0eb4a
[Collective] Supporting multistream, stream pool, and CUDA events. (#14127)
Co-authored-by: fustinose <fustinosej@gmail.com>
2021-03-03 09:53:45 -08:00
ZhuSenlin
dcff25aed6
remove invalid code inside NodeManager::NodeAdded (#14273)
Co-authored-by: senlin.zsl <senlin.zsl@antgroup.com>
2021-03-03 09:20:21 -08:00
SangBin Cho
a04ab9b472
[Core] Fix ray memory bug (#14452)
* ray memory bug

* Fix ray memory issue.

* done.
2021-03-03 09:20:00 -08:00
SangBin Cho
1d2136959f
[Core] Fix port issue (#14435)
* Initial impl.

* Update.

* fixed a bug.

* Fix all the issues.

* Addressed code review.

* Addressed code review.

* Fix a test failure.
2021-03-03 09:16:00 -08:00
Xianyang Liu
fc9182e63c
Fixes autoscaling monitor when environment has set http_proxy or https_proxy (#14351) 2021-03-03 18:22:53 +02:00
Sven Mika
5637d89ecc
[RLlib] Serve + RLlib example script. (#14416) 2021-03-03 14:33:03 +01:00
Sven Mika
7718ec70fb
[RLlib] Remove old SegmentTree from tests dir and unflake respective segment tree test. (#14450) 2021-03-03 14:31:30 +01:00
Kai Yang
d653394c7f
[Java] Some bug fixes about Java UT workflow (#14444) 2021-03-03 19:32:14 +08:00
Kai Yang
c53c909130
[Java] Quit worker process after RunTaskExecutionLoop to avoid orphan Java worker processses (#14442) 2021-03-03 16:47:17 +08:00
Antoni Baum
85a092c3d7
[Tune] Fix HEBO evaluated rewards for max mode & save/restore (#14427)
* Fix HEBO evaluated rewards for max mode

* Lint

* Make sure everything necessary is saved
2021-03-03 09:44:43 +01:00
Richard Liaw
63c2b7356e
Disable windows tests for test_iter and test_reference_counting (#14455)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-03-03 00:39:59 -08:00
fangfengbin
1054613da1
[Core]Fix ray.kill doesn't cancel pending actor bug (#14154)
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2021-03-03 16:12:32 +08:00
Stephanie Wang
5c6c9d5b91
[core] Spill tasks from waiting queue (#14288)
* Spill back waiting tasks

* test

* test

* todo

* Avoid iterating over args

* update

* lint

* Fix test

* test

* Test force spillback

* Unit test resource scheduler

* test

* travis?

* rename

* debug

* revert flaky test

* lint

* fix test

* fix
2021-03-02 22:30:02 -08:00
Dmitri Gekhtman
1675156a8b
[autoscaler][interface] Use multi node types in defaults.yaml and example-full.yaml (#14239)
* random doc typo

* example-full-multi

* left off max workers

* wip

* address comments, modify defaults, wip

* fix

* wip

* reformat more things

* undo useless diff

* space

* max workers

* space

* copy-paste mishaps

* space

* More copy-paste mishaps

* copy-paste issues, space, max_workers

* head_node_type

* legacy yamls

* line undeleted

* correct-gpu

* Remove redundant GPU example.

* Extraneous comment

* whitespace

* example-java.yaml

* Revert "example-java.yaml"

This reverts commit 1e9c0124b9d97e651aaeeb6ec5bf7a4ef2a2df17.

* tests and other things

* doc

* doc

* revert max worker default

* Kubernetes comment

* wip

* wip

* tweak

* Address comments

* test_resource_demand_scheduler fixes

* Head type min/max workers, aws resources

* fix example_cluster2.yaml

* Fix external node type test (compatibility with legacy-style external node types)

* fix test_autoscaler_aws

* gcp-images

* gcp node type names

* fix gcp defaults

* doc format

* typo

* Skip failed Windows tests

* doc string and comment

* assert

* remove contents of default external head and worker

* legacy external failed validation test

* Readability -- define the minimal external config at the top of the file.

* Remove default worker type min worker

* Remove extraneous global min_workers comment.

* per-node-type docker in aws/example-gpu-docker

* ray.worker.small -> ray.worker.default

* fix-docker

* fix gpu docker again

* undo kubernetes experiment

* fix doc

* remove worker max_worker from kubernetes

* remove max_worker from local worker node type

* fix doc again

* py38

* eric-comment

* fix cluster name

* fix-test-autoscaler

* legacy config logic

* pop resources

* Remove min_workers AFTER merge

* comment, warning message

* warning, comment
2021-03-03 06:16:19 +02:00
Eric Liang
ef873be9e8
Require opt-in to switching plasma to /tmp instead of /dev/shm (#14451) 2021-03-02 16:44:33 -08:00
Richard Liaw
d92c00e233
Pin autogluon.core for builds (#14448) 2021-03-02 15:55:03 -08:00
Kai Fricke
47603045f9
[tune] Move Optuna to ask/tell interface (#14387) 2021-03-02 15:35:11 -08:00
SangBin Cho
bacbdd297b
[Core] Do not unregister workers that own objects by worker capping mechanism. (#14408)
* Almost done.

* Initial implementation done.

* Fix issue.

* Addressed the initial code review.

* improve comments.

* Addressed code review.

* Adding unit tests.

* Complete unit tests.

* Resolve all issues.

* Fix issues.
2021-03-02 12:24:22 -08:00
Edward Oakes
b7516ef667
hide CLI option for redis shard ports (#14434) 2021-03-02 11:06:34 -08:00
Alex Wu
4572c6cf0f
[autoscaler] Fix tag cache bug, don't kill workers on error (#14424) 2021-03-02 11:06:06 -08:00
Yi Cheng
d921dca075
[core] Fixing bug when dispatching tasks to deleted placement group (#14300) 2021-03-02 10:24:53 -08:00
Richard Liaw
aa24f8db9d
[tune] fix tune builds (#14447) 2021-03-02 10:20:20 -08:00
Stephanie Wang
a24ac13671
[core] Randomize actor ID to avoid collisions (#14358)
* Randomize actor ID

* Mix index and current time, add python test

* test

* nanos
2021-03-02 10:00:28 -08:00
Richard Liaw
1e25747818
Fix docker builds by marking resolver (#14445)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-03-02 09:36:40 -08:00
Tao Wang
2de01ee3b1
[GCS]Cherry pick heartbeat function into another thread (#14301) 2021-03-02 17:49:02 +08:00
SangBin Cho
09fd38ede1
[Multi node shuffle] More efficient ray memory --stats-only (#14423)
* Done.

* Fix all the issues.
2021-03-01 23:14:06 -08:00
Dmitri Gekhtman
58c0959ea7
[kubernetes][docs][minor] Move Kubernetes example scripts to docs (#14412) 2021-03-01 20:17:16 -08:00
Amog Kamsetty
ca11b189b8
[Tune] use epoch for ptl checkpoint dir name (#14392)
* use epoch for dir name

* use formatted string
2021-03-01 20:14:35 -08:00
Eric Liang
dbaa28f81e
Add links to new rllib paper (#14432) 2021-03-01 20:11:40 -08:00
SangBin Cho
0ec8efbb47
[Core] Minor fixes (#14411)
* Fix issue.

* Lint.

* Addressed code review.
2021-03-01 18:37:05 -08:00
SangBin Cho
b1e0409447
[Test] Improve scalability envelope (#14406)
* fixed.

* fix.

* Update the result.

* Addressed code review.
2021-03-01 18:36:52 -08:00
Eric Liang
eab53a8808
Update Ray client docs (#14422) 2021-03-01 14:08:34 -08:00
Eric Liang
9db000ff2c
Auto report object store memory usage; remove some deprecated code (#14260) 2021-03-01 13:19:44 -08:00
Edward Oakes
ff00a89927
Enable test_async_goal_manager (#14419) 2021-03-01 14:20:28 -06:00
Barak Michener
2a28585bb3
[ray_client]: Add architecture doc (#14265) 2021-03-01 10:56:11 -08:00
Ian Rodney
9125b6bca3
[Autoscaler][GCP] Use Python3.8 in defaults.yaml (#14417) 2021-03-01 10:50:39 -08:00
Micah Yong
db0c16824c
[Dashboard][CLI] Ray memory parity with dashboard 2 (#13444)
* Minor improvements in Ray Core Walkthrough as seen in https://github.com/ray-project/ray/issues/12472

* Define node_stats() to return NodeStats object from cluster

* Add --group-by and --sort-by capabilities to ray memory script

* Resolve merge conflict

* Add helper functions for group by and sorting type in memory_utils.py

* Reformat

* Format

* Compartmentalize memory script into get_memory_summary and get_store_stats_summary

* Modify unit tests in test_mem_stat

* Lint and format

* Test cases for group_by sort_by

* Lint and format

* Fix actor handle failing test case

* Update test_memstat.py

* Resolve merge conflicts

* Adjust ray memory output based on terminal size

* Formatting and linting

* Use constant for callsite length

* Switch from OS to shutil for querying terminal size (official python support)

* Linting and formatting

* Lint and format

* Resolve lint issue in walkthrough.rst

* Revert to python 3.6

* Delete visitor.py

It was accidentally included in most recent commit

* Delete .eggs

It was accidentally included in most recent commit

* Resolve test_object_spilling.py test case

* Add stats only argument

* revert changes on this file

* Remove package-lock.json

* Add back npm installation

* Sync package-lock.json

* Linting and formatting

* Sync with package-lock

* Sync with package-lock pt 2

* Update documentation in https://docs.ray.io/en/master/memory-management.html

* Add include_memory_info as argument for node_stats

* Switch object ref and call site positions

* Linting and formatting

* Change from MiB to B

* Change from stats-only to store-true

* Add memory test case

* Add memory test case

* Lint and format

* Correct test in memstat

* Change line wrap and stats only to flags

* Clarify --stats-only and --no-format in ray memory

* --stats-only description modified

Co-authored-by: Micah Yong <micahyong@Micahs-MacBook-Pro.local>
2021-03-01 09:27:22 -08:00
Raphael CHEN
343ebf8ea7
[tune] Checkpoint according to nested metric (#14379) 2021-03-01 17:14:39 +01:00
Qing Wang
f7f64e90ed
[Minor] Remove unused field. (#14382)
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2021-03-01 19:35:28 +08:00
dependabot[bot]
cda4ad044a
[tune](deps): Bump mlflow from 1.13.1 to 1.14.0 in /python/requirements (#14396)
Bumps [mlflow](https://github.com/mlflow/mlflow) from 1.13.1 to 1.14.0.
- [Release notes](https://github.com/mlflow/mlflow/releases)
- [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.rst)
- [Commits](https://github.com/mlflow/mlflow/compare/v1.13.1...v1.14.0)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-03-01 12:28:15 +01:00
dependabot[bot]
c925e8d14c
[tune](deps): Bump ax-platform in /python/requirements (#14398)
Bumps [ax-platform](https://github.com/facebook/Ax) from 0.1.19 to 0.1.20.
- [Release notes](https://github.com/facebook/Ax/releases)
- [Changelog](https://github.com/facebook/Ax/blob/master/CHANGELOG.md)
- [Commits](https://github.com/facebook/Ax/compare/0.1.19...v0.1.20)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-03-01 12:27:45 +01:00
Kai Fricke
7f9340bb2f
[tune] Add leading zeros to checkpoint directory (#14152)
* [tune] Add leading zeros to checkpoint directory

* Fix exp analysis tests/support string indices

* Fix tests

* RLLib tests
2021-03-01 12:12:19 +01:00
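
The last entry above (#14152) adds leading zeros to checkpoint directory names. The usual motivation for such a change is that zero-padded names sort lexicographically in the same order as their numeric index. The sketch below illustrates that idea only. The helper name `checkpoint_dir_name`, the `checkpoint_` prefix, and the padding width of 6 are assumptions made for illustration, not taken from the commit.

```python
def checkpoint_dir_name(iteration: int, width: int = 6) -> str:
    """Return a zero-padded checkpoint directory name.

    Hypothetical helper: the prefix and default width are assumptions,
    not the format used by the commit listed above.
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    # Nested format spec: pad the integer with zeros up to `width` digits.
    return f"checkpoint_{iteration:0{width}d}"


iterations = (1, 2, 10)

# Without padding, string sorting puts 10 before 2.
unpadded = sorted(f"checkpoint_{i}" for i in iterations)

# With padding, string sorting matches numeric order.
padded = sorted(checkpoint_dir_name(i) for i in iterations)

print(unpadded)
print(padded)
```

Under these assumptions, tools that list directories alphabetically (for example `ls` or `sorted(os.listdir(...))`) return padded checkpoints in training order, as long as the iteration count stays within the chosen width.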