Commit graph

139 commits

Author SHA1 Message Date
mwtian
a39fd74674
disable //python/ray/tests:test_autoscaler_drain_node_api in HA GCS build (#20296) 2021-11-12 15:47:42 -08:00
xwjiang2010
ce8504b0b2
[CI] Rebalance Tune tests a bit. (#20263) 2021-11-12 15:30:18 +00:00
mwtian
be29fa0302
[CI] make using gcc 9 explicit (#20147) 2021-11-11 16:12:40 -08:00
chenk008
74fa267c72
Enable worker in container CI test (#20174) 2021-11-11 16:11:06 -08:00
mwtian
0330852baf
[Core][Pubsub] Implement Python GCS publisher and subscriber (#20111)
## Why are these changes needed?
This change adds a Python publisher and subscriber in `gcs_utils.py`, and a gRPC handler on GCS for publishing via GCS. Error info is migrated to the GCS-based pubsub when the feature flag `RAY_gcs_grpc_based_pubsub=true` is set.

Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set after the Redis-less bootstrapping work.

Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published.

GCS-based pubsub has gaps in handling deadlines, cancelled requests and GCS restarts, so 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change.
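
A minimal sketch of how a feature flag such as `RAY_gcs_grpc_based_pubsub` could gate which pubsub path is used. The helper names `flag_enabled` and `pick_error_channel`, the accepted truthy spellings, and the channel labels are assumptions for illustration only; they are not taken from this change.

```python
import os

# Flag name taken from the commit message above.
FLAG = "RAY_gcs_grpc_based_pubsub"


def flag_enabled(environ=None):
    """Return True when the flag is set to a truthy value.

    Hypothetical helper: treating "true" and "1" as truthy is an
    assumption, not Ray's documented parsing rule.
    """
    env = os.environ if environ is None else environ
    return env.get(FLAG, "").strip().lower() in ("true", "1")


def pick_error_channel(environ=None):
    # Hypothetical labels, used only to show the branch on the flag.
    return "gcs" if flag_enabled(environ) else "legacy"
```

Passing a plain dict instead of the real environment keeps the branch easy to exercise in a unit test.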

## Related issue number
2021-11-11 14:59:57 -08:00
xwjiang2010
883fbd003c
[CI; Tune] Split Tune tests and examples (#20210)
* Split Tune tests and examples part 1 into separate tests and examples.

* fix typo.

* fix typo.

* Add docs.
2021-11-11 10:50:51 +01:00
Sven Mika
ebd56b57db
[RLlib; documentation] "RLlib in 60sec" overhaul. (#20215) 2021-11-10 22:20:06 +01:00
matthewdeng
78e9ff7c91
[train][datasets] add example for big data training (#20042)
* [train][datasets] add example for big data training

* add title docstring

* lint and dependencies

* add dask_ml requirement
2021-11-05 09:28:48 -07:00
Sven Mika
50c30f89c6
[Tune; RLlib] Move Tune tests that use RLlib into separate buildkite job. (#20016) 2021-11-04 20:40:57 +01:00
Yi Cheng
65d3054a09
[build] fix the wrong flag for gcs ha test (#20052)
## Why are these changes needed?
It should be `RAY_gcs_grpc_based_pubsub` instead of `Ray_gcs_grpc_based_pubsub`
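
A small sketch of why the capitalization matters, assuming the flag is looked up by exact name. A plain dict stands in for the process environment on a case-sensitive platform, and `read_flag` is a hypothetical helper, not part of this fix.

```python
# Both spellings are quoted from the commit message above.
CORRECT = "RAY_gcs_grpc_based_pubsub"
MISSPELLED = "Ray_gcs_grpc_based_pubsub"


def read_flag(environ, name=CORRECT):
    """Look up the flag by exact, case-sensitive name."""
    return environ.get(name)


# A build that exports the misspelled name never enables the feature.
env_with_typo = {MISSPELLED: "true"}
env_fixed = {CORRECT: "true"}
```

With the misspelled name the lookup returns nothing, so the test run would proceed with the flag effectively unset.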

## Related issue number
2021-11-04 09:59:11 -07:00
Sven Mika
4cb23d1c95
[Tune; Testing] Revert to 3.7 (undone by accident by previous PR); + some minor comment cleanups. (#20031) 2021-11-04 10:58:34 +01:00
mwtian
a26474156d
Use GCC 9 in GPU docker (#20024) 2021-11-03 22:53:17 -07:00
mwtian
f83195a1e1
[Build] Add GCS HA builds (#20008)
## Why are these changes needed?
Add builds for Python tests with GCS pubsub enabled.

## Related issue number
2021-11-03 11:58:16 -07:00
Jiajun Yao
5de4a38948
[CI] Run Java CI on Mac (#19757)
Why are these changes needed?
Enable Java tests on Mac CI to avoid more breakages.

Related issue number
Closes #19700
2021-11-03 23:40:05 +08:00
Avnish Narayan
026bf01071
[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535)
* Fix QMix, SAC, and MADDPG too.

* Unpin gym and deprecate pendulum v0

Many tests in rllib depended on pendulum v0,
however in gym 0.21, pendulum v0 was deprecated
in favor of pendulum v1. This may change reward
thresholds, so we may have to rerun
all of the pendulum v1 benchmarks, or use another
environment instead. The same applies to frozen
lake v0 and frozen lake v1.

Lastly, all of the RLlib tests have
been moved to python 3.7

* Add gym installation based on python version.

Pin python<= 3.6 to gym 0.19 due to install
issues with atari roms in gym 0.20

* Reformatting

* Fixing tests

* Move atari-py install conditional to req.txt

* migrate to new ale install method

Make parametric_actions_cartpole return float32 actions/obs

Adding type conversions if obs/actions don't match space

Add utils to make elements match gym space dtypes
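
The version-dependent gym pin described in this commit message can be sketched as follows. The function name `gym_requirement` and the exact requirement strings are assumptions for illustration; the commit only states that python <= 3.6 is pinned to gym 0.19 and that gym is otherwise upgraded to 0.21.

```python
import sys


def gym_requirement(version_info=None):
    """Pick a gym pin for the given interpreter version.

    Hypothetical helper: the "gym==..." strings are illustrative and
    are not copied from the repository's requirements files.
    """
    major, minor = (version_info or sys.version_info)[:2]
    if (major, minor) <= (3, 6):
        # Older interpreters stay on 0.19 because of the atari ROM
        # install issues with gym 0.20 mentioned above.
        return "gym==0.19"
    return "gym==0.21"
```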

Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-03 16:24:00 +01:00
Sven Mika
e6ae08f416
[RLlib] Optionally don't drop last ts in v-trace calculations (APPO and IMPALA). (#19601) 2021-11-03 10:01:34 +01:00
Sven Mika
2d24ef0d32
[RLlib] Add all simple learning tests as framework=tf2. (#19273)
* Unpin gym and deprecate pendulum v0

Many tests in rllib depended on pendulum v0,
however in gym 0.21, pendulum v0 was deprecated
in favor of pendulum v1. This may change reward
thresholds, so we may have to rerun
all of the pendulum v1 benchmarks, or use another
environment instead. The same applies to frozen
lake v0 and frozen lake v1.

Lastly, all of the RLlib tests and Tune tests have
been moved to python 3.7

* fix tune test_sampler::testSampleBoundsAx

* fix re-install ray for py3.7 tests

Co-authored-by: avnishn <avnishn@uw.edu>
2021-11-02 12:10:17 +01:00
xwjiang2010
c48d86e469
[CI] change git protocol to use https. (#19964) 2021-11-01 19:38:58 -07:00
mwtian
7afdfdc6dd
[CI] narrow down tests that run when files change (#19656) 2021-10-29 16:47:54 -07:00
matthewdeng
bfb0ef1b08
move jsonschema to core dependencies and update default AutoscalerPrometheusMetrics (#19831) 2021-10-28 13:04:22 -07:00
Simon Mo
5e927b01ad
Revert "[CI] Remove config that disables Bazel test result cache" (#19818)
* Revert "[CI] Remove config that disables Bazel test result cache (#18701)"

This reverts commit 098ff36faa.

* Remove all RLlib tests from BUILD that currently fail.

Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-28 15:54:53 +02:00
Amog Kamsetty
db863aafc0
Revert "Revert "[Docker] Support multiple CUDA Versions (#19505)" (#19756)" (#19763)
This reverts commit e58fcca404.
2021-10-26 17:32:56 -07:00
Amog Kamsetty
e58fcca404
Revert "[Docker] Support multiple CUDA Versions (#19505)" (#19756)
This reverts commit f0053d405b.
2021-10-26 12:55:20 -07:00
Avnish Narayan
ad87ddf93e
[rllib] Add deterministic test to gpu (#19306)
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-26 10:11:39 -07:00
Amog Kamsetty
f0053d405b
[Docker] Support multiple CUDA Versions (#19505)
* wip

* wip

* update

* finish

* deprecate

* debug

* fix and address comments

* try catch

* fix

* split tests

* force

* merge

* docs

* wip

* fix and check

* update readme

* fix

* fix

* fix sanity checking

* format
2021-10-25 18:57:05 -07:00
Jiajun Yao
256bf0bf3a
[Release] Bump up dask to latest compatible version 2021.9.1 (#19592)
* Bump up dask to latest compatible version 2021.9.1

* Bump up dask to latest compatible version 2021.9.1
2021-10-22 09:16:28 -07:00
Simon Mo
03805d4064
[Serve] Good error message when Serve not installed and ensure Serve installs ray[default] (#19570) 2021-10-21 13:47:29 -07:00
mwtian
098ff36faa
[CI] Remove config that disables Bazel test result cache (#18701) 2021-10-19 13:31:42 -07:00
architkulkarni
b8941338d3
[runtime env] Raise error when creating runtime env when ray[default] is not installed (#19491) 2021-10-19 09:16:04 -05:00
matthewdeng
4674c78050
[Train] Rename Ray SGD v2 to Ray Train (#19436) 2021-10-18 22:27:46 -07:00
Antoni Baum
e9df253f5d
[CI/docs] Remove [default] from xgboost-ray (#19186)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-14 16:29:55 +01:00
Kai Fricke
d8d8901192
[ci/tune] Remove deprecated jenkins_only tag from test tags (#19287) 2021-10-12 10:05:46 +01:00
Matti Picus
9ca34c7192
add dependencies to BUILD.bazel and update windows bazel to 4.2.1 (#19132)
* add dependencies to BUILD.bazel and update windows bazel to 4.2.1

* fixes from review
2021-10-11 10:25:19 -07:00
SangBin Cho
0ef0d9a77d
Revert "[core] Assign tasks to the first available worker (#18167)" (#19180)
This reverts commit 545db13800.
2021-10-07 10:38:37 -07:00
Stephanie Wang
545db13800
[core] Assign tasks to the first available worker (#18167)
* Convert worker pool to queue

* Start up to backlog size more workers

* fixes

* Prestart workers according to num available CPUs

* lint

* x

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* dedicated workers

* Fix tests

* x

* fix

* asan

* asan

* Workers can only exec tasks with same job ID

* size_t for runtime env hash, fix unit tests

* include job ID in runtime env hash, remove from worker registration msg

* x

* conflict

* debug

* Schedule and dispatch periodically, skip if no new tasks

* Update src/ray/common/task/task_spec.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-10-05 13:45:50 -07:00
Kai Fricke
3dc176c42e
[ci/tune] Add SGD and Tune GPU pipeline step to CI (#18469)
* [ci/tune] Add Tune GPU pipeline step to CI

* cont.

* add sgd gpu tests

* format yaml, fix imports

* install horovod; fix line wrapping

* set GPU per worker to 0.5

* fix import

* move test to 4gpu machine

* fix lint

* lint

* set visible devices

* pull in tf gpu fix

* Fix Tune GPU pipeline step

* nit

* Disable GPU tests until we have some

* Re-add empty rllib tests

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
2021-10-01 18:34:05 -07:00
architkulkarni
0f0b161ea1
Revert "Revert "[Serve] [doc] Improve runtime env doc"" (#18943)
* Revert "Revert "[Serve] [doc] Improve runtime env doc (#18782)" (#18935)"

This reverts commit e4f4c79252.
2021-09-30 13:28:44 -05:00
Yi Cheng
e4f4c79252
Revert "[Serve] [doc] Improve runtime env doc (#18782)" (#18935)
This reverts commit d4d71985d5.
2021-09-27 21:52:13 -07:00
architkulkarni
d4d71985d5
[Serve] [doc] Improve runtime env doc (#18782) 2021-09-27 16:12:03 -05:00
mwtian
43ac18bbc0
[Build] include minimal debug info in C++ build; upgrade clang-format to 12 (#18888)
* Revert "Revert "[Build] include minimal debug info in C++ build; upgrade clang-format to 12 (#18840)" (#18886)"

This reverts commit f851a072f3.

* use gcc 8
2021-09-24 17:59:05 -07:00
Chen Shen
f851a072f3
Revert "[Build] include minimal debug info in C++ build; upgrade clang-format to 12 (#18840)" (#18886)
This reverts commit 07e1366383.
2021-09-24 12:55:08 -07:00
mwtian
07e1366383
[Build] include minimal debug info in C++ build; upgrade clang-format to 12 (#18840)
* debug info and clang-format

* doc

* fix

* no clang-format on all files

* gcc

* keep gcc 7
2021-09-24 12:26:33 -07:00
Chen Shen
35aa944ef4
Fix thread-safety in global state accessor (#18746) 2021-09-19 12:01:31 -07:00
mwtian
efdbfcfdfb
[Build] Generate Bazel config for compiling with clang and libc++ in CI (#18622)
* Add Bazel config for building with llvm. Upgrade C++ std to 17.

* Fix redis. Try fixing asan and tsan

* Fix asan and format

* Update comments.

Co-authored-by: Chen Shen <scv119@gmail.com>
2021-09-17 19:01:07 -07:00
Sven Mika
8a72824c63
[RLlib Testing] Split and unflake more CI tests (make sure all jobs are < 30min). (#18591) 2021-09-15 22:16:48 +02:00
Antoni Baum
7e95f330d5
[ci] Fix xgboost_ray install from git (#18640) 2021-09-15 18:07:15 +01:00
Edward Oakes
7736cdd91d
[dashboard] Rename "new_dashboard" -> "dashboard" (#18214) 2021-09-15 11:17:15 -05:00
Antoni Baum
eeb67a42cc
pip install xgboost_ray -> xgboost_ray[default] (#18607)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-09-15 14:45:56 +01:00
Simon Mo
497c5f56fa
[CI] Temporarily disable worker-in-container test (#18606)
* revert again

* disable tmp
2021-09-14 22:38:20 -07:00
SangBin Cho
0684531e22
[Test] Break down placement group tests (#18612) 2021-09-14 21:55:18 -07:00