Commit graph

102 commits

Author SHA1 Message Date
Simon Mo
ca90c63483
[Serve] Add serve failure test to CI (#20392) 2021-11-16 08:12:08 -08:00
mwtian
a39fd74674
disable //python/ray/tests:test_autoscaler_drain_node_api in HA GCS build (#20296) 2021-11-12 15:47:42 -08:00
xwjiang2010
ce8504b0b2
[CI] Rebalance Tune tests a bit. (#20263) 2021-11-12 15:30:18 +00:00
chenk008
74fa267c72
Enable worker in container CI test (#20174) 2021-11-11 16:11:06 -08:00
mwtian
0330852baf
[Core][Pubsub] Implement Python GCS publisher and subscriber (#20111)
## Why are these changes needed?
This change adds Python publisher and subscriber in `gcs_utils.py`, and GRPC handler on GCS for publishing iva GCS. Error info is migrated to use the GCS-based pubsub, if feature flag `RAY_gcs_grpc_based_pubsub=true`.

Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set aftering Redis-less bootstrapping work.

Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published.

GCS based pubsub has gaps in handling deadline, cancelled requests and GCS restarts. So 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change.

## Related issue number
2021-11-11 14:59:57 -08:00
xwjiang2010
883fbd003c
[CI; Tune] Split Tune tests and examples (#20210)
* Split Tune tests and examples part 1 into tests and examples separate.

* fix typo.

* fix typo.

* Add docs.
2021-11-11 10:50:51 +01:00
Sven Mika
ebd56b57db
[RLlib; documentation] "RLlib in 60sec" overhaul. (#20215) 2021-11-10 22:20:06 +01:00
Sven Mika
50c30f89c6
[Tune; RLlib] Move Tune tests that use RLlib into separate buildkite job. (#20016) 2021-11-04 20:40:57 +01:00
Yi Cheng
65d3054a09
[build] fix the wrong flag for gcs ha test (#20052)
## Why are these changes needed?
It should be `RAY_gcs_grpc_based_pubsub` instead of `Ray_gcs_grpc_based_pubsub`

## Related issue number
2021-11-04 09:59:11 -07:00
Sven Mika
4cb23d1c95
[Tune; Testing] Revert to 3.7 (undone by accident by previous PR); + some minor comment cleanups. (#20031) 2021-11-04 10:58:34 +01:00
mwtian
f83195a1e1
[Build] Add GCS HA builds (#20008)
## Why are these changes needed?
Add builds for Python tests with GCS pubsub enabled.

## Related issue number
2021-11-03 11:58:16 -07:00
Avnish Narayan
026bf01071
[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535)
* Fix QMix, SAC, and MADDPA too.

* Unpin gym and deprecate pendulum v0

Many tests in rllib depended on pendulum v0,
however in gym 0.21, pendulum v0 was deprecated
in favor of pendulum v1. This may change reward
thresholds, so will have to potentially rerun
all of the pendulum v1 benchmarks, or use another
environment in favor. The same applies to frozen
lake v0 and frozen lake v1

Lastly, all of the RLlib tests and have
been moved to python 3.7

* Add gym installation based on python version.

Pin python<= 3.6 to gym 0.19 due to install
issues with atari roms in gym 0.20

* Reformatting

* Fixing tests

* Move atari-py install conditional to req.txt

* migrate to new ale install method

* Fix QMix, SAC, and MADDPA too.

* Unpin gym and deprecate pendulum v0

Many tests in rllib depended on pendulum v0,
however in gym 0.21, pendulum v0 was deprecated
in favor of pendulum v1. This may change reward
thresholds, so will have to potentially rerun
all of the pendulum v1 benchmarks, or use another
environment in favor. The same applies to frozen
lake v0 and frozen lake v1

Lastly, all of the RLlib tests and have
been moved to python 3.7
* Add gym installation based on python version.

Pin python<= 3.6 to gym 0.19 due to install
issues with atari roms in gym 0.20

Move atari-py install conditional to req.txt

migrate to new ale install method

Make parametric_actions_cartpole return float32 actions/obs

Adding type conversions if obs/actions don't match space

Add utils to make elements match gym space dtypes

Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-03 16:24:00 +01:00
Sven Mika
e6ae08f416
[RLlib] Optionally don't drop last ts in v-trace calculations (APPO and IMPALA). (#19601) 2021-11-03 10:01:34 +01:00
Sven Mika
2d24ef0d32
[RLlib] Add all simple learning tests as framework=tf2. (#19273)
* Unpin gym and deprecate pendulum v0

Many tests in rllib depended on pendulum v0,
however in gym 0.21, pendulum v0 was deprecated
in favor of pendulum v1. This may change reward
thresholds, so will have to potentially rerun
all of the pendulum v1 benchmarks, or use another
environment in favor. The same applies to frozen
lake v0 and frozen lake v1

Lastly, all of the RLlib tests and Tune tests have
been moved to python 3.7

* fix tune test_sampler::testSampleBoundsAx

* fix re-install ray for py3.7 tests

Co-authored-by: avnishn <avnishn@uw.edu>
2021-11-02 12:10:17 +01:00
mwtian
7afdfdc6dd
[CI] narrow down tests that run when files change (#19656) 2021-10-29 16:47:54 -07:00
matthewdeng
bfb0ef1b08
move jsonschema to core dependencies and update default AutoscalerPrometheusMetrics (#19831) 2021-10-28 13:04:22 -07:00
Amog Kamsetty
db863aafc0
Revert "Revert "[Docker] Support multiple CUDA Versions (#19505)" (#19756)" (#19763)
This reverts commit e58fcca404.
2021-10-26 17:32:56 -07:00
Amog Kamsetty
e58fcca404
Revert "[Docker] Support multiple CUDA Versions (#19505)" (#19756)
This reverts commit f0053d405b.
2021-10-26 12:55:20 -07:00
Avnish Narayan
ad87ddf93e
[rllib] Add deterministic test to gpu (#19306)
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-26 10:11:39 -07:00
Amog Kamsetty
f0053d405b
[Docker] Support multiple CUDA Versions (#19505)
* wip

* wip

* update

* finish

* deprecate

* debug

* fix and address comments

* try catch

* fix

* split tests

* force

* merge

* docs

* wip

* fix and check

* update readme

* fix

* fix

* fix sanity checking

* format
2021-10-25 18:57:05 -07:00
Jiajun Yao
256bf0bf3a
[Release] Bump up dask to latest compatible version 2021.9.1 (#19592)
* Bump up dask to latest compatible version 2021.9.1

* Bump up dask to latest compatible version 2021.9.1
2021-10-22 09:16:28 -07:00
Simon Mo
03805d4064
[Serve] Good error message when Serve not installed and ensure Serve installs ray[default] (#19570) 2021-10-21 13:47:29 -07:00
architkulkarni
b8941338d3
[runtime env] Raise error when creating runtime env when ray[default] is not installed (#19491) 2021-10-19 09:16:04 -05:00
matthewdeng
4674c78050
[Train] Rename Ray SGD v2 to Ray Train (#19436) 2021-10-18 22:27:46 -07:00
Kai Fricke
d8d8901192
[ci/tune] Remove deprecated jenkins_only tag from test tags (#19287) 2021-10-12 10:05:46 +01:00
SangBin Cho
0ef0d9a77d
Revert "[core] Assign tasks to the first available worker (#18167)" (#19180)
This reverts commit 545db13800.
2021-10-07 10:38:37 -07:00
Stephanie Wang
545db13800
[core] Assign tasks to the first available worker (#18167)
* Convert worker pool to queue

* Start up to backlog size more workers

* fixes

* Prestart workers according to num available CPUs

* lint

* x

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* dedicated workers

* Fix tests

* x

* fix

* asan

* asan

* Workers can only exec tasks with same job ID

* size_t for runtime env hash, fix unit tests

* include job ID in runtime env hash, remove from worker registration msg

* x

* conflict

* debug

* Schedule and dispatch periodically, skip if no new tasks

* Update src/ray/common/task/task_spec.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-10-05 13:45:50 -07:00
Kai Fricke
3dc176c42e
[ci/tune] Add SGD and Tune GPU pipeline step to CI (#18469)
* [ci/tune] Add Tune GPU pipeline step to CI

* cont.

* add sgd gpu tests

* format yaml, fix imports

* install horovod; fix line wrapping

* set GPU per worker to 0.5

* fix import

* move test to 4gpu machine

* fix lint

* lint

* set visible devices

* pull in tf gpu fix

* Fix Tune GPU pipeline step

* nit

* Disable GPU tests until we have some

* Re-add empty rllib tests

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
2021-10-01 18:34:05 -07:00
architkulkarni
0f0b161ea1
Revert "Revert "[Serve] [doc] Improve runtime env doc"" (#18943)
* Revert "Revert "[Serve] [doc] Improve runtime env doc (#18782)" (#18935)"

This reverts commit e4f4c79252.
2021-09-30 13:28:44 -05:00
Yi Cheng
e4f4c79252
Revert "[Serve] [doc] Improve runtime env doc (#18782)" (#18935)
This reverts commit d4d71985d5.
2021-09-27 21:52:13 -07:00
architkulkarni
d4d71985d5
[Serve] [doc] Improve runtime env doc (#18782) 2021-09-27 16:12:03 -05:00
Chen Shen
35aa944ef4
Fix thread-safety in global state accessor (#18746) 2021-09-19 12:01:31 -07:00
mwtian
efdbfcfdfb
[Build] Generate Bazel config for compiling with clang and libc++ in CI (#18622)
* Add Bazel config for building with llvm. Upgrade C++ std to 17.

* Fix redis. Try fixing asan and tsan

* Fix asan and format

* Update comments.

Co-authored-by: Chen Shen <scv119@gmail.com>
2021-09-17 19:01:07 -07:00
Sven Mika
8a72824c63
[RLlib Testig] Split and unflake more CI tests (make sure all jobs are < 30min). (#18591) 2021-09-15 22:16:48 +02:00
Edward Oakes
7736cdd91d
[dashboard] Rename "new_dashboard" -> "dashboard" (#18214) 2021-09-15 11:17:15 -05:00
Simon Mo
497c5f56fa
[CI] Temporary disable worker-in-container test (#18606)
* revert again

* disable tmp
2021-09-14 22:38:20 -07:00
mwtian
a3f399ef10
[Client] fix propagating errors to async calls during disconnect, and other cleanup (#18539)
* cleanup tests and errors for clients

* Fix lock and async get

* rerun

* Avoid running callback under lock. Make lock non-reentrant

* Add all necessary apis

* Removed unused APIs
2021-09-14 18:48:27 +03:00
Yi Cheng
7d1f408de9
[workflow] Move experimental/workflow to workflow (#18521) 2021-09-13 17:45:18 -07:00
Chen Shen
5f57079041
use clang for C++ debug testing (#18343) 2021-09-09 15:48:36 -07:00
Simon Mo
a29da81cfc
Revert "Revert "Fix tracing bug when actors are defined before connecting to …" (#16122) 2021-09-07 16:19:49 -07:00
matthewdeng
a3123b6860
[SGD] v2 Horovod backend (#18047)
* [SGD] add Horovod backend

* address comments: set CUDA_VISIBLE_DEVICES, refactor code

* fix gpu test

* fix lint/test import

* address comments, add example cluster config

* delay horovod imports
2021-08-31 12:54:59 -07:00
Kai Fricke
a8dbc44f9a
[ci] minimal dependency install test (#18071) 2021-08-31 15:26:25 +02:00
Sven Mika
599e589481
[RLlib] Move existing fake multi-GPU learning tests into separate buildkite job. (#18065) 2021-08-31 14:56:53 +02:00
Antoni Baum
5be6bda4cf
[tests] Add Ludwig CI test (#18126) 2021-08-30 12:27:39 -07:00
Amog Kamsetty
3b77840c1b
PyTorch Lightning Updates (#17876) 2021-08-27 23:15:51 -07:00
Chen Shen
7e3e0d1535
[Test] Add C++ tsan test (#17875) 2021-08-24 00:57:32 -07:00
Chen Shen
880797d5c2
[Core][Test] Add ubsan support for C++ tests (#17812)
* support ubsan

* update
2021-08-17 10:22:03 -07:00
SangBin Cho
4971e13941
[Build] Asan wheel test (#17685)
* in progerss

* ASAN tests.

* d

* in progress

* in progress without the asan wheel

* Support the asan wheel.

* Support the asan wheels

* Not build a binary for asan

* Fix issues

* Remove a wrong build

* Separate out asan wheel build

* Try preparing more deps.

* ip

* Try different version

* done

* d

* Trial

* Another try

* Another try

* skip cpp build to see what happens

* add more des

* ip

* abc

* Try next

* completed

* try

* Try without static libasan

* dbg

* Try static link

* Fix issues

* abc
2021-08-17 10:21:41 -07:00
Sven Mika
f3bbe4ea44
[RLlib] Test cases/BUILD cleanup; split "everything else" (longest running one rn) tests in 2. (#17640) 2021-08-16 22:01:01 +02:00
Clark Zinzow
d6eeb5dc70
[Datasets] Add local and S3 filesystem test coverage for file-based datasources. (#17158) 2021-08-12 08:39:31 -07:00