10315 commits

Author SHA1 Message Date
Eric Liang
456d73754a
[data] Initial pass at supporting multiple-block returns for read and transform tasks (#19660) 2021-10-29 14:21:56 -07:00
SangBin Cho
f2b831f50f
[Placement Group] Fix the implicit value change from uint32_t -> uint64_t for pg scheduling retry (#19882)
* .

* done

* done
2021-10-29 12:16:53 -07:00
Philipp Moritz
0a5942d8b0
[Documentation] Fix quotes for windows installations (#19859)
* [Documentation] Fix quotes for windows installations

* update

* formatting
2021-10-29 10:54:38 -07:00
Lixin Wei
1fe9f3372e
[Nightly Test] Remove duplicate printing code (#19874)
## Why are these changes needed?

Remove duplicate printing code
2021-10-29 10:19:19 -07:00
Lixin Wei
56301e34b2
[Refactor] Remove ServiceBased Abstraction (#19694)
## Why are these changes needed?

Prior to this PR, we had:
```cpp
class XxxAccessor {};
class ServiceBasedXxxAccessor : public XxxAccessor {};

class GcsClient {};
class ServiceBasedGcsClient : public GcsClient {};
```

However, XxxAccessor has only one implementation: ServiceBasedXxxAccessor. And GcsClient has only one implementation: ServiceBasedGcsClient.

I think this abstraction is unnecessary and makes development harder (I have to modify two files every time).

This PR removes all ServiceBasedXxx classes and moves their implementations into the base classes.

Now we only have:
```cpp
class XxxAccessor {};
class GcsClient {};
```
2021-10-29 10:16:14 -07:00
Gagandeep Singh
9460a5375b
Added retry logic in test_basic::test_ray_options (#19832)
* Added retry logic in test_ray_options

* Applied linting format

* Made test consistent
2021-10-29 10:15:12 -07:00
architkulkarni
fdefd875c3
[Doc] [runtime env] Move runtime env section up one level, add inbound links (#19863) 2021-10-29 12:02:39 -05:00
SangBin Cho
4586ced5e4
Limit the max number of resource usage print (#19828)
* done

* done

* addressed code review

* done
2021-10-29 07:24:14 -07:00
Edward Oakes
bf23a31017
[job submission] Always generate and return job_id (#19851) 2021-10-29 09:09:54 -05:00
SangBin Cho
16dcff4091
[Core/RuntimeEnv] Fix runtime environment hanging issues. (#19823)
* done

* Add a right test

* Fix unit tests

* fix issues
2021-10-29 07:01:56 -07:00
Kai Fricke
fa0158abe5
[tune] Cloud checkpointing release tests (#19638) 2021-10-29 12:12:01 +02:00
Sven Mika
9c73871da0
[RLlib; Docs overhaul] Docstring cleanup: Evaluation (#19783) 2021-10-29 12:03:56 +02:00
Antoni Baum
f2773267c7
[docs] Tune doc fixes (#19791) 2021-10-29 11:45:29 +02:00
qicosmos
246a901aea
[C++ API] Support object ref args (#19550) 2021-10-29 17:36:53 +08:00
Kai Fricke
a13f738a10
[ci/release] Fix cloud search query (#19876) 2021-10-29 11:30:34 +02:00
Rohan138
b9c9cc5946
[RLlib] Updated PettingZoo+RLlib tutorial; Removed pettingzoo example script (#19069)
* Updated PettingZoo+RLlib tutorial

Updated the tutorial and added link to the blog post by the PettingZoo team.

* Ran linting

* Converted link to tinyurl for linting

* fixed line lengths

* Decrease num_workers to 1

* Added comments

* Decreased num_workers

* Decreased timesteps

* Increased num_workers

* Update links and remove pettingzoo_env.py

* remove pettingzoo.py script from tests

Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-29 10:57:10 +02:00
Sven Mika
902e854af2
[RLlib; Docs overhaul] Docstring cleanup: Environments. (#19784)
* wip.

* Test: Make a change in tune to trigger tune tests, which are not run otherwise, but seem to fail nevertheless with this PR's changes.

* remove bare_metal_policy_with_custom_view_reqs from tests
2021-10-29 10:46:52 +02:00
Stephanie Wang
e6d60d7376
[core] Fail objects when pull/reconstruction hangs (#19789) 2021-10-28 23:34:51 -07:00
Yi Cheng
68ec652be7
[gcs] New option to increase gcs grpc client threads and fix issues in hybrid scheduling (#19663)
## Why are these changes needed?

- Since broadcasting is moving to gRPC, introduce an option to increase the number of client-side threads
- For hybrid scheduling, ignore the threshold if the GCS-based actor scheduler is enabled

With these fixes, the actor creation rate is > 600 actors/s, versus ~140 actors/s before.

2021-10-28 22:40:18 -07:00
Chris K. W
bd4ad84ead
[Client] Add deprecation warnings for direct ray.client().connect() calls (#18783)
* add deprecation warning

* Update wording

* add test

* actually connect

* add env var tests

* fix message and test

* skip on windows

* add _LocalBuilder case, update test_namespace

* better variable name
2021-10-28 22:06:11 -07:00
Eric Liang
1ba07439fc
Reduce log level of concurrent actor creation 2021-10-28 20:44:14 -07:00
qicosmos
efef38f240
[C++ Worker] Add basic ref counting test cases (#17768) 2021-10-29 11:22:19 +08:00
Jiajun Yao
760878f950
Handle empty dataset for sort and groupby (#19849) 2021-10-28 18:49:33 -07:00
Simon Mo
0433281ec8
[CI] Bump Serve test_regression to medium for windows (#19844) 2021-10-28 17:49:50 -07:00
Philipp Moritz
0633ae45e9
[Documentation] Remove note about windows wheels needing dev runtime (#19847) 2021-10-28 16:59:58 -07:00
Kai Fricke
564d8551ed
[ci/release] only check alert if test succeeded before (#19857) 2021-10-28 16:09:10 -07:00
Edward Oakes
42ac906313
[job submission] Support passing metadata to the JobConfig (#19845) 2021-10-28 16:40:03 -05:00
SangBin Cho
9126810c41
[Usability] Improve the serialization failure message (#19691)
* Done

* done

* Done

* fix test

* Addressed code review.

* done

* done

* fix mistake

* Skip tests on windows
2021-10-28 14:25:51 -07:00
matthewdeng
bfb0ef1b08
move jsonschema to core dependencies and update default AutoscalerPrometheusMetrics (#19831) 2021-10-28 13:04:22 -07:00
SangBin Cho
96fc875a89
[Core] Improve scheduling observability and fix wrong resource deadlock report message. (#19746) 2021-10-28 11:42:21 -07:00
SangBin Cho
39486ef08c
[Core] Fix the resource leak if custom resources don't exist. #19837
Why are these changes needed?
The current logic can cause a resource leak if AllocateTaskResourceInstances is called with custom resources that don't exist on the local node. The original assumption was that the caller would free the resources when the call returns false, but that makes for an error-prone API, and it turns out that we don't do this anywhere.
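
The failure mode described above can be illustrated with a minimal, language-neutral sketch in Python. This is not Ray's C++ code: `LocalResources`, `allocate`, and `free` are hypothetical names, and validating the whole request before touching any state is only one possible way to avoid the leak, not necessarily what this PR does.

```python
from typing import Dict, Optional


class LocalResources:
    """Hypothetical stand-in for a node's local resource pool (not a Ray class)."""

    def __init__(self, available: Dict[str, float]) -> None:
        self.available = dict(available)

    def allocate(self, request: Dict[str, float]) -> Optional[Dict[str, float]]:
        """Allocate every requested resource, or nothing at all."""
        # Check the whole request before mutating any state, so a failed
        # call never leaves a partial allocation for the caller to free.
        for name, amount in request.items():
            if self.available.get(name, 0.0) < amount:
                return None
        for name, amount in request.items():
            self.available[name] -= amount
        return dict(request)

    def free(self, allocation: Dict[str, float]) -> None:
        """Return a successful allocation to the pool."""
        for name, amount in allocation.items():
            self.available[name] = self.available.get(name, 0.0) + amount


pool = LocalResources({"CPU": 4.0})
# "accelerator_x" does not exist on this node, so the request fails
# without consuming any CPU.
failed = pool.allocate({"CPU": 2.0, "accelerator_x": 1.0})
succeeded = pool.allocate({"CPU": 2.0})
```

With this shape, a failed call has nothing to clean up, so the API no longer relies on every caller remembering to free resources.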

Related issue number
Closes #17044
2021-10-28 11:00:34 -07:00
Amog Kamsetty
1803d88943
[Train] Simplify single worker training (#19814)
* wip

* update

* fix

* fix

* fix

* fix
2021-10-28 10:54:35 -07:00
shrekris-anyscale
6e6fff8857
[serve] Enable deployment of functions/classes that take no parameters (#19708) 2021-10-28 12:53:44 -05:00
Jiao
ed0e2e4fd7
[job submission] Add job_config in subprocess driver script (#19765) 2021-10-28 12:12:51 -05:00
gjoliver
d81885c1f1
[RLlib] Fix all the CI tests that were broken by is_training and replay buffer changes; re-comment-in the failing RLlib tests (#19809)
* Fix DDPG, since it is based on GenericOffPolicyTrainer.

* Fix QMix, SAC, and MADDPG too.

* Undo QMix change.

* Fix DQN input batch type. Always use SampleBatch.

* apex ddpg should not use replay_buffer_config yet.

* Make eager tf policy to use SampleBatch.

* lint

* LINT.

* Re-enable RLlib broken tests to make sure things work ok now.

* fixes.

Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-28 18:06:47 +02:00
Jiajun Yao
fe8138bfc2
Listen to 127.0.0.1 if node ip is 127.0.0.1 (#19810) 2021-10-28 08:44:23 -07:00
Simon Mo
5e927b01ad
Revert "[CI] Remove config that disables Bazel test result cache" (#19818)
* Revert "[CI] Remove config that disables Bazel test result cache (#18701)"

This reverts commit 098ff36faa.

* Remove all RLlib tests from BUILD that currently fail.

Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-28 15:54:53 +02:00
Eric Liang
f60d312259
Try fixing reference counting issue with manual _owner assignment (#19734) 2021-10-28 02:26:35 -07:00
Guyang Song
119318932a
remove the env config 'RAY_DASHBOARD_MODULE_EVENT' (#19629) 2021-10-28 16:51:59 +09:00
SangBin Cho
c414eb20d5
[Internal Observability] Improve the per task/actor resource usage visibility (#19782)
* prototype done

* done
2021-10-28 00:21:22 -07:00
Patrick Ames
8a9f664d75
[data] Add support for custom dataset block write path providers. (#19347)
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-10-28 00:12:02 -07:00
Chen Shen
224ed0fa5c
[Core][CoreWorker] graceful shutdown if GetCoreWorker is null (#19598)
There are cases where the language frontend calls GetCoreWorker() after the worker has already been shut down. Currently this results in a crash and causes confusion.

(pid=3714) [2021-10-21 10:50:23,596 C 3714 33544237] core_worker.cc:194:  Check failed: core_worker_process The core worker process is not initialized yet or already shutdown.
(pid=3714) *** StackTrace Information ***
(pid=3714)     ray::GetCallTrace()
(pid=3714)     ray::SpdLogMessage::Flush()
(pid=3714)     ray::SpdLogMessage::~SpdLogMessage()
(pid=3714)     ray::RayLog::~RayLog()
(pid=3714)     ray::core::CoreWorkerProcess::EnsureInitialized()
(pid=3714)     ray::core::CoreWorkerProcess::GetCoreWorker()
(pid=3714)     __pyx_pw_3ray_7_raylet_10CoreWorker_23get_worker_id()
(pid=3714)     _PyMethodDef_RawFastCallKeywords
(pid=3714)     _PyMethodDescr_FastCallKeywords
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     property_descr_get
(pid=3714)     _PyObject_GenericGetAttrWithDict
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     _PyEval_EvalCodeWithName
(pid=3714)     _PyFunction_FastCallKeywords
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     method_call
(pid=3714)     PyObject_Call
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     call_function
(pid=3714)     _PyEval_EvalFrameDefault
(pid=3714)     function_code_fastcall
(pid=3714)     method_call
(pid=3714)     PyObject_Call
(pid=3714)     t_bootstrap
(pid=3714)     pythread_wrapper
(pid=3714)     _pthread_start
(pid=3714)     thread_start
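
The trace above comes from a fatal check that fires when the accessor runs after shutdown. A rough Python sketch of the alternative is to let the accessor report "no worker" so that callers can back out quietly. All names here (`WorkerProcess`, `try_get_core_worker`, `get_worker_id`) are hypothetical illustrations, not Ray's API, and the actual C++ change may handle the null case differently.

```python
from typing import Optional


class CoreWorker:
    """Hypothetical stand-in for the per-process core worker."""

    def __init__(self, worker_id: str) -> None:
        self.worker_id = worker_id


class WorkerProcess:
    """Hypothetical owner of the core worker's lifetime."""

    def __init__(self, worker_id: str) -> None:
        self._core_worker: Optional[CoreWorker] = CoreWorker(worker_id)

    def shutdown(self) -> None:
        self._core_worker = None

    def try_get_core_worker(self) -> Optional[CoreWorker]:
        # Return None after shutdown instead of aborting the process.
        return self._core_worker


def get_worker_id(process: WorkerProcess) -> Optional[str]:
    """Frontend-style accessor that tolerates a worker that is already gone."""
    worker = process.try_get_core_worker()
    if worker is None:
        return None
    return worker.worker_id


process = WorkerProcess("worker-1")
before_shutdown = get_worker_id(process)
process.shutdown()
after_shutdown = get_worker_id(process)
```

Here a late call from another thread sees `None` and can stop cleanly, rather than producing a stack trace like the one above.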
2021-10-27 23:11:53 -07:00
Jiajun Yao
7fb65abae1
[data] Fix dataset doc (#19821) 2021-10-27 22:41:09 -07:00
Alex Wu
46965e7672
[ARM] Use uint64_t instead of unsigned long (#13774)
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-10-27 21:08:25 -07:00
Jiajun Yao
11751a1d87
Arrow block dataset groupBy (#19673) 2021-10-27 16:27:11 -07:00
Edward Oakes
b2e12dc43b
[runtime_env] Add basic support for python modules (#19651) 2021-10-27 17:56:46 -05:00
gjoliver
39b0faa3ec
[RLlib]: bug fix, should be input_dict['is_training'] (#19805) 2021-10-27 23:30:43 +02:00
Sven Mika
4a82d3ea6c
Revert "[RLlib; Docs overhaul] Docstring cleanup: Trainer, trainer_template, Callbacks. (#19758)" (#19806)
This reverts commit 80eeb13175.
2021-10-27 23:30:07 +02:00
Simon Mo
3e038aebb2
[CI] Allow release tests infra to accept buildkite artifacts (#19803) 2021-10-27 13:04:01 -07:00
Yi Cheng
98961d1ee2
[core] Fix the wrong error message in gcs for worker exits (#19774) 2021-10-27 12:55:27 -07:00