11393 commits

Author SHA1 Message Date
Sven Mika
1c791b71d8
[RLlib] Fix Unity3D built-in examples action bounds from -inf/inf to -1.0/1.0. (#22247) 2022-02-10 03:00:30 +01:00
Sven Mika
44d09c2aa5
[RLlib] Filter.clear_buffer() deprecated (use Filter.reset_buffer() instead). (#22246) 2022-02-10 02:58:43 +01:00
Sven Mika
637cacedc9
[RLlib] Discussion 4986: OU Exploration (torch) crashes when restoring from checkpoint. (#22245) 2022-02-10 02:58:09 +01:00
Sven Mika
c73e0597fa
[RLlib] Discussion 2022: Fix batch_mode="complete_episodes" documentation inaccuracy. (#22074) 2022-02-10 02:57:27 +01:00
Jiajun Yao
d295a9d545
Revert "[Datasets] Support ignoring NaNs in aggregations. (#20787)" (#22258)
This reverts commit f264cf800a.
2022-02-09 16:25:19 -08:00
SangBin Cho
30000ff8ae
Fix a bug from many drivers. (#22248)
After this PR (https://github.com/ray-project/ray/pull/22156), for some reason the driver script contains a string that cannot be encoded as ASCII. Using UTF-8 appears to solve the problem.
2022-02-09 15:17:15 -08:00
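The encoding failure described in the "many drivers" fix above can be reproduced in isolation. This is a minimal sketch, not the code from the PR: the helper name `encode_script` and the sample script string are assumptions made for illustration.

```python
def encode_script(script: str, encoding: str = "utf-8") -> bytes:
    """Encode a driver script for transport (hypothetical helper)."""
    return script.encode(encoding)


# Hypothetical driver script containing a non-ASCII character.
script = "print('résultat')"

try:
    encode_script(script, encoding="ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # ASCII cannot represent 'é', so the encode call raises.
    ascii_ok = False

# UTF-8 can represent any Unicode string and round-trips losslessly.
utf8_bytes = encode_script(script)
```

Defaulting to UTF-8 avoids having to know in advance which characters a script contains.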
SangBin Cho
e5cab878b8
[Core] Disable runtime env logs (#22198)
Disable runtime env logs streamed to the driver by default and improve the documentation.
2022-02-09 14:43:25 -08:00
xwjiang2010
fc88b0895e
[tune] fix //rllib:tests/test_placement_groups (#22256) 2022-02-09 14:42:31 -08:00
Alex Wu
35028182f0
[Autoscaler] No infeasible warning for placement groups (#22235)
Ensures that we don't log a warning message about an infeasible resource demand when that custom resource is a placement group (since the placement group resource likely just needs to be created by the raylet).

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-02-09 12:47:36 -08:00
Archit Kulkarni
54b2e143e4
[Doc] [Jobs] Add size limit and recommendations for working_dir (#22219)
Previously it wasn't obvious which working_dir option was recommended, and the size limit for local working_dir didn't appear on the Jobs page. (The user would have had to go to the runtime_env API reference to see the size limit.) This PR makes this information more prominent.
2022-02-09 13:56:02 -06:00
Archit Kulkarni
50e2bef9d0
[Jobs] Hide dashboard from Job Submission import path (#22223)
For public SDK APIs, change the import path from 

```python
from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo
from ray.dashboard.modules.job.sdk import JobSubmissionClient
```

to 
```python
from ray.job_submission import JobStatus, JobSubmissionClient
```

`JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.
2022-02-09 13:55:32 -06:00
Jiao
293e45c527
[Doc] [Serve] Fix README's quick_start and add to test suite (#22228) 2022-02-09 11:49:47 -08:00
Jiao
54a71e6c4f
[Ray DAG] Add execute() interface to take user inputs with ENTRY_POINT tag. (#22196)
## Diff Summary

The current implementation of DAGNode pre-binds inputs, and the signature of `def execute(self)` doesn't take user input yet. This PR extends the interface to take user input and marks DAG entrypoint methods as the first stop for all user requests in a DAG. This is needed to unblock the next step: the Serve pipeline implementation that serves user requests.

Closes #22196 #22197

Notable changes:
- Added a `DAG_ENTRY_POINT` flag to the Ray DAG API to annotate DAG entrypoint functions (functions or class methods only). All marked functions receive identical input from the user as the first layer of the DAG.
- Changed the implementations of ClassNode and FunctionNode accordingly, so a node executes differently depending on whether it is marked as an entrypoint.
- Added a `kwargs_to_resolve` kwarg to the interface of `DAGNode` to handle args that subclasses need in order to resolve their implementation, without exposing all terms at the parent class level.
  - This is particularly important for ClassMethodNode binding, so we can have implementations to track method name, parent ClassNode as well as previous class method call without existiting 
  - Changed implementation of `_copy()` to handle execution of `kwargs_to_resolve`.
  - Changed implementation of `_apply_and_replace_all_child_nodes()` to fetch DAGNode type in `kwargs_to_resolve`.
- Added pretty printed lines for `kwargs_to_resolve`
2022-02-09 13:29:28 -06:00
Jiajun Yao
673ecd1241
Isolate ray configs for each job (#22206)
If we run multiple jobs in the same process (this is basically the behavior of python tests), they should be isolated in the sense that system config for job 1 shouldn't affect config for job 2.
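```python
import ray

# Job 1: start a driver in this process, run work, then shut down.
ray.init(_system_config={})
# job 1
ray.shutdown()

# Job 2: a second driver in the same process.
ray.init(_system_config={})
# job 2
ray.shutdown()
```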

Currently this is not the case, since RayConfig is a static variable that is shared across drivers in the same process. This PR resets the configs to their default values before applying the job-specific _system_config.

Note: this is a backward-incompatible change if users depend on the current behavior, but I'm not aware of such a case.
2022-02-09 10:18:46 -08:00
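The reset-before-apply idea from the config isolation commit above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ProcessConfig`, its keys, and its default values are hypothetical and do not reflect Ray's actual RayConfig implementation.

```python
from typing import Any, Dict, Optional


class ProcessConfig:
    """Toy stand-in for a process-wide (static) config; not Ray's RayConfig."""

    # Hypothetical keys and defaults, chosen only for illustration.
    _DEFAULTS: Dict[str, Any] = {"max_retries": 3, "timeout_s": 30}
    _values: Dict[str, Any] = dict(_DEFAULTS)

    @classmethod
    def initialize(cls, overrides: Optional[Dict[str, Any]] = None) -> None:
        # Reset to defaults first so settings from an earlier job cannot leak
        # into the next job that runs in the same process.
        cls._values = dict(cls._DEFAULTS)
        cls._values.update(overrides or {})

    @classmethod
    def get(cls, key: str) -> Any:
        return cls._values[key]


ProcessConfig.initialize({"timeout_s": 5})  # job 1 overrides one setting
job1_timeout = ProcessConfig.get("timeout_s")

ProcessConfig.initialize({})  # job 2 passes no overrides
job2_timeout = ProcessConfig.get("timeout_s")
```

Without the reset in `initialize`, job 2 would still observe the override that job 1 applied.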
Alex Wu
b122f093c1
Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test." (#22250)
Reverts ray-project/ray#22126

Breaks rllib:tests/test_io
2022-02-09 09:26:36 -08:00
Alex Wu
c9a419ac76
[Autoscaler] Remove staroid node provider (#22236)
The Staroid node provider has been abandoned and unmaintained for quite some time now. Due to the fact that there are no active maintainers, the original contributors cannot be reached, and there is no clear interest, we are no longer officially endorsing or supporting the node provider.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-02-09 09:18:18 -08:00
xwjiang2010
323511b716
[tune] Single wait refactor. (#21852)
This is a down-scoped change. For the full picture of the Tune control loop, see [`Tune control loop refactoring`](https://docs.google.com/document/d/1RDsW7SVzwMPZfA0WLOPA4YTqbRyXIHGYmBenJk33HaE/edit#heading=h.2za3bbxbs5gn)

1. Previously there were separate waits on pg ready and other events. As a result, there were quite a few timing tweaks that were inefficient, hard to understand, and hard to unit test. This PR consolidates them into a single wait that is handled by TrialRunner in each step.
- A few event types are introduced, each mapping to a scenario:
  * PG_READY --> Should place a trial onto it. If somehow there is no trial to be placed there, the pg will be put in _ready momentarily. This is because resources have historically been conceptualized as a pull-based model.
  * NO_RUNNING_TRIALS_TIME_OUT --> possibly a case of insufficient resources
  * TRAINING_RESULT
  * SAVING_RESULT
  * RESTORING_RESULT
  * YIELD --> This just means that training is simply taking very long. We need to punt back to the main loop to print out status info etc.

2. Previously TrialCleanup was not very efficient and could race between Trainable.stop() and `return_placement_group`. This PR streamlines the trial cleanup process by explicitly letting Trainable.stop() finish, followed by `return_placement_group(pg)`. Note that graceful shutdown is needed in cases like `pause_trial`, where checkpointing to memory needs to be given time to happen before the actor is gone.

3. Quite a few env variables (timing tweaks) are removed, which I consider OK to do without a deprecation cycle.
2022-02-09 15:31:17 +00:00
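The single-wait idea from the Tune refactor above can be sketched with standard-library futures. This is an illustration only, not Tune's TrialRunner code: the function `wait_for_next_event` is hypothetical, and mapping a timeout to `YIELD` is an assumption based on the description of that event.

```python
import enum
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Any, Dict, Optional, Tuple


class EventType(enum.Enum):
    # Event names taken from the commit message above.
    PG_READY = enum.auto()
    NO_RUNNING_TRIALS_TIME_OUT = enum.auto()
    TRAINING_RESULT = enum.auto()
    SAVING_RESULT = enum.auto()
    RESTORING_RESULT = enum.auto()
    YIELD = enum.auto()


def wait_for_next_event(
    pending: Dict[Future, EventType], timeout: float
) -> Tuple[EventType, Optional[Any]]:
    """Block once on all pending futures and return the first event (hypothetical)."""
    done, _ = wait(pending, timeout=timeout, return_when=FIRST_COMPLETED)
    if not done:
        # Nothing finished in time: hand control back to the main loop.
        return EventType.YIELD, None
    future = next(iter(done))
    # Remove the finished future so it is not waited on again.
    return pending.pop(future), future.result()


# Usage: one pending "training" future tagged with its event type.
with ThreadPoolExecutor(max_workers=1) as pool:
    pending_events = {pool.submit(lambda: 42): EventType.TRAINING_RESULT}
    first_event, first_result = wait_for_next_event(pending_events, timeout=5.0)
```

Tagging each future with its event type lets a single blocking call replace several separately timed waits.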
Artur Niederfahrenhorst
dea3574050
[RLlib] Replay Buffer API (#22114) 2022-02-09 15:04:43 +01:00
Jun Gong
3207f537cc
[RLlib] RecSim Interest evolution environment should use custom video sampler: IEvVideoSampler due to only one cluster being used. (#22211) 2022-02-09 10:29:35 +01:00
Clark Zinzow
f264cf800a
[Datasets] Support ignoring NaNs in aggregations. (#20787)
Adds support for ignoring NaNs in aggregations. NaNs will now be ignored by default, and the user can pass in `ds.mean("A", ignore_nulls=False)` if they would rather have the NaN be propagated to the output. Specifically, we'd have the following null-handling semantics:
1. Mix of values and nulls - `ignore_nulls=True`: Ignore the nulls, return the aggregation of the values
2. Mix of values and nulls - `ignore_nulls=False`: Return `None`
3. All nulls: Return `None`
4. Empty dataset: Return `None`

This all-null and empty-dataset handling matches the semantics of NumPy and Pandas.
2022-02-09 00:07:58 -08:00
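The four null-handling cases listed in the Datasets commit above can be written out as a small pure-Python function. This is a sketch of the stated semantics only, not the Datasets implementation; the function `mean_with_nulls` is hypothetical, and treating both `None` and NaN as nulls is an assumption.

```python
import math
from typing import Iterable, Optional


def _is_null(value: Optional[float]) -> bool:
    # Assumption: both None and float NaN count as nulls.
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean_with_nulls(
    values: Iterable[Optional[float]], ignore_nulls: bool = True
) -> Optional[float]:
    """Mean following the four null-handling cases above (hypothetical helper)."""
    items = list(values)
    non_null = [v for v in items if not _is_null(v)]
    if not non_null:
        # Cases 3 and 4: all nulls, or an empty dataset.
        return None
    if len(non_null) != len(items) and not ignore_nulls:
        # Case 2: a null is present and must be propagated.
        return None
    # Case 1, or no nulls at all: aggregate the remaining values.
    return sum(non_null) / len(non_null)
```

For example, `mean_with_nulls([1.0, float("nan"), 3.0])` averages only the two real values, while passing `ignore_nulls=False` on the same input yields `None`.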
Ishant Mrinal
f0d8b6d701
[RLlib] Fix compute_actions() for Trainer due to missing if prev_actions/rewards is not None checks. (#22078) 2022-02-09 09:05:26 +01:00
mwtian
71f63593f4
[Client] avoid locking in async send (#22193)
As @iycheng discovered in https://github.com/ray-project/ray/issues/22082#issuecomment-1031821631, when a `ClientObjectRef` is being GC'ed, `DataClient.lock` is acquired, which may cause a deadlock. This change avoids acquiring the lock in `DataClient._async_send()`.
2022-02-08 22:14:15 -08:00
SangBin Cho
20ab9188c6
[Ray Usage Stats] Record cluster metadata + Refactoring. (#22170)
This is the first PR to implement usage stats on Ray. Please refer to the file `usage_lib.py` for more details.

The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj.

You can see the full PR for phase 1 here: https://github.com/rkooo567/ray/pull/108/files.

The PR is doing some basic refactoring + adding cluster metadata to GCS instead of the version numbers. 

After this PR, we will add code to enable usage report "off by default".
2022-02-08 22:12:36 -08:00
SangBin Cho
d7cead7519
[Core] Improve ray stop (#22159)
I’d like to make a small proposal to change the behavior of `ray stop`.

### Status quo
It basically just sends a SIGTERM and finishes asynchronously. This is vulnerable to leaked processes or port conflicts when you run `ray stop && ray start`. I feel like the right behavior is as follows:

### New
- Send SIGTERM and wait for processes to terminate
- Display the progress to users
- If procs are not terminated within X seconds, send SIGKILL.

### API change
We will add a `--grace-period` flag: `ray stop --grace-period=X` (10 seconds by default).

And if users don’t want to be blocked, they can use `ray stop --force` (which already exists; it just sends SIGKILL, so procs are guaranteed to be terminated).
2022-02-08 22:07:44 -08:00
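The shutdown sequence proposed in the `ray stop` commit above (terminate, wait for a grace period, then force-kill) can be sketched with the standard library. This is not the `ray stop` implementation: the function `stop_process` is hypothetical, it handles a single child process, and it omits the progress display.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_period: float = 10.0) -> bool:
    """Stop a process gracefully, escalating if needed (hypothetical helper).

    Returns True if the process exited within the grace period and False if
    it had to be force-killed.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_period)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()  # reap the process so it does not linger as a zombie
        return False


# Usage: stop a child process that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exited_gracefully = stop_process(child, grace_period=5.0)
```

Waiting before returning is what prevents the leaked-process and port-conflict problems described above, at the cost of blocking the caller for up to the grace period.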
Yi Cheng
8b1bbfe8e4
[e2e] Fix an error when "env_vars" is not set. (#22234)
To fix error in session https://buildkite.com/ray-project/periodic-ci/builds/2699#c532ed2b-ee89-48ad-a7db-fd4211ef8bd9
2022-02-08 22:05:53 -08:00
Yi Cheng
d8ac01bd5c
[e2e] Update e2e test to use redisless ray by default. (#22189)
As the title says: now that the infra has been updated, we need to merge this PR so that tests can run Ray without Redis.
2022-02-08 19:46:48 -08:00
Chen Shen
1abe69e9b7
[refactor cluster-task-manage 1/n] separate resource reporting logic into helper class (#22215)
Separate Scheduler Resource Reporting logic into a separate class for better readability and maintainability.
2022-02-08 17:22:05 -08:00
Gagandeep Singh
000c56f764
[Serve] [Windows] Unskip all but test_redeploy_single_replica in test_deploy.py (#21391) 2022-02-08 16:30:25 -08:00
Balaji Veeramani
da343b91bb
[CI] Delete .style.yapf (#22199) 2022-02-08 16:29:58 -08:00
Balaji Veeramani
31ed9e5d02
[CI] Replace YAPF disables with Black disables (#21982) 2022-02-08 16:29:25 -08:00
Stephanie Wang
dcd96ca348
[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope (#22120)
When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.

This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.
2022-02-08 14:50:50 -08:00
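The race window described in the ref count commit above can be shown with a toy counter. This is a deliberately simplified model, not Ray's C++ reference counter: the class, its method names, and the `add_local_ref` parameter are assumptions for illustration.

```python
from typing import Dict


class ReferenceCounter:
    """Toy local reference counter; not Ray's C++ implementation."""

    def __init__(self) -> None:
        self._local_refs: Dict[str, int] = {}

    def add_owned_object(self, object_id: str, add_local_ref: bool) -> None:
        # add_local_ref=False models the old behavior (count starts at 0).
        # add_local_ref=True models the fix (count starts at 1).
        self._local_refs[object_id] = 1 if add_local_ref else 0

    def add_local_ref(self, object_id: str) -> None:
        self._local_refs[object_id] += 1

    def remove_local_ref(self, object_id: str) -> None:
        self._local_refs[object_id] -= 1

    def in_scope(self, object_id: str) -> bool:
        return self._local_refs.get(object_id, 0) > 0


# Old behavior: the backend registers the object, the frontend increments later.
old = ReferenceCounter()
old.add_owned_object("obj", add_local_ref=False)
in_scope_during_window = old.in_scope("obj")  # checked before the frontend runs
old.add_local_ref("obj")

# New behavior: the backend increments at creation and the frontend skips its
# initial increment, so there is no window with a zero count.
new = ReferenceCounter()
new.add_owned_object("obj", add_local_ref=True)
in_scope_after_fix = new.in_scope("obj")
```

Any asynchronous check that runs between registration and the frontend increment sees the object as out of scope in the old model, which is the window the PR closes.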
Jules S. Damji
6b7d995e64
Added a hands-on self-contained MLflow/Ray Serve deployment example (#22192) 2022-02-08 12:10:53 -08:00
Simon Mo
a3efee7ecf
[Serve] Add regression test for out of order submit (#20629) 2022-02-08 10:38:36 -08:00
Sven Mika
ac3e6ab411
[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test. (#22126) 2022-02-08 19:04:13 +01:00
Nikita Vemuri
d19aaf0fd3
[jobs] Add unit test for parse_cluster_info (#22205)
Add unit test to check addresses of various formats are correctly passed to `get_job_submission_client_cluster_info`.
2022-02-08 11:22:28 -06:00
Sven Mika
c17a44cdfa
Revert "Revert "[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learni…" (#22153) 2022-02-08 16:43:00 +01:00
Guyang Song
36ba514f9c
[Doc] Fix bad doc and recover doc of c++ api (#22213) 2022-02-08 19:04:37 +08:00
Guyang Song
9f77090c1c
[Doc] Fix bad links of dask and mars in ray-libraries.rst (#22210) 2022-02-08 19:02:49 +08:00
Gagandeep Singh
0f2a2224c2
PoolActor now uses num_cpus=0 to avoid any deadlock (#22048)
https://github.com/ray-project/ray/issues/21488#issuecomment-1027122177 :

> We discussed this issue in a bit more detail and came to the conclusion that we should set the CPU resource requirement for each actor in the actor pool to 0, to make the Ray Pool compatible/same behavior as the Python multiprocessing pool. Would that work for you @yogeveran ? (very similar to solution 4 mentioned above, but with 0.0 instead of 0.1, so it works in all cases).
2022-02-08 01:59:46 -08:00
SangBin Cho
1c41b0f566
[Test] Unflake pg test + add pg tests that weren't running (#22204)
Unflake pg test (pg test 3 times out occasionally) + add pg tests that weren't running
2022-02-08 01:47:22 -08:00
SangBin Cho
ac00389cbe
[Nightly test] Bring back the old way of running commands. (#22209)
Bring back the old way of running commands for non-k8s tests.

This also fixes the regression from many_drivers.py
2022-02-08 01:44:07 -08:00
Sriram Sankar
d06317eb1a
[Kuberay] Updated kuberay-autoscaler.yaml to create service account (#22188)
Added lines to autoscaler configuration yaml to create a service account that is used to give the autoscaler permissions to list and read pods and patch the cluster CRD for up/downscaling.
2022-02-07 22:04:34 -08:00
Eric Liang
8f7db1c6ab
Properly release resources of workers exiting due to max_calls (#22146)
Previously, the code incorrectly assumed that an exiting worker would disconnect from the raylet promptly to release resources. This isn't the case if the worker owns references. This PR plumbs through the right release-resources call even in this scenario.

Closes https://github.com/ray-project/ray/issues/10960

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2022-02-07 21:57:11 -08:00
Balaji Veeramani
ee1711fe41
[CI] Remove YAPF from format.sh (#21986) 2022-02-07 16:05:27 -08:00
Jiajun Yao
56c7b74072
Delete nightly shuffle_data_loader (#22185) 2022-02-07 15:23:34 -08:00
Archit Kulkarni
de2c950d55
[runtime env] Unify checks for empty runtime env using helper function (#22129)
Followup from https://github.com/ray-project/ray/pull/21788. Previously we had a lot of `serialized_runtime_env == "{}" || serialized_runtime_env == ""` scattered around the C++ code; this PR puts this in a helper function.
2022-02-07 17:18:51 -06:00
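The refactoring in the runtime env commit above replaces a repeated inline comparison with one helper. The actual change is in C++; the following is a Python sketch of the same idea, and the function name `is_runtime_env_empty` is an assumption, not the name used in the PR.

```python
def is_runtime_env_empty(serialized_runtime_env: str) -> bool:
    """Return True for the two encodings of "no runtime env" (hypothetical name)."""
    # Mirrors the inline check quoted above: an empty string or an empty
    # JSON object both mean that no runtime env was specified.
    return serialized_runtime_env in ("", "{}")
```

Call sites then read `if is_runtime_env_empty(env): ...` instead of repeating both comparisons, so a future change to what counts as "empty" only needs to be made in one place.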
Eric Liang
428d594d35
Also allow auto-closing of stale PRs (#22149)
Allow auto-close of stale PRs, with a shorter time limit.
2022-02-07 14:34:59 -08:00
Eric Liang
00b5801d71
Fix datasets leaking worker processes due to closure capture of stats actor handle (#22156) 2022-02-07 14:05:44 -08:00
Edward Oakes
8806b2d5c4
[jobs] Monitor jobs in the background to avoid requiring clients to poll (#22180) 2022-02-07 15:25:25 -06:00
Guyang Song
8e1e783596
fix "team:xxx" tag of cpp tests #22163
Cpp worker tests should be part of ray core.
2022-02-07 11:33:55 -08:00