Commit graph

11949 commits

Author SHA1 Message Date
Tao Wang
2ce3cd0073
[Hotfix]Fix compile failure (#23651) 2022-04-01 13:23:11 -07:00
Kai Fricke
9071b39f3e
[ci/release] Add buildkite output groups (#23658)
This makes the buildkite output easier to parse and interpret.
2022-04-01 13:04:22 -07:00
Stephanie Wang
b43426bc33
[core] Add metrics for disk and network I/O (#23546)
Adds some metrics useful for object-intensive workloads:

    Per raylet/object manager:
        Add num bytes pending restore to spill manager
        Add num requests cumulative to PullManager
        Num bytes pushed/pulled from other nodes cumulative
        Histogram for request latencies in PullManager:
            total life time of request, from start to cancel
            request satisfaction time, from start to object local
            pull time, from object activation to object local
    Per-node disk read/write speed, IOPS
2022-04-01 11:15:34 -07:00
Jiajun Yao
c5c5c24e8f
Remove unused ObjectDirectory::LookupLocations() (#23647)
Remove dead code.
2022-04-01 10:03:37 -07:00
shrekris-anyscale
d4747d28eb
[serve] Set "memory" to None in ray_actor_options by default (#23619)
* Make default memory 1

* Add test to validate that ReplicaConfig's default memory cannot be lower than minimum

* Add a new option to memory_omitted_options

* Update if branch in test_replica_config_default_memory_minimum

* Make memory default value None
2022-04-01 09:14:44 -07:00
Kai Fricke
fe27dbcd9a
[air/release] Improve file packing/unpacking (#23621)
We use tarfile to pack/unpack directories in several locations. Instead of using temporary files, we can just use io.BytesIO to avoid unnecessary disk writes.

Note that this functionality is present in 3 different modules - in Ray (AIR), in the release test package, and in a specific release test. The implementations should live in the three modules independently, so we don't add a common utility for this (e.g. the ray_release package should be independent of the Ray package).
2022-04-01 07:38:14 -07:00
Sven Mika
0bb82f29b6
[RLlib] AlphaStar polishing (fix logger.info bug). (#22281) 2022-04-01 09:49:41 +02:00
Lingxuan Zuo
4510c2df90
[Python] export cython module for external project (#23579)
A lot of cython data types have been defined in ray cython module, but outside project cannot reuse these since ray doesn't export all of *.pxd files.

To fix mobius python building error (https://github.com/ray-project/mobius/runs/5740167581?check_suite_focus=true) : no found ray common.pxd, etc. , 

According to cython document https://cython.readthedocs.io/en/latest/src/userguide/source_files_and_compilation.html
we might add this package_data parameter in setup.py
```python
setup(
    package_data = {
        'my_package': ['*.pxd'],
        'my_package/sub_package': ['*.pxd'],
    },
    ...
)
```

Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
2022-04-01 10:31:33 +08:00
mwtian
1a4c3c07f7
[Dashboard] fix iterating over GPU processes (#23562)
Current logic looks broken, as reported in #22954 (comment)

I fixed the logic as best as I can, and tested it on Anyscale platform with GPU. No process info was reported from gpustat. But the logic works under this case.
2022-03-31 17:16:53 -07:00
Hao Chen
75f1861625
Remove predefined resources vector in ResourceRequest (#23584)
"ResourceRequest" now uses 2 containers: a vector for predefined resources, and a map for custom resources. 
This was intended to be a perf optimization. However, in practice, this makes the code more complex, and, moreover, prevents optimizations for some methods (e.g., "ResourceIds", "Size").

This PR removes the vector and makes ResourceRequest use only one map for all resources. Also, "ResourceIds" now returns a "boost:range" to allow iterating resource IDs without having to construct temporary sets. 

microbenchmark shows a slight perf improvement.
last nightly: `placement group create/removal per second 837.76 +- 16.68`.
this PR: `placement group create/removal per second 895.76 +- 16.99`.
2022-03-31 17:16:11 -07:00
Yi Cheng
5a2ab76af8
[flaky] Release gcs client in test (#23644)
To deflaky gcs_client_test, this PR tries to release the client object.
2022-03-31 16:57:50 -07:00
xwjiang2010
378b66984f
[air] reduce unnecessary stacktrace (#23475)
There are a few changes:
1. Between runner thread and main thread: The same stacktrace is raised in `_report_thread_runner_error` in main thread. So we could spare this raise in runner thread.
2. Between function runner and Tune driver: Do not wrap RayTaskError in TuneError.
3. Within Tune driver code: Introduces a per errored trial error.pkl and uses that to populate ResultGrid.

Plus some cleanups to facilitate propagating exception in runner and executor code.

Final stacktrace looks like: (omitted)

In Tune, we are capturing `traceback.format_exc` at the time the exception is caught and just pass the string around. This PR slightly changes that only in the case of when RayTaskError is raised, and we pass that object around.
It may be worthwhile to settle down on a practice of error handling in Tune in general.
I am also curious to learn how other ray library does that and any good lessons to learn. 

In particular, we should watch out for memory leaking in exception handling. Not sure if it is still a problem in python 3, but here are some articles I came across for reference
https://cosmicpercolator.com/2016/01/13/exception-leaks-in-python-2-and-3/
2022-03-31 22:59:58 +01:00
Yi Cheng
8d7f71601d
deflaky ray syncer test (#23641) 2022-03-31 13:42:30 -07:00
Sven Mika
2eaa54bd76
[RLlib] POC: Config objects instead of dicts (PPO only). (#23491) 2022-03-31 18:26:12 +02:00
mwtian
bd4d6b7e19
[Java] upgrade protobuf-java version (#23627) 2022-03-31 09:12:58 -07:00
Antoni Baum
756d08cd31
[docs] Add support for external markdown (#23505)
This PR fixes the issue of diverging documentation between Ray Docs and ecosystem library readmes which live in separate repos (eg. xgboost_ray). This is achieved by adding an extra step before the docs build process starts that downloads the readmes of specified ecosystem libraries from their GitHub repositories. The files are then preprocessed by a very simple parser to allow for differences between GitHub and Docs markdowns.

In summary, this makes the markdown files in ecosystem library repositories single sources of truth and removes the need to manually keep the doc pages up to date, all the while allowing for differences between what's rendered on GitHub and in the Docs.

See ray-project/xgboost_ray#204 & https://ray--23505.org.readthedocs.build/en/23505/ray-more-libs/xgboost-ray.html for an example.

Needs ray-project/xgboost_ray#204 and ray-project/lightgbm_ray#30 to be merged first.
2022-03-31 08:38:14 -07:00
Andrew Sedler
853f6d6de3
[Bug][Tune] Fix bugs that cause hanging PAUSED trials with PopulationBasedTrainingScheduler (#23472)
As discussed in #23424, the synch=True mode of PopulationBasedTrainingScheduler is (1) not compatible with burn_in_period and (2) causes the presence of TERMINATED trials to hang PAUSED trials indefinitely.

This change addresses (1) by setting the initial _next_perturbaton_sync to the max of burn_in_period and perturbation_interval in the constructor and (2) by checking only whether live trials have reached the _next_perturbation_sync before resuming PAUSED trials.
2022-03-31 08:33:51 -07:00
simonsays1980
9ca9c67bc9
[RLlib] Added dtype safeguards to the 'required_model_output_shape()' methods… (#23490) 2022-03-31 13:52:00 +02:00
Yi Cheng
87dc57df26
[3][cleanup][gcs] Remove redis based pubsub. (#23520)
This PR removes redis based pubsub.
2022-03-31 00:13:55 -07:00
simonsays1980
e4c6e9c3d3
[RLlib] Changed the if-block in the example callback to become more readable. (#22900) 2022-03-31 09:13:04 +02:00
simonsays1980
d2a3948845
[RLlib] Removed the sampler() function in the ParallelRollouts() as it is no needed. (#22320) 2022-03-31 09:06:30 +02:00
Yi Cheng
df3e761b18
[gcs] Change old syncer to gcs_syncer namespace (#23623)
Before integration of the newly introduced ray_syncer, there is a conflict in naming. This PR move the old ray syncer to another namespace.
2022-03-30 21:54:02 -07:00
ZhuSenlin
e79a63db64
[GCS] [3 / n] Refactor gcs_resource_scheduler to cluster_resource_scheduler #23371
As we (@scv119 @iycheng @raulchen @Chong-Li @WangTaoTheTonic ) discussed offline, the GcsResourceScheduler on the GCS side should be unified to ClusterResourceScheduler.

There is already a big PR( #23268 ) to do this, but in order to make review easy, I will split it to two or mall small PRs.
This is [3/n]:

Move the implementation of all policies from gcs_resource_scheduler to bundle_scheduling_plocy
Delete gcs_resource_scheduler
Refactor gcs_resource_scheduler_test to cluster_resource_scheduler_2_test
BTW: The interface inside ISchedulingPolicy should be refactor in another PR, see the discussion #23323 (comment)

To be clear:

scorer related codes are moved out from gcs_resoruce_scheduler to scorer.h/.cc and no logic changes.
Policy related codes are moved out from gcs_resoruce_scheduler to bundle_scheduling_policy.h/.cc, and a small part of the logic in "GcsResourceScheduler::Schedule" is distributed into each policy.
Some codes inside gcs_placement_group_scheduler.h/.cc are changed to adapt to new data structure (SchedulingResult and SchedulingContext)
2022-03-30 17:39:46 -07:00
Avnish Narayan
161d95c31b
[RLlib] Increase slateq workers to decrease runtime on prod (#23609) 2022-03-30 17:38:21 -07:00
Jiao
d7e77fc9c5
[DAG] Serve Deployment Graph documentation and tutorial. (#23512) 2022-03-30 17:32:16 -07:00
Yi Cheng
31483a003a
[syncer] skip ray_syncer_test on windows temporarily (#23610)
ray_syncer_test is flaky on windows. It's not so easy to investigate what's happening there. The test timeout somehow.
We disable it for short time.
2022-03-30 17:29:08 -07:00
Yi Cheng
d01f947ff1
[gcs] Make core worker test compilable. (#23608)
It seems like core worker test is not running and it breaks the build. This PR fixed this.
2022-03-30 17:26:38 -07:00
Chen Shen
944e8e1053
Revert "[Python Worker] load actor dependency without importer thread (#23383)" (#23601)
This reverts commit d1ef498638.
2022-03-30 15:45:00 -07:00
Chen Shen
3e80da7e9f
[ci/release] long running / change failed test to sdk (#23602)
close #23592. Talking with @krfricke and he suggested we move to use sdk for those long running tasks.
2022-03-30 12:57:21 -07:00
Kai Fricke
3b3f271c3c
[tune] Fix tensorflow distributed trainable docstring (#23590)
The current docstring was copied from horovod and still refers to it.
2022-03-30 11:36:45 -07:00
Jiajun Yao
2959294f02
[CI] Filter release tests by attr regex (#23485)
Support filtering tests by test attr regex filters. Multiple filters can be specified with one line for each filter. The format is attr:regex (e.g. team:serve)
2022-03-30 09:41:18 -07:00
jon-chuang
54ddcedd1a
[Core] Chore: move test to right dir #23096 2022-03-30 09:29:38 -07:00
Kai Fricke
e8abffb017
[tune/release] Improve Tune cloud release tests for durable storage (#23277)
This PR addresses recent failures in the tune cloud tests.

In particular, this PR changes the following:

    The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests.
    We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected)
    We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days.
    Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in kubernetes).

Local release test runs succeeded.

https://buildkite.com/ray-project/release-tests-branch/builds/189
https://buildkite.com/ray-project/release-tests-branch/builds/191
2022-03-30 09:28:33 -07:00
Kai Fricke
b0fc631dea
[docs/tune] Fix PTL multi GPU link (#23589)
Broken in current docs
2022-03-30 09:24:48 -07:00
Yi Cheng
781c46ae44
[scheduling][5] Refactor resource syncer. (#23270)
## Why are these changes needed?

This PR refactor the resource syncer to decouple it from GCS and raylet. GCS and raylet will use the same module to sync data. The integration will happen in the next PR.

There are several new introduced components:

* RaySyncer: the place where remote and local information sits. It's a coordinator layer.
* NodeState: keeps track of the local status, similar to NodeSyncConnection.
* NodeSyncConnection: keeps track of the sending and receiving information and make sure not sending the information the remote node knows.

The core protocol is that each node will send {what it has} - {what the target has} to the target.
For example, think about node A <-> B. A will send all A has exclude what B has to B.

Whenever when there is new information (from NodeState or NodeSyncConnection), it'll be passed to RaySyncer broadcast message to broadcast. 

NodeSyncConnection is for the communication layer. It has two implementations Client and Server:

* Server => Client: client will send a long-polling request and server will response every 100ms if there is data to be sent.
* Client => Server: client will check every 100ms to see whether there is new data to be sent. If there is, just use RPC call to send the data.

Here is one example:

```mermaid
flowchart LR;
    A-->B;
    B-->C;
    B-->D;
```

It means A initialize the connection to B and B initialize the connections to C and D

Now C generate a message M:

1. [C] RaySyncer check whether there is new message generated in C and get M
2. [C] RaySyncer will push M to NodeSyncConnection in local component (B)
3. [C] ServerSyncConnection will wait until B send a long polling and send the data to B
4. [B] B received the message from C and push it to local sync connection (C, A, D)
5. [B] ClientSyncConnection of C will not push it to its local queue since it's received by this channel.
6. [B] ClientSyncConnection of D will send this message to D
7. [B] ServerSyncConnection of A will be used to send this message to A (long-polling here)
8. [B] B will update NodeState (local component) with this message M
9. [D] D's pipelines is similar to 5) (with ServerSyncConnection) and 8)
10. [A] A's pipeline is similar to 5) and 8)
2022-03-29 23:52:39 -07:00
Eric Liang
5aead0bb91
Warn if the dataset's parallelism is limited by the number of files (#23573)
A common user confusion is that their dataset parallelism is limited by the number of files. Add a warning if the available parallelism is much less than the specified parallelism, and tell the user to repartition() in that case.
2022-03-29 18:54:54 -07:00
xwjiang2010
6443f3da84
[air] Add horovod trainer (#23437) 2022-03-29 18:12:32 -07:00
Matti Picus
e58b784ac7
WINDOWS: fix pip activate command (#23556)
Continuation of #22449 

Fix pip activation so something like this will not crash
```
ray.init(runtime_env={"pip": ["toolz", "requests"]})
```

Also enable test that hit this code path.
2022-03-29 17:51:20 -07:00
Amog Kamsetty
0b8c21922b
[Train] Improvements to fault tolerance (#22511)
Various improvements to Ray Train fault tolerance.

Add more log statements for better debugging of Ray Train failure handling.
Fixes [Bug] [Train] Cannot reproduce fault-tolerance, script hangs upon any node shutdown #22349.
Simplifies fault tolerance by removing backend specific handle_failure. If any workers have failed, all workers will be restarted and training will continue from the last checkpoint.
Also adds a test for fault tolerance with an actual torch example. When testing locally, the test hangs before the fix, but passes after.
2022-03-29 15:36:46 -07:00
Stephanie Wang
da7901f3fc
[core] Filter out self node from the list of object locations during pull (#23539)
Running Datasets shuffle with 1TB data and 2k partitions sometimes times out due to a failed object fetch. This happens because the object directory notifies the PullManager that the object is already on the local node, even though it isn't. This seems to be a bug in the object directory.

To work around this on the PullManager side, this PR filters out the current node from the list of locations provided by the object directory. @jjyao confirmed that this fixes the issue for Datasets shuffle.
2022-03-29 15:18:14 -07:00
Kai Fricke
922367d158
[ci/release] Fix smoke test compute templates (#23561)
The smoke test definitions of a few tests were faulty for compute template override.

Core tests @rkooo567: https://buildkite.com/ray-project/release-tests-branch/builds/294
2022-03-29 13:48:09 -07:00
Gagandeep Singh
3856011267
[Serve] [Test] Bytecode check to verify imported function correctness in test_pipeline_driver::test_loading_check (#23552) 2022-03-29 13:18:25 -07:00
Linsong Chu
2a6cbc5202
[workflow]align the behavior of workflow's options() with remote function's options() (#23469)
The current behavior of workflow's `.options()` is to **completely rewrite all the options** rather than **update options**, this is less intuitive and inconsistent with the behavior of `.options()` in remote functions.

For example:
```
# Remote Function
@ray.remote(num_cpus=2, max_retries=2)
f.options(num_cpus=1)
```
`options()` here **updated** num_cpus while **the rest options are untouched**, i.e. max_retires is still 2. This is the expected behavior and more intuitive.

```
# Workflow Step
@workflow.step(num_cpus=2, max_retries=2)
f.options(num_cpus=1)
```
`options()` here **completely drop all existing options** and only set num_cpus, i.e. previous value of max_retires (2) is dropped and reverted to default (3).  This will also drop other fields like `name` and `metadata` if name and metadata are given in the decorator but not in the options().
2022-03-29 12:35:04 -07:00
Yi Cheng
61c9186b59
[2][cleanup][gcs] Cleanup GCS client options. (#23519)
This PR cleanup GCS client options.
2022-03-29 12:01:58 -07:00
Simon Mo
cb1919b8d0
[Doc][Serve] Add minimal docs for model wrappers and http adapters (#23536) 2022-03-29 11:33:14 -07:00
Kai Fricke
afd287eb93
[ci] linkcheck should soft fail (#23559)
Linkcheck failures should not break the build.
2022-03-29 10:57:03 -07:00
Philipp Moritz
005ea36850
[linkcheck] Remove flaky url (#23549) 2022-03-29 08:36:54 -07:00
Artur Niederfahrenhorst
9a64bd4e9b
[RLlib] Simple-Q uses training iteration fn (instead of execution_plan); ReplayBuffer API for Simple-Q (#22842) 2022-03-29 14:44:40 +02:00
Jun Gong
a7e5aa8c6a
[RLlib] Delete some unused confusing logics. (#23513) 2022-03-29 13:45:13 +02:00
Hao Chen
b7d32df8b0
Refactor scheduler data structures (#22854)
This is the first PR to refactor scheduler data structures (See #22850).

Major changes:
- Hid the implementation details in the `ResourceRequest` and `TaskResourceInstnaces` classes, which expose public methods such as algebra operators and comparison operators. 
- Hid the differences between "predefined" and "custom" resources inside these 2 classes. Call sites can simply use the resource ID to access the resource, no matter it is predefined or custom.
- The predefined_resources vector now always has the full length. So no more "resize"s are needed. 
- Removed the `ResourceCapacity` class. Now "total" and "available" resources are stored in separate fields in "NodeResources". 
- Moved helper functions for FixedPoint vectors from "cluster_resource_data.h" to "fixed_point.h"
- "ResourceID" now has static methods to get the resource ids of predefined resources, e.g. "ResourceID::CPU()". 
- Encapsulated unit-instance resource logic to "ResourceID"

Other planned changes that are not included in this PR:
- Rename ResourceRequest to ResourceSet, and move it to its own file.
- Remove the predefined vectors and always use maps.

Co-authored-by: Chong-Li <lc300133@antgroup.com>
2022-03-29 19:44:59 +08:00