Commit graph

11031 commits

Author SHA1 Message Date
SangBin Cho
6b4aac7a08
Promote unstable tests to stable (#21811)
Promote tests that have passed 100% last 1 week to stable
2022-01-24 02:10:37 -08:00
Chen Shen
a60251f47a
[Core] Fix 16GB mac perf issue by limit the plasma store size to 2GB (#21224)
* add changes

* as title

* fix

* max to min

* fix tests
2022-01-24 01:52:59 -08:00
Shawn
6603ad450a
[Java] print hang test case name (#21804)
* print hang test case name

* use getFullTestName
2022-01-23 23:56:44 -08:00
SangBin Cho
1ae14ec513
[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774)
This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing

The fully working e2e prototype is here (it passes all tests): cdad913883

This PR is pure refactoring. Basically it moves some of util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. Look below to see the steps. 

<img width="693" alt="Screen Shot 2022-01-21 at 4 38 44 AM" src="https://user-images.githubusercontent.com/18510752/150528494-c3cdedf4-3a66-4557-b540-61436b1dbab6.png">
2022-01-23 21:11:32 -08:00
SangBin Cho
babc03edf2
Add a threaded actor k8s test (#21739)
Add threaded actor flaky test to k8s.
2022-01-23 20:12:57 -08:00
DK.Pino
8cd7a5c438
[Placement Group] Make placement group commit resource rpc request batched (#21240)
This is one part of this refactor, #20715 , make the commit resource RPC requests batched per node.
2022-01-23 06:16:09 -08:00
Jiao
5d382cfeb3
[nit] remove decorator in test_cli.py (#21792)
Full context see https://github.com/ray-project/ray/issues/21791

pytest work for "some" environments for this test and on CI master, but this decorator is still unnecessary and was introduced by mistake. So just remove it and see what happens with the original issue.
2022-01-23 06:05:05 -08:00
Lingxuan Zuo
ec62d7f510
[Streaming]Farewell : remove all of streaming related from ray repo. (#21770)
New repo url is https://github.com/ray-project/mobius

Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
2022-01-23 17:53:41 +08:00
Gagandeep Singh
2da2ac52ce
Unskipped test_worker_stdout (#21708) 2022-01-22 02:43:03 -08:00
Qing Wang
a37d9a2ec2
[Core] Support default actor lifetime. (#21283)
Support the ability to specify a default lifetime for actors which are not specified lifetime when creating. This is a job level configuration item.
#### API Change
The Python API looks like:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```

Java API looks like:
```java
  System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
  Ray.init();
```

One example usage is:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
  a1 = A.options(lifetime="non_detached").remote()   # a1 is a non-detached actor.
  a2 = A.remote()  # a2 is a non-detached actor.
```

Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-01-22 12:26:08 +08:00
Gagandeep Singh
b00385f9a2
Using a deterministic approach to check connections in test_job_timestamps (#21693) 2022-01-21 20:18:05 -08:00
Archit Kulkarni
c4bf68a083
[runtime env] Short-circuit protobuf serialization/deserialization for empty runtime envs (#21788)
* Fix check for empty runtime_env ("{}", not "")

* also check for ""
2022-01-22 13:07:03 +09:00
xwjiang2010
0abcd5eea5
[tune] only sync up and sync down checkpoint folder for cloud checkpoint. (#21658)
By default, ~/ray_results/exp_name/trial_name/checkpoint_name.
Instead of the whole trial checkpoint (~/ray_results/exp_name/trial_name/) directory.
Stuff like progress.csv, result.json, params.pkl, params.json, events.out etc are coming from driver process.
This could also enable us to de-couple sync up and delete - they don't have to wait for each other to finish.
2022-01-21 17:56:05 -08:00
mwtian
e8ce01c525
[Dashboard] offload blocking work to a thread pool (#21762)
Currently, GCS KV client only has blocking API. Calling them from dashboard event loop can block other operations for many seconds, leading to failures such as taking too long (> 2min) to submit a job and making nightly tests fail (#21699). This PR offloads the blocking work to a separate thread. Implementing async GCS KV API will be done in the future.
2022-01-21 17:55:11 -08:00
matthewdeng
8119b62640
[train] refactor callback logdir and results preprocessors (#21468)
* [train] Add TorchTensorboardProfilerCallback and introduce ResultsPreprocessors

* simplify profiler

* read on get_and_clear_profile_traces

* refactor callbacks

* remove var

* Update python/ray/train/callbacks/logging.py

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* Update python/ray/train/callbacks/results_prepocessors/keys.py

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* address comments; add tests

* fix test

* address comments

* docs

* address comments'

* fix test

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-01-21 17:23:34 -08:00
matthewdeng
165a025641
[train] update worker batch size docs (#21761)
Making it explicit how the user should think about batch size for PyTorch in a distributed setting, similar to what's already done for TensorFlow.

![image](https://user-images.githubusercontent.com/3967392/150421340-df73f574-8531-4626-88a6-b80442ea6b7f.png)
2022-01-21 17:22:47 -08:00
Junwen Yao
216c4bf9a6
[Serve] warn when serve.start() with different options (#21562) 2022-01-21 15:51:26 -08:00
Max Pumperla
f9b71a8bf6
[docs] new structure (#21776)
This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following way:

- [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign.
- [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow-up in the next PR (hoping this one doesn't get re-re-re-re-reverted).
2022-01-21 15:42:05 -08:00
shrekris-anyscale
45eebdd6e3
[Serve] Handle name collisions when unzipping directories in unzip_package (#21723)
Currently, the `unzip_package` function relies on `extract_file_and_remove_top_level_dir` to unzip and remove the top-level directory from archive working directories. However, `extract_file_and_remove_top_level_dir` uses `os.rename()` to remove the tld by manually unzipping each file from a zip file and moving it to the tld's parent. When the tld contains directories or files with the same name as the tld, `os.rename()` fails to move these files to the tld's parent because of the name collision between the file and the tld.

This change replaces `extract_file_and_remove_top_level_dir` with `remove_dir_from_filepaths`. Now, `unzip_package` unzips the entire zip file before `remove_dir_from_filepaths` moves all the tld's children to the tld's parent using `os.rename()`.

This edge case is tested in the new unit test `test_unzip_with_matching_subdirectory_names`. Additionally, `extract_file_and_remove_top_level_dir`'s unit test is replaced with `TestRemoveDirFromFilepaths`, which tests the new `remove_dir_from_filepaths` function.
2022-01-21 15:27:28 -06:00
Adam Golinski
2954bf9a48
[docs][tune] Fix typo in schedulers.rst (#21777)
Fix typo in schedulers.rst
2022-01-21 13:21:01 -08:00
Chen Shen
c401e94d2e
[resource-reporting] refactor resource scheduler 1/n #21732 2022-01-21 12:39:21 -08:00
shrekris-anyscale
75b3080834
[Serve] Serve Autoscaling Release tests (#21208) 2022-01-21 12:08:25 -08:00
Clark Zinzow
2cd3045b16
[Test Infra] Fix e2e.py help info for --report (#21757)
This momentarily confused me as to whether --report would enable or disable reporting.
2022-01-21 03:29:50 -08:00
mwtian
c85546a884
[Test] increase timeout for test_traceback.py (#21765)
`test_traceback.py` was taking ~55s to finish recently, and since today it starts to time out at 60s more frequently. All test cases do succeed so increase its test time out for now. We will look into if there is any performance regression separately.
2022-01-20 23:57:49 -08:00
SangBin Cho
5514711a35
[Part 5] Set actor died error message in ActorDiedError (#20903)
This is the second last PR to improve `ActorDiedError` exception. 

This propagates Actor death cause metadata to the ray error object. In this way, we can raise a better actor died error exception.

After this PR is merged, I will add more metadata to each error message and write a documentation that explains when each error happens. 

TODO
- [x] Fix test failures
- [x] Add unit tests
- [x] Fix Java/cpp cases

Follow up PRs
- Not allowing nullptr for RayErrorInfo input.
2022-01-20 22:11:11 -08:00
SangBin Cho
9728d98586
[Internal Observability] Event stats segfault due to thread safety 2/2 (#21737)
This fixes the event stats segfault. I made a consistent repro script that can have 3~5 segfault per each `placement_group_mini_integration_test` and verified this fixes the issue by running it more than 5 times.

Note that when GCS HA is on, the same repro uncovered a different shutdown bug from the pubsub (resubscription seems to happen, and currently idempotency is not supported). I will make a separate issue so that @mwtian can handle it.

## Root cause
The problem was that we are calling `gcs_client_->Disconnect()` from the task execution event loop. Note that gcs_client_ uses io_service as its main event loop, meaning now Disconnect is called in a different thread.

If you look at Disconnect method, it resets lots of pointers. Resetting pointers of attributes that can be used by other threads -> segfault.

This fixes the issue by "not resetting pointers". We do this after we join the io service thread, cuz it can now guarantee the gcs client won't be used by other threads (resetting gcs_client_ is probably not necessary, but I added it as a safety check).

Note that gcs_client is currently written in very complicated way, and we need some refactoring to make thread-safety & code structure cleaner. But I will do this task once I have bandwidth.
2022-01-20 22:10:34 -08:00
mwtian
f18a8bd87f
[Dashboard] turn a noisy info log into debug (#21746)
Currently, dashboard log contains many repeated entries like `Received a log for 172.31.47.219 and autoscaler` which is too noisy.
2022-01-20 22:08:23 -08:00
Lingxuan Zuo
43ea467896
Ray support internal native deps reused (#21641)
To make other system or internal project reuse ray deps bazel function, we need change this local accessing style to global accessing with ray-project namespace.

Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
2022-01-21 13:56:40 +08:00
qicosmos
7172802d8c
[C++ Worker][xlang] Support calling python worker (#21390)
C++ API need to call python and java worker, this pr support call python worker. Call python worker is similar with call c++ worker, need to pass PyFunction, PyActorClass and PyActorMethod.

## call python normal task
```python
#test_cross_language_invocation.py
import ray

@ray.remote
def py_return_input(v):
    return v
```

c++ api call python function
```c++
  auto py_obj1 = ray::Task(ray::PyFunction</*ReturnType*/int>{/*module_name=*/"test_cross_language_invocation",
                                                     /*function_name=*/"py_return_input"})
                     .Remote(42);
  EXPECT_EQ(42, *py_obj1.Get());
```
The user need to fill python module name and function name, then pass arguments into the remote.
The user also need to assign the return type and arguments types of the python function, it used to do static safe checking and get result.

## call python actor task
```python
#test_cross_language_invocation.py
@ray.remote
class Counter(object):
    def __init__(self, value):
        self.value = int(value)

    def increase(self, delta):
        self.value += int(delta)
        return str(self.value)
```
c++ api call python actor function
```c++
  // Create  python actor
  auto py_actor_handle =
      ray::Actor(ray::PyActorClass{/*module_name=*/"test_cross_language_invocation",  /*class_name=*/"Counter"})
          .Remote(1);
  EXPECT_TRUE(!py_actor_handle.ID().empty());

  // Call python actor task
  auto py_actor_ret =
      py_actor_handle.Task(ray::PyActorMethod</*ReturnType*/std::string>{/*actor_function_name=*/"increase"}).Remote(1);
  EXPECT_EQ("2", *py_actor_ret.Get());
```
The user need to fill python module name and class name when creating python actor.

PyActorMethod only need to fill the function name.

It's also similar with calling c++ actor task, also has compile-time safe checking.
2022-01-21 13:55:30 +08:00
xwjiang2010
c22a9fa731
[Docs] pin ray lightning version to fix lint error. (#21764) 2022-01-20 17:52:24 -08:00
mwtian
0dbe4b3a56
[Pubsub] fix driver warning for not keeping up with worker logs (#21717)
GCS pubsub uses long polling, so the subscriber waits instead of returning None from polling when there is no buffered log. It needs a different heuristic to decide if the driver is not keeping up with logs from the worker.
2022-01-20 16:32:42 -08:00
Richard Liaw
0da693c2e3
[codeowners] rllib fix (#21748) 2022-01-20 16:29:35 -08:00
Yi Cheng
9a88a60f6a
[gcs/ha] Fix global_state_accessor_test to support redisless mode (#21729)
This PR enables global_state_accessor_test to support redisless mode so that we can enable the flag by default in the future.
2022-01-20 15:32:49 -08:00
xwjiang2010
9af8f11191
Revert "[docs] Clean up doc structure (first part) (#21667)" (#21763)
This reverts commit 38e46c9fb3.
2022-01-20 15:30:56 -08:00
Yi Cheng
90093769df
[nightly] Add more many tasks tests (#21727)
This PR add four tests for many tasks:

many short tasks send from the single node
many short tasks send from multiple nodes
many long tasks send from multiple nodes
many long tasks send from the single node
TODO: migrate many nodes actor tests to this one.

scheduling envelop should contain:

(tasks): scheduling_test_many_xx_tasks_yy_nodes
(actors):many_nodes_actor_test (to be combined with this one)
(shuffle): pipelined_ingestion_1500_gb_15_windows
(shuffle): dask_on_ray_1tb_sort
2022-01-20 14:52:26 -08:00
Yi Cheng
3c63a8410d
[gcs/ha] Fix java related error when enable redisless ray (#21692)
This PR enables ray java to be able to run without redis. It also fixes java related tests and updated the pipeline.
2022-01-20 13:56:25 -08:00
Archit Kulkarni
f058a1d342
[Jobs] Stream logs during job instead of only at the end (#21659)
Closes https://github.com/ray-project/ray/issues/21517
2022-01-20 15:21:07 -06:00
Jiajun Yao
fa9feb5033
Fix replace_symlinks_with_junctions for windows (#21720)
Windows cmd.exe doesn't interpret single quote correctly. See https://github.com/conda-forge/ray-packages-feedstock/pull/43
2022-01-20 12:38:56 -08:00
matthewdeng
976ba5dbfe
[train] fix fashion mnist example (#21689)
Making some minor fixes.

1. Update input `batch_size` to be global batch size. Introduce `worker_batch_size` so each iteration trains same global batch size.
2. Update dataset `size` calculation to only refer to the fraction of the data that is trained on each worker. This allows calculations (e.g. training progress, accuracy) to be correct.
3. Add `model.train()` for generality.
4. Remove `smoke-test` flag since it's not really being used.
2022-01-20 12:26:02 -08:00
SangBin Cho
b6d3e01e0b
Revert "WINDOWS: enable passing metric tests (#21705)" (#21738)
This reverts commit 8104fd5c76.
2022-01-20 07:27:49 -08:00
Max Pumperla
38e46c9fb3
[docs] Clean up doc structure (first part) (#21667) 2022-01-20 16:19:04 +01:00
mwtian
a4581e58ee
[Pubsub] improve error handling for GCS AIO subscribers in dashboard (#21712)
- Tolerate GRPC deadline exceeded and transient failures in Python GCS AIO subscribers, which becomes consistent with Python GCS synchronous subscribers.
- Tolerate any exception in dashboard for subscribing to logs and error info, which becomes consistent with how dashboard handles GRPC errors for obtaining node stats.
2022-01-20 07:04:54 -08:00
Sven Mika
c4636c7c05
[RLlib] Issue 21633: SimpleQ should not use a prio. replay buffer. (#21665) 2022-01-20 11:46:25 +01:00
Eric Liang
5065156dd9
Set task retry delay to zero (#21690) 2022-01-19 23:41:35 -08:00
SangBin Cho
e3357eb9e5
[Internal Observability] Fix the event stats segfault 1/2 (#21593)
This PR is a pre-work before actually fixing a thread-safety bug within shutdown.

It is doing
- Add better logging upon core worker shutdown.
- Improve document around core worker shutdown.
- Remove unnecessary pointer usage from periodical runner for clean destruction order.
- Remove unnecessary `WaitForShutdown` API and combine them into a single `Shutdown` API.
2022-01-19 23:08:54 -08:00
Shantanu
ae60548ef3
Silence "cut: write error: Broken pipe" log spew (#21686)
On machines without GPUs, this can run subprocesses that spew to
stderr. Then with log_to_driver=True, we get log spew from every
single raylet. To avoid this, disable the GPU usage check on
certain errors.

Resolves #14305

Co-authored-by: hauntsaninja <>
2022-01-19 23:01:10 -08:00
Hao Chen
8dcc07ec9c
[Fix][Locality] ref count should remove object locations for dead nodes (#21548)
When a node is dead, reference table should remove locations for those objects on the node. Otherwise locality-aware scheduling will schedule tasks to the dead node.
2022-01-20 11:58:52 +08:00
Philipp Moritz
fbc51d6d0e
[Kuberay] Ray Autoscaler integration with Kuberay (MVP) (#21086)
This is a minimum viable product for Ray Autoscaler integration with Kuberay. It is not ready for prime time/general use, but should be enough for interested parties to get started (see the documentation in kuberay.md).
2022-01-19 19:42:17 -08:00
Archit Kulkarni
7d74a9face
[doc] add Ray versions 1.9.1 - 1.10.0 to dask on ray compatibility table (#21360)
I updated this version compatibility table on the release branch but didn't update it on master.  This is my mistake, the process is to make a PR to master and then cherry pick that commit to the release branch.
2022-01-19 18:55:05 -08:00
Wilson Wang
2626c64060
Fix monitor.py exceptions. Enable fetching GCS address from Redis with retries. (#21533)
GCS, when running as an individual component, can cause other components to fail in case of crashes. 

Here are two main cases covered in this patch:

1. monitor.py will raise an exception when disconnected from GCS.
2. When GCS becomes available later than other components, the missing KV of GCS address can cause other components to fail to start.


In our patch, we fixed these two issues as well as increased the timeout for redis connection which was too small.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-19 18:48:03 -08:00