Commit graph

5917 commits

Author SHA1 Message Date
Matti Picus
f4da0410b3
WINDOWS: unskip actor, component_failure, failure tests (#21492)
Unskip windows tests that pass locally
2022-01-13 23:16:22 -08:00
Stephanie Wang
1df67eb977
[core] Avoid ObjectID collisions for re-executed tasks (#21395)
If a task is re-executed on failure, it will deterministically generate the same IDs for any ray.put or .remote task calls because it uses its own task ID as a seed. This can cause problems if those objects conflict with previous versions that still exist in the cluster.

This PR adds the execution attempt number to the current task ID seed. This avoids collisions with any ObjectIDs generated by the previous execution attempt of the task.
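
A minimal sketch of the idea, assuming a hash-based derivation. The function `derive_object_id` and its parameters are hypothetical and do not reflect Ray's actual ID layout; the point is only that mixing the attempt number into the seed makes a retry produce different IDs.

```python
import hashlib


def derive_object_id(task_id: bytes, attempt_number: int, put_index: int) -> str:
    """Hypothetical ID derivation, not Ray's real scheme."""
    # The seed combines the task ID, the execution attempt, and a per-task counter.
    seed = (
        task_id
        + attempt_number.to_bytes(4, "little")
        + put_index.to_bytes(4, "little")
    )
    return hashlib.sha1(seed).hexdigest()


task_id = b"task-1234"
first_attempt = derive_object_id(task_id, attempt_number=0, put_index=1)
retry = derive_object_id(task_id, attempt_number=1, put_index=1)

# Same task and same put index, but the retry no longer collides.
print(first_attempt != retry)
```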

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-01-13 18:18:55 -08:00
Yi Cheng
e4ba51f25b
[core] Add GC for function table (#21509)
In Ray, functions are exported to the function table at runtime, but the table is not cleaned up after use. This PR garbage collects the resource when there is no job/detached actor referencing the resource.

Ideally, we should move the function table imports/exports feature to core, so gcs function manager is introduced, and currently, it's for reference counting only.
2022-01-13 18:06:05 -08:00
Yi Cheng
6dccfbffa9
Revert "Revert "[gcs] turn on grpc pubsub by default"" (#21585)
Reverts ray-project/ray#21584 and turns the flag off
2022-01-13 16:12:03 -08:00
mwtian
30968a9358
[GCS] support external Redis in GCS bootstrapping mode (#21436)
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic is removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.

Also, re-enable a few tests that restart GCS in GCS bootstrapping mode, by using external Redis for KV storage.
2022-01-13 16:01:11 -08:00
Jiajun Yao
d6dbf3b8bf
[scheduler] Set default max_pending_lease_requests_per_scheduling_category to 10 (#20404) 2022-01-13 13:50:56 -08:00
Yi Cheng
bc696212d2
Revert "[gcs] turn on grpc pubsub by default" (#21584)
test-reconnect seems flaky.
Reverts ray-project/ray#21513
2022-01-13 12:34:02 -08:00
mwtian
cf6a54ca46
[CI] pin pytest-asyncio (#21579) 2022-01-13 11:35:30 -08:00
Kai Fricke
a3442df584
[ci/multinode] Build multinode image with OpenSSH before running tests (#21544)
Currently we install OpenSSH on the fly in fake multinode docker testing. Instead, we can speed testing up a fair bit by first building a Docker image that includes OpenSSH and then running the tests with this image.
2022-01-13 08:47:04 -08:00
Gagandeep Singh
d392f97331
Unskipped tests in serve: test_controller_recovery.py (#21450) 2022-01-13 01:09:59 -08:00
Yi Cheng
6194783312
[gcs] turn on grpc pubsub by default (#21513)
Turn on grpc pubsub by default. This PR also fixes several tests that were failing before.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-12 22:13:03 -08:00
Gagandeep Singh
13f20e5e1e
[Serve] Unskipped tests in test_pipeline.py (#21484) 2022-01-12 13:56:50 -08:00
Kai Fricke
c1d4c22351
[ci/multinode] Follow-up fix for resource popping (#21543)
The previous change in #21531 unfortunately added the fix at the wrong location. This PR corrects this.
2022-01-11 17:47:42 -08:00
Antoni Baum
6ba4513777
[tune] Load experiment into searcher (#21506)
This PR adds a new method to the Searcher class, add_evaluated_trials. This method wraps around add_evaluated_point and allows the user to pass a Trial, list of Trials or ExperimentAnalysis to load into the searcher. Furthermore, this PR updates the HEBO version to the latest and removes outdated documentation, and adds add_evaluated_point methods to Dragonfly and SkOpt searchers.
2022-01-11 15:58:20 -08:00
Matti Picus
ec6a33b736
[tune] fixes to allow tune/tests/test_commands.py to run on windows (#21342)
tune does not run smoothly on Windows. This cleans up some blockers:
- use cross-platform shutil.get_terminal_size instead of Popen(stty)
- somehow Trainer.workers is None at the end of test_commands.py, so the cleanup command was erroring. The error was not fatal, but was printing in the logs.
- if run locally, the log files are all written to the same location, so the rsync-based syncing solution is not needed. This is the real fix for issue #20747
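
For reference, a short sketch of the cross-platform call mentioned in the first item. `shutil.get_terminal_size` is part of the Python standard library; the fallback value here is just the library default written out explicitly.

```python
import shutil

# Works on Windows, macOS and Linux; no subprocess call to `stty` is needed.
# The fallback is used when the size cannot be determined (for example, in CI).
size = shutil.get_terminal_size(fallback=(80, 24))
columns, lines = size.columns, size.lines
print(f"terminal is {columns} columns by {lines} lines")
```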
2022-01-11 15:57:20 -08:00
mwtian
45cddef2d3
[GCS] disable tests related to GCS restarting in GCS pubsub mode (#21534)
`test_failure_2.py::test_gcs_server_failiure_report` and `test_gcs_fault_tolerance.py::test_gcs_server_restart_during_actor_creation` cannot pass in GCS pubsub mode with the existing logic. Disable these tests in GCS pubsub mode and add comments about how we may fix them.

Also, suppress exceptions when sync subscribers are disconnected from GCS.

I can push changes in this PR to #21513 as well.
2022-01-11 14:14:05 -08:00
Kai Fricke
084bda87a5
[ci/multinode] Fix resource popping resulting in empty resource head nodes (#21531)
Fixes a small bug where we pop from the resources dict without making a copy, emptying the head node resources. This sometimes leads to empty head node resources.
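
A minimal illustration of this class of bug, using hypothetical function and key names rather than the actual node provider code. Popping from a dict obtained by reference mutates the caller's config; popping from a copy does not.

```python
def take_cpus_buggy(node_config: dict) -> int:
    resources = node_config["resources"]        # same dict object as the config
    return resources.pop("CPU", 0)              # mutates the head node config


def take_cpus_fixed(node_config: dict) -> int:
    resources = dict(node_config["resources"])  # shallow copy
    return resources.pop("CPU", 0)              # the config stays intact


head = {"resources": {"CPU": 4, "GPU": 1}}

take_cpus_fixed(head)
after_fixed = dict(head["resources"])   # CPU entry is still present

take_cpus_buggy(head)
after_buggy = dict(head["resources"])   # CPU entry has been removed
```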
2022-01-11 13:20:58 -08:00
Yi Cheng
d2d749b6f9
[workflow] Fix test_serialization.py (#21522)
The new version of responses introduces some errors in the test. This PR pins the version of responses.

It also pins moto in case of future updates upstream.
2022-01-11 11:45:18 -08:00
mwtian
0e5de61c18
remove unnecessary test filter (#21510)
(Comment from the PR:)
If a GRPC call exceeds its timeout, the call is cancelled on the client side but the server may still reply to it, leading to missed messages and test failures. Using a sequence number to ensure no message is dropped could be the long-term solution,
but its complexity, and the fact that Ray subscribers do not use deadlines in production, make it less preferred.
Therefore, a simpler workaround is used instead: a different subscriber is used for each get_error_message() call.

Also, re-enable some additional tests in GCS HA mode.
2022-01-11 10:17:03 -08:00
Gagandeep Singh
d47b82883a
Unskipped non-cluster tests in test_actor_resources.py (#21500) 2022-01-11 09:46:03 -08:00
Gagandeep Singh
a5a8156198
Unskipped tests in test_actor_failures (#21498) 2022-01-11 09:42:12 -08:00
Gagandeep Singh
e8df34af08
Unskipped test in test_autoscaling_policy (#21497)
The test passes on my Windows Azure VM. 

P.S. Is it related to cluster tests? I am not sure.
2022-01-11 09:40:37 -08:00
Eric Liang
9ac34ecc94
Revert "[workflow] Skip saving outputs of "workflow.wait"" (#21520)
This is breaking linux://python/ray/workflow:tests/test_wait per https://flakey-tests.ray.io/
2022-01-10 20:51:42 -08:00
Kai Fricke
5a7f6e4fdd
[rfc][ci] create fake docker-compose cluster environment (#20256)
Following #18987 this PR adds a docker-compose based local multi node cluster.

The fake multinode docker comprises two parts. The docker_monitor.py script is a watch script calling docker compose up whenever the docker-compose.yaml changes. The node provider creates and updates the docker compose according to the autoscaling requirements.

This mode fully supports autoscaling and comes with test utilities to start and connect to docker-compose autoscaling environments. There's also a sample test case showing how this can be used.
2022-01-11 04:35:36 +00:00
Gagandeep Singh
4a8a8b30b0
Skipped test_reference_counting_2 and test_actor (#21507) 2022-01-10 20:34:03 -08:00
hckuo
7955333ffd
[runtime env] allow working_dir to be a zipped package (#20826)
Check if working_dir is a zip, unzip it if so.
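
A rough sketch of the check described above, using only the standard library. The helper `prepare_working_dir` is hypothetical and is not the runtime env implementation; it only shows the "is it a zip, then unzip" step.

```python
import tempfile
import zipfile
from pathlib import Path


def prepare_working_dir(working_dir: str, target_dir: str) -> str:
    """Hypothetical helper: unpack working_dir if it is a zip archive."""
    if Path(working_dir).is_file() and zipfile.is_zipfile(working_dir):
        with zipfile.ZipFile(working_dir) as archive:
            archive.extractall(target_dir)
        return target_dir
    # Plain directories are used as they are.
    return working_dir


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny zipped package to exercise the helper.
    package = Path(tmp) / "package.zip"
    with zipfile.ZipFile(package, "w") as archive:
        archive.writestr("main.py", "print('hello')\n")

    unpacked = prepare_working_dir(str(package), str(Path(tmp) / "unpacked"))
    unpacked_files = sorted(p.name for p in Path(unpacked).iterdir())
    same_dir = prepare_working_dir(tmp, str(Path(tmp) / "unused"))
    directory_passthrough = same_dir == tmp
```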
2022-01-10 18:29:01 -06:00
Siyuan (Ryans) Zhuang
6e568d2c02
[workflow] Skip saving outputs of "workflow.wait" (#21183) 2022-01-10 15:37:13 -08:00
Amog Kamsetty
bcae6ba6c9
[Train] _WrappedDataLoader yield tuples (#21467)
Fixes bug with _WrappedDataLoader that yields a generator instead of a tuple.

Addresses https://discuss.ray.io/t/ray-train-creates-typeerror-generator-object-is-not-subscriptable/4605/10
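
The failure mode can be reproduced with a small stand-in class. `WrappedLoader` below is hypothetical and is not Ray Train's `_WrappedDataLoader`; it only shows why yielding a generator expression breaks indexing and how wrapping it in `tuple` fixes that.

```python
class WrappedLoader:
    """Hypothetical wrapper that applies a transform to every item of a batch."""

    def __init__(self, batches, transform):
        self._batches = batches
        self._transform = transform

    def iter_buggy(self):
        for batch in self._batches:
            # Yields a generator object, which cannot be indexed or unpacked twice.
            yield (self._transform(item) for item in batch)

    def __iter__(self):
        for batch in self._batches:
            # Materialize the batch so callers get a real tuple.
            yield tuple(self._transform(item) for item in batch)


loader = WrappedLoader([(1, 2), (3, 4)], transform=lambda x: x * 10)

features, label = next(iter(loader))
first_item = next(iter(loader))[0]

try:
    next(loader.iter_buggy())[0]
    buggy_error = None
except TypeError as exc:
    buggy_error = str(exc)   # "'generator' object is not subscriptable"
```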
2022-01-10 12:40:36 -08:00
Zyiqin-Miranda
71fae21e8e
[autoscaler] AWS Autoscaler CloudWatch Dashboard support (#20266)
These changes add a set of improvements to enable automatic creation and update of CloudWatch dashboards when provisioning AWS Autoscaling clusters. Successful implementation of these improvements will allow AWS Autoscaler users to:

1. Get rapid insights into their cluster state via CloudWatch dashboards.
2. Allow users to update their CloudWatch dashboard JSON configuration files during Ray up execution time.

Notes:
1.  This PR is a follow-up PR for #18619, adds dashboard support.
2022-01-10 10:18:53 -08:00
Gagandeep Singh
6420c75fd2
Unskipped test in test_advanced_2.py (#21503) 2022-01-10 09:06:44 -08:00
Matti Picus
5aef1e1708
remove deprecated unittest aliases (#21455)
In a [recent review](https://discuss.python.org/t/experience-with-python-3-11-in-fedora/12911) of the experience of the Fedora team porting packages to the upcoming Python 3.11, they remarked that most of the work was in removing deprecated aliases in unittest. I came across a few of these when looking at unrelated test failures, where the DeprecationWarnings caught my eye. So I made a quick sweep of the code, using `git grep` to find occurrences of the deprecated aliases:

old | new
---|---
assertEquals | assertEqual
assertNotEquals | assertNotEqual
assertRaisesRegexp | assertRaisesRegex
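
A small self-contained test case using the replacement names from the table. The test class is illustrative only and is not taken from the Ray test suite.

```python
import unittest


class RenamedAssertions(unittest.TestCase):
    def test_equal(self):
        self.assertEqual(1 + 1, 2)        # previously assertEquals

    def test_not_equal(self):
        self.assertNotEqual(1 + 1, 3)     # previously assertNotEquals

    def test_raises_regex(self):
        # previously assertRaisesRegexp
        with self.assertRaisesRegex(ValueError, "invalid literal"):
            int("not a number")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(RenamedAssertions)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```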
2022-01-09 20:29:54 -08:00
Gagandeep Singh
c43d4cc028
Unskipped test in test_kv_store.py (#21451)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2022-01-09 14:38:55 -08:00
Yi Cheng
bdfba88082
[2/3][kv] Add delete by prefix support for internal kv (#21442)
Delete by prefix for internal kv is necessary to clean up the function table. This will be used to fix issue #8822.
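
A sketch of what delete-by-prefix means, using an in-memory dict. The class `InMemoryKV` and the `del_by_prefix` flag are assumptions made for illustration and do not mirror the actual internal kv API.

```python
class InMemoryKV:
    """Hypothetical key-value store with optional prefix deletion."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def delete(self, key: bytes, del_by_prefix: bool = False) -> int:
        """Delete one key, or every key starting with `key`. Returns the count."""
        if del_by_prefix:
            doomed = [k for k in self._data if k.startswith(key)]
        else:
            doomed = [key] if key in self._data else []
        for k in doomed:
            del self._data[k]
        return len(doomed)

    def keys(self):
        return sorted(self._data)


kv = InMemoryKV()
kv.put(b"fn:job1:a", b"...")
kv.put(b"fn:job1:b", b"...")
kv.put(b"fn:job2:a", b"...")

# Remove everything exported by job1 in a single call.
deleted = kv.delete(b"fn:job1:", del_by_prefix=True)
remaining = kv.keys()
```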
2022-01-07 22:54:24 -08:00
mwtian
4a34233a90
[Core] allow message in deprecation annotation (#21466) 2022-01-07 21:52:31 -08:00
Simon Mo
f5ac915ed5
[Serve] Detect http.disconnect can cancel handle requests (#21438) 2022-01-07 21:01:34 -08:00
Yi Cheng
8fa9fddaa0
[1/3][kv] move some internal kv py logic into cpp (#21386)
This PR moves the internal kv namespace logic into cpp to reduce logic in python for the following reasons:

- internal kv is used in x-lang so we have to move it to cpp so that all langs can benefit.
- for https://github.com/ray-project/ray/issues/8822 we need to delete resources in gcs when a job finishes

One extra field is also added to del so that a delete can remove every key with a given prefix instead of just a single key
2022-01-07 17:35:06 -08:00
Jiajun Yao
501b78feaa
Remove dead tests related to the old scheduler (#21465) 2022-01-07 12:55:54 -08:00
Amog Kamsetty
123aa7cd2b
[Train] Improve usability for GPU Training (#21464)
Minor changes to improve the user experience for GPU Training.

Addresses https://discuss.ray.io/t/ray-train-doesnt-detect-gpu/4608
2022-01-07 11:53:53 -08:00
Gagandeep Singh
cc1000886a
[serve] Unskip tests in test_fastapi.py (#21422)
These tests pass on my machine. Unskipping them here for CI verification.
2022-01-07 11:27:15 -08:00
mwtian
bbf23ec59f
[GCS] enhance error message when failing to fetch GCS address or connecting to GCS (#21396)
Some tests are flaky because GCS client creation fails, but there is not enough information for debugging. With this PR, the exception message is printed after a GCS client creation failure. This PR also breaks GCS client creation down into two steps, reading the GCS address from Redis and creating the GCS client, which should help locate the issue.
2022-01-07 09:56:23 -08:00
Gagandeep Singh
51e4880477
[serve] Unskipped tests in test_constructor_failure.py & test_ray_client.py (#21423)
These tests pass on my machine. Unskipping them here for CI.
2022-01-07 01:53:13 -08:00
Gagandeep Singh
39697cf69c
Unskipped test_snapshot_always_written_to_internal_kv (#21350) 2022-01-07 00:57:23 -08:00
Matti Picus
f3dcd1fac1
WINDOWS: re-enable runtime_env tests, skip cluster tests in serve (#21398)
After enabling tests of test_runtime_env_plugin and test_runtime_env_env_vars (PR #21252) and python/ray/serve:* tests (PR #21107), the analysis at flaky-tests.ray.io started showing failing tests in the windows://python/ray/test/serv:test_standalone. PR #21352 reverted #21252 (runtime_env tests), but the problem was more likely in the serve tests. Specifically, `test_standalone` has a test that uses Cluster, which should be skipped on Windows because it is flaky. So this PR
- re-enables the runtime_env tests for windows
- skips the Cluster test in serve/tests/test_standalone.py
2022-01-06 21:43:58 -08:00
Eric Liang
e9068c45fa
[data] Instrument most remaining dataset functions and add docs (#21412)
This PR finishes most of the stats todos for dataset. The main thing punted for future work is instrumentation of split(), which is particularly tricky since only certain blocks are transformed.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-01-06 17:08:56 -08:00
Alex Wu
8cf4071759
[core] Nested tasks on by default (#20800)
This PR turns worker capping on by default. Note that there are a couple of faulty tests that this uncovers which are fixed here.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-06 15:00:03 -08:00
Archit Kulkarni
c7b2d549e3
[runtime env] Fix "conda" field for M1 macs (#21229)
Currently when the "conda" field of runtime_env is specified, we automatically insert the currently running Ray wheel in the conda dependencies (in the nested `pip` list).  This Ray wheel is specified by a URL to Amazon S3, which is where we store our Ray wheels.  

Unfortunately, the M1 wheels are currently built manually and uploaded directly to PyPI, and this only happens once for each stable release (in contrast to non-M1 wheels, which are auto-built and uploaded to S3 for every commit on master and release branches). So prior to this PR, if you tried to use the `"conda"` field on M1, it would fail with a message saying it couldn't find the appropriate wheel for the platform.

To fix this, when the Ray cluster is running on an M1 Mac, the only thing we can do for now is to insert `"ray=={ray.__version__}"` as our `pip` specifier, instead of the (nonexistent) S3 URL.

The downside of this approach is (1) nightly wheels and wheels built from commits on master remain unsupported for M1, and (2) we cannot end-to-end test this codepath on a new stable version of Ray before that version is actually released to PyPI.  However, this PR adds a unit test.
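
The selection logic can be sketched as below. The function name, its parameters, and the placeholder wheel URL are all hypothetical; only the `ray==<version>` specifier format comes from the description above.

```python
def ray_pip_specifier(version: str, system: str, machine: str, wheel_url: str) -> str:
    """Hypothetical helper: choose what to insert into the conda `pip` list."""
    if system == "Darwin" and machine == "arm64":
        # No per-commit wheel exists for M1, so fall back to the PyPI release.
        return f"ray=={version}"
    return wheel_url


# Placeholder URL for illustration only.
url = "https://example.com/ray-wheel.whl"

m1_spec = ray_pip_specifier("1.9.2", system="Darwin", machine="arm64", wheel_url=url)
linux_spec = ray_pip_specifier("1.9.2", system="Linux", machine="x86_64", wheel_url=url)
```

In real code the `system` and `machine` values would typically come from `platform.system()` and `platform.machine()`.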
2022-01-06 09:48:59 -06:00
Kai Fricke
976ece4bc4
[tune] Add test for heterogeneous resource request deadlocks (#21397)
This adds a test for potential resource deadlocks in experiments with heterogeneous PGFs. If the PGF of a later trial becomes ready before that of a previous trial, we could run into a deadlock. This is currently avoided, but untested, flagging the code path for removal in #21387.
2022-01-06 10:44:30 +00:00
Qing Wang
132e2b2a96
[Core] Remove unused flag put_small_object_in_memory_store (#21284)
Since we have not been using the `put_small_object_in_memory_store` flag for a long time, it should be removed.
2022-01-06 14:46:58 +08:00
xwjiang2010
9528ac62cd
[tune] remove unused return_or_clean_cached_pg. (#21403)
Unused code path.
2022-01-05 23:20:43 +00:00
Gagandeep Singh
62c9fc95ea
[CI] [Serve] Unskipped test and bumped wait time to avoid race condition in test_deploy.py (#21382) 2022-01-05 14:28:42 -08:00