Commit graph

5894 commits

Author SHA1 Message Date
SangBin Cho
b6d3e01e0b
Revert "WINDOWS: enable passing metric tests (#21705)" (#21738)
This reverts commit 8104fd5c76.
2022-01-20 07:27:49 -08:00
Max Pumperla
38e46c9fb3
[docs] Clean up doc structure (first part) (#21667) 2022-01-20 16:19:04 +01:00
mwtian
a4581e58ee
[Pubsub] improve error handling for GCS AIO subscribers in dashboard (#21712)
- Tolerate GRPC deadline exceeded and transient failures in Python GCS AIO subscribers, making them consistent with Python GCS synchronous subscribers.
- Tolerate any exception in dashboard when subscribing to logs and error info, making this consistent with how dashboard handles GRPC errors when obtaining node stats.
2022-01-20 07:04:54 -08:00
Hao Chen
8dcc07ec9c
[Fix][Locality] ref count should remove object locations for dead nodes (#21548)
When a node is dead, the reference table should remove the locations of objects on that node. Otherwise locality-aware scheduling will schedule tasks to the dead node.
2022-01-20 11:58:52 +08:00
Philipp Moritz
fbc51d6d0e
[Kuberay] Ray Autoscaler integration with Kuberay (MVP) (#21086)
This is a minimum viable product for Ray Autoscaler integration with Kuberay. It is not ready for prime time/general use, but should be enough for interested parties to get started (see the documentation in kuberay.md).
2022-01-19 19:42:17 -08:00
Wilson Wang
2626c64060
Fix monitor.py exceptions. Enable fetching GCS address from Redis with retries. (#21533)
GCS, when running as an individual component, can cause other components to fail in case of crashes. 

Here are two main cases covered in this patch:

1. monitor.py will raise an exception when disconnected from GCS.
2. When GCS becomes available later than other components, the missing KV of GCS address can cause other components to fail to start.


This patch fixes both issues and also increases the Redis connection timeout, which was too small.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-19 18:48:03 -08:00
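The second case in #21533 above (a component starting before the GCS address has been written) is commonly handled with a bounded retry loop. The following is a minimal sketch of that idea only. It assumes a hypothetical `fetch_address` callable that returns `None` while the key is missing; the function name, parameters and defaults are illustrative and are not taken from the Ray code base.

```python
import time
from typing import Callable, Optional


def get_address_with_retries(
    fetch_address: Callable[[], Optional[str]],  # hypothetical lookup, e.g. a KV read
    num_retries: int = 5,
    delay_s: float = 0.1,
) -> str:
    """Poll fetch_address until it returns a value or the retries run out."""
    for _ in range(num_retries):
        address = fetch_address()
        if address is not None:
            return address
        # The key is not there yet: wait before asking again.
        time.sleep(delay_s)
    raise RuntimeError(f"address not available after {num_retries} attempts")


# Usage: simulate a key that only appears on the third lookup.
responses = iter([None, None, "127.0.0.1:6379"])
address = get_address_with_retries(lambda: next(responses), delay_s=0.01)
print(address)
```

Raising after the last attempt keeps the failure loud instead of letting the component continue with a missing address.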
Matti Picus
8104fd5c76
WINDOWS: enable passing metric tests (#21705) 2022-01-19 17:09:34 -08:00
Eric Liang
88143cdc35
[data] Unify key function type and error handling across sort, groupby, and agg (#21627)
Prior to this PR, sort, groupby, and aggregate defined separate types for extracting values from Dataset records. This was confusing since the user had to understand the differences between the different key types (which were basically exactly the same).

This PR defines a common key type: KeyFn, which is simply Union[None, str, Callable[[T], Any]]. This is used as sort(KeyFn...), aggregate(Agg(KeyFn)...), groupby(KeyFn).agg(Agg(KeyFn), ...).

It also unifies the error generation paths to a common _validate_key_fn utility. This also improves the errors generated when passing explicit AggregateFn classes, which previously failed in the workers if invalid.
2022-01-19 11:15:13 -08:00
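The common key type described in #21627 above can be sketched as follows. The `KeyFn` alias is quoted from the commit message; `validate_key_fn` and `extract_key` are hypothetical helpers written for this illustration and are not the actual `_validate_key_fn` implementation.

```python
from typing import Any, Callable, TypeVar, Union

T = TypeVar("T")

# The common key type quoted in the commit message.
KeyFn = Union[None, str, Callable[[T], Any]]


def validate_key_fn(key: Any) -> None:
    """Hypothetical stand-in for a shared validation utility."""
    if key is None or isinstance(key, str) or callable(key):
        return
    raise TypeError(
        f"key must be None, a string, or a callable, got {type(key).__name__}"
    )


def extract_key(record: Any, key: KeyFn) -> Any:
    """Apply a KeyFn to one record (illustrative only)."""
    validate_key_fn(key)
    if key is None:
        return record       # use the record itself as the key
    if isinstance(key, str):
        return record[key]  # look up a named column
    return key(record)      # call the user-supplied function


rows = [{"a": 3}, {"a": 1}, {"a": 2}]
by_name = sorted(rows, key=lambda r: extract_key(r, "a"))
by_fn = sorted(rows, key=lambda r: extract_key(r, lambda x: -x["a"]))
print(by_name, by_fn)
```

Validating once, in one place, means an invalid key fails early with the same error regardless of which operation received it.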
Yi Cheng
82103bf7c1
[gcs/ha] Fix cpp tests related to redis removal (#21628)
This PR fixes the cpp tests and also makes ray cpp pass.
2022-01-19 01:26:34 -08:00
Kai Fricke
8fd5b7a5a8
Tune test autoscaler / fix stale node detection bug (#21516)
See #21458. Currently, Tune keeps its own list of alive node IPs, but this information is only updated every 10 seconds and is usually stale when a new node is added. Because of this, the first trial scheduled on this node is usually marked as failed. This PR adds a test confirming this behavior and gets rid of the unneeded code path.

Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
2022-01-18 16:20:16 -08:00
dependabot[bot]
1f563aaf9b
[data](deps): Bump dask[complete] from 2021.11.0 to 2022.1.0 in /python/requirements/data_processing (#21621)
Bumps [dask[complete]](https://github.com/dask/dask) from 2021.11.0 to 2022.1.0.
2022-01-18 15:32:07 -08:00
mwtian
ef9d9df4e7
[Doc] add comment for waiting for Ray to shutdown in test_client_reconnect.py (#21672) 2022-01-18 12:06:08 -08:00
Jiajun Yao
fa5c167717
Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988)" (#21661)
This reverts commit 4a55d10bb1.
2022-01-18 06:11:20 -08:00
mwtian
4faf3e1e31
[GCS] reenable test_client_reconnect.py for GCS HA builds (#21589)
In test_client_reconnect.py, each test case starts a Ray cluster via the client server's default_connect_handler(). The Ray cluster shuts down implicitly when start_middleman_server() ends and Python garbage-collects the client server. After turning on GCS pubsub, the time at which the client server is garbage-collected changes. Sometimes the Ray cluster from a previous test case stays alive after the next test case starts and shuts down later, leading to test failures due to lost data or crashes (a race during worker shutdown, which will be investigated separately).

This PR makes sure each test case shuts down its Ray cluster.
2022-01-17 23:08:47 -08:00
Guyang Song
c321e6e5bd
[script] support using hostname as node_ip_address (#20720) 2022-01-18 11:05:50 +08:00
Gagandeep Singh
970b7b2a4b
Unskip tests from ci.sh (#21483) 2022-01-17 15:22:57 -08:00
Qing Wang
a5cabb324b
Remove streaming deploying process. (#21603)
1. Remove the streaming from deploying to maven central.
2. Remove related streaming stuff from setup.py.
2022-01-17 23:37:48 +08:00
Yi Cheng
87d852fc28
[gcs/ha] Fix some tests failed in HA mode (#21587)
This PR fixes and re-enables the following tests in HA mode:

- //python/ray/tests:test_healthcheck
- //python/ray/tests:test_autoscaler_drain_node_api 
- //python/ray/tests:test_ray_debugger
2022-01-16 21:53:14 -08:00
Simon Mo
86bbf28e4c
[CI] Fix test_get_deployment and test_runtime_env_validation (#21637) 2022-01-16 17:25:14 -08:00
Yi Cheng
927c5467eb
[gcs/function table] Change function table keys' prefix from binary to hex (#21616)
When cleaning up the function table, we use the prefix to delete the data. But right now the prefix contains binary data, which doesn't work well with redis keys/scan, which use `*` in the pattern.

For example, when job id increases to 41, it'll delete the keys for job 1 which leads to the new worker failing to import the function.

This PR uses hex of job id to avoid this.
2022-01-15 21:58:14 -08:00
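The problem described in #21616 above can be reproduced in a few lines. This is a sketch under stated assumptions: Python's `fnmatch` stands in for Redis glob matching, the `fn:` key layout is hypothetical, and the job ID is reduced to a single byte for readability. Byte 42 is used because it is the ASCII code of `*`; the exact job ID at which this happens in Ray depends on its ID encoding, which is not shown in the commit message.

```python
import fnmatch


def binary_key(job_id: int, name: str) -> str:
    # Hypothetical key layout with the raw job ID byte embedded in the key.
    return "fn:" + bytes([job_id]).decode("latin-1") + ":" + name


def hex_key(job_id: int, name: str) -> str:
    # Same layout, but the job ID is hex-encoded ([0-9a-f] only).
    return "fn:" + bytes([job_id]).hex() + ":" + name


# Cleanup patterns for job 42: prefix followed by a trailing wildcard.
binary_pattern = binary_key(42, "*")  # "fn:*:*", the ID byte itself is a wildcard
hex_pattern = hex_key(42, "*")        # "fn:2a:*"

# With raw bytes, cleaning up job 42 also matches a key that belongs to job 1.
binary_hits_job_1 = fnmatch.fnmatchcase(binary_key(1, "f"), binary_pattern)

# With hex, the pattern only matches keys of job 42.
hex_hits_job_1 = fnmatch.fnmatchcase(hex_key(1, "f"), hex_pattern)
hex_hits_job_42 = fnmatch.fnmatchcase(hex_key(42, "f"), hex_pattern)

print(binary_hits_job_1, hex_hits_job_1, hex_hits_job_42)
```

Hex encoding doubles the length of the ID in the key, but guarantees that the prefix never contains a glob metacharacter.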
Kai Fricke
d84154a774
[ci/multinode] Add utilities to kill nodes in multi node testing (#21580)
Killing nodes enables advanced fault tolerance testing. This PR adds utilities and a test for this functionality in fake multinode docker mode.
2022-01-15 17:11:16 -08:00
Eric Liang
a971774820
Improve errors raised by ds.groupby() of unsupported key type (#21610) 2022-01-15 16:35:31 -08:00
Kai Yang
4a55d10bb1
[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988)
This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`.

Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, and `aggregate_combined_blocks` in `PandasBlockAccessor` redirect to the arrow block format implementation for now. They'll be implemented in a later PR.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-01-15 17:28:34 +08:00
Archit Kulkarni
26057c433f
[CI] pin uvicorn to 0.16.0 to fix serve (#21612) 2022-01-14 16:00:51 -08:00
Gagandeep Singh
f8bcb8aeb6
Unskipped tests in test_actor.py (#21501) 2022-01-14 08:46:46 -08:00
Jialing He
ded4128ebf
[Core] dlmalloc allocate bottom-most memory chunk failed (#21439)
Why are these changes needed?
Fix a dlmalloc allocation bug; details are in #21310.
* fix dlmalloc bug

* make lint happy

* make lint happy

* fix by comment

* use _check_spilled_mb

* add cpp UT
2022-01-13 23:53:29 -08:00
Jiajun Yao
e0f4636477
Fix simple dataset sort generating only 1 non-empty block (#21588) 2022-01-13 23:50:24 -08:00
Matti Picus
f4da0410b3
WINDOWS: unskip actor, component_failure, failure tests (#21492)
Unskip windows tests that pass locally
2022-01-13 23:16:22 -08:00
Stephanie Wang
1df67eb977
[core] Avoid ObjectID collisions for re-executed tasks (#21395)
If a task is re-executed on failure, it will deterministically generate the same IDs for any ray.put or .remote task calls because it uses its own task ID as a seed. This can cause problems if those objects conflict with previous versions that still exist in the cluster.

This PR adds the execution attempt number to the current task ID seed. This avoids collisions with any ObjectIDs generated by the previous execution attempt of the task.

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2022-01-13 18:18:55 -08:00
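The idea in #21395 above can be illustrated with a small hash-based sketch. It assumes IDs are derived by hashing a seed; the functions `old_object_id` and `new_object_id` and the byte layout are hypothetical and do not reflect Ray's actual ObjectID format.

```python
import hashlib


def old_object_id(task_id: bytes, index: int) -> str:
    # Seed is the task ID only, so a re-execution produces the same IDs.
    seed = task_id + index.to_bytes(4, "big")
    return hashlib.sha1(seed).hexdigest()


def new_object_id(task_id: bytes, attempt: int, index: int) -> str:
    # Seed additionally contains the execution attempt number.
    seed = task_id + attempt.to_bytes(4, "big") + index.to_bytes(4, "big")
    return hashlib.sha1(seed).hexdigest()


task = b"task-abc"

# Without the attempt number, the first put of every attempt collides.
old_first = old_object_id(task, index=1)
old_retry = old_object_id(task, index=1)

# With the attempt number, each attempt gets its own ID for the same put.
new_first = new_object_id(task, attempt=0, index=1)
new_retry = new_object_id(task, attempt=1, index=1)

print(old_first == old_retry, new_first == new_retry)
```

The IDs stay deterministic within one attempt, which is what makes them reproducible, while different attempts no longer share them.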
Yi Cheng
e4ba51f25b
[core] Add GC for function table (#21509)
In Ray, functions are exported to the function table at runtime, but they are not cleaned up after use. This PR garbage collects these entries when no job or detached actor references them.

Ideally, the function table import/export feature should move to core. As a first step, this PR introduces a GCS function manager, which for now is used for reference counting only.
2022-01-13 18:06:05 -08:00
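The reference-counting scheme described in #21509 above can be sketched as follows. This is an illustration only: the class `FunctionTableRefCounter`, its methods and the `fn:<job>:` key layout are hypothetical, and a plain dict stands in for the key-value store.

```python
from collections import defaultdict


class FunctionTableRefCounter:
    """Hypothetical reference counter keyed by job ID."""

    def __init__(self, kv: dict):
        self._kv = kv                  # stand-in for the function table storage
        self._refs = defaultdict(int)  # job ID -> number of live references

    def add_ref(self, job_id: str) -> None:
        # Called when a job starts or a detached actor is created.
        self._refs[job_id] += 1

    def remove_ref(self, job_id: str) -> None:
        # Called when a job finishes or a detached actor dies.
        self._refs[job_id] -= 1
        if self._refs[job_id] == 0:
            del self._refs[job_id]
            prefix = f"fn:{job_id}:"
            for key in [k for k in self._kv if k.startswith(prefix)]:
                del self._kv[key]


kv = {"fn:01:f": b"...", "fn:01:g": b"...", "fn:02:h": b"..."}
counter = FunctionTableRefCounter(kv)
counter.add_ref("01")      # the job itself
counter.add_ref("01")      # a detached actor created by the job
counter.remove_ref("01")   # job finishes: the actor still holds a reference
kept_after_job_exit = "fn:01:f" in kv
counter.remove_ref("01")   # actor dies: entries for job 01 are collected
print(kept_after_job_exit, sorted(kv))
```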
Yi Cheng
6dccfbffa9
Revert "Revert "[gcs] turn on grpc pubsub by default"" (#21585)
Reverts ray-project/ray#21584 and turns the flag off.
2022-01-13 16:12:03 -08:00
mwtian
30968a9358
[GCS] support external Redis in GCS bootstrapping mode (#21436)
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic is removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.

Also, re-enable a few tests which restart GCS in GCS bootstrapping mode, by using external Redis for KV storage.
2022-01-13 16:01:11 -08:00
Jiajun Yao
d6dbf3b8bf
[scheduler] Set default max_pending_lease_requests_per_scheduling_category to 10 (#20404) 2022-01-13 13:50:56 -08:00
Yi Cheng
bc696212d2
Revert "[gcs] turn on grpc pubsub by default" (#21584)
test-reconnect seems flaky.
Reverts ray-project/ray#21513
2022-01-13 12:34:02 -08:00
mwtian
cf6a54ca46
[CI] pin pytest-asyncio (#21579) 2022-01-13 11:35:30 -08:00
Kai Fricke
a3442df584
[ci/multinode] Build multinode image with OpenSSH before running tests (#21544)
Currently we install OpenSSH on the fly in fake multinode docker testing. Instead, we can speed up testing a fair bit by first building a Docker image which includes OpenSSH and then running tests with this image.
2022-01-13 08:47:04 -08:00
Gagandeep Singh
d392f97331
Unskipped tests in serve: test_controller_recovery.py (#21450) 2022-01-13 01:09:59 -08:00
Yi Cheng
6194783312
[gcs] turn on grpc pubsub by default (#21513)
Turn on grpc pubsub by default. This PR also fixes several tests which failed before.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-12 22:13:03 -08:00
Gagandeep Singh
13f20e5e1e
[Serve] Unskipped tests in test_pipeline.py (#21484) 2022-01-12 13:56:50 -08:00
Kai Fricke
c1d4c22351
[ci/multinode] Follow-up fix for resource popping (#21543)
The previous change in #21531 unfortunately added the fix at the wrong location. This PR corrects this.
2022-01-11 17:47:42 -08:00
Antoni Baum
6ba4513777
[tune] Load experiment into searcher (#21506)
This PR adds a new method to the Searcher class, add_evaluated_trials. This method wraps around add_evaluated_point and allows the user to pass a Trial, list of Trials or ExperimentAnalysis to load into the searcher. Furthermore, this PR updates the HEBO version to the latest and removes outdated documentation, and adds add_evaluated_point methods to Dragonfly and SkOpt searchers.
2022-01-11 15:58:20 -08:00
Matti Picus
ec6a33b736
[tune] fixes to allow tune/tests/test_commands.py to run on windows (#21342)
tune does not run smoothly on Windows. This cleans up some blockers:
- use cross-platform shutil.get_terminal_size instead of Popen(stty)
- somehow Trainer.workers is None at the end of test_commands.py, so the cleanup command was erroring. The error was not fatal, but was printing in the logs.
- if run locally, the log files are all written to the same location, so the rsync-based syncing solution is not needed. This is the real fix for issue #20747
2022-01-11 15:57:20 -08:00
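The first item in #21342 above refers to the standard library's `shutil.get_terminal_size`, which works on Windows as well as on POSIX systems. A minimal usage sketch; the fallback value of 80x24 is an arbitrary choice for this example:

```python
import shutil

# The fallback is used when the size cannot be determined,
# for example when the output is piped instead of attached to a terminal.
size = shutil.get_terminal_size(fallback=(80, 24))
print(f"{size.columns} columns x {size.lines} lines")
```

Unlike spawning `stty`, this needs no external process and does not depend on the command being installed.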
mwtian
45cddef2d3
[GCS] disable tests related to GCS restarting in GCS pubsub mode (#21534)
`test_failure_2.py::test_gcs_server_failiure_report` and `test_gcs_fault_tolerance.py::test_gcs_server_restart_during_actor_creation` cannot pass in GCS pubsub mode with the existing logic. Disable these tests in GCS pubsub mode and add comments about how we may fix them.

Also, suppress exceptions when sync subscribers are disconnected from GCS.

I can push changes in this PR to #21513 as well.
2022-01-11 14:14:05 -08:00
Kai Fricke
084bda87a5
[ci/multinode] Fix resource popping resulting in empty resource head nodes (#21531)
Fixes a small bug where we pop from the resources dict without making a copy. This mutates the original dict and sometimes leaves the head node with empty resources.
2022-01-11 13:20:58 -08:00
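The class of bug fixed in #21531 above is easy to reproduce. In this sketch the function names and the shape of the resources dict are hypothetical; only the pattern (popping from a dict that the caller still uses) comes from the commit message.

```python
def build_node_config_buggy(resources: dict) -> dict:
    # pop() mutates the caller's dict: "CPU" disappears for everyone.
    num_cpus = resources.pop("CPU", 0)
    return {"num_cpus": num_cpus, "custom_resources": resources}


def build_node_config_fixed(resources: dict) -> dict:
    # Work on a shallow copy so the caller's dict stays intact.
    resources = dict(resources)
    num_cpus = resources.pop("CPU", 0)
    return {"num_cpus": num_cpus, "custom_resources": resources}


head_resources = {"CPU": 4, "GPU": 1}
build_node_config_buggy(head_resources)
after_buggy = dict(head_resources)   # "CPU" has been removed

head_resources = {"CPU": 4, "GPU": 1}
config = build_node_config_fixed(head_resources)
after_fixed = dict(head_resources)   # unchanged

print(after_buggy, after_fixed, config)
```

A shallow copy is enough here because only top-level keys are removed; nested values are not modified.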
Yi Cheng
d2d749b6f9
[workflow] Fix test_serialization.py (#21522)
The new version of responses introduces some errors in the test. This PR pins the version of responses.

It also pins moto to guard against future upstream updates.
2022-01-11 11:45:18 -08:00
mwtian
0e5de61c18
remove unnecessary test filter (#21510)
(Comment from the PR:)
If a GRPC call exceeds its timeout, the call is cancelled on the client side but the server may still reply to it, leading to missed messages and test failures. Using a sequence number to ensure no message is dropped can be the long-term solution,
but its complexity and the fact that Ray subscribers do not use deadlines in production make it less preferred.
Therefore, a simpler workaround is used instead: a different subscriber is used for each get_error_message() call.

Also, re-enable some additional tests in GCS HA mode.
2022-01-11 10:17:03 -08:00
Gagandeep Singh
d47b82883a
Unskipped non-cluster tests in test_actor_resources.py (#21500) 2022-01-11 09:46:03 -08:00
Gagandeep Singh
a5a8156198
Unskipped tests in test_actor_failures (#21498) 2022-01-11 09:42:12 -08:00
Gagandeep Singh
e8df34af08
Unskipped test in test_autoscaling_policy (#21497)
The test passes on my Windows Azure VM. 

P.S. Is it related to cluster tests? I am not sure.
2022-01-11 09:40:37 -08:00
Eric Liang
9ac34ecc94
Revert "[workflow] Skip saving outputs of "workflow.wait"" (#21520)
This is breaking linux://python/ray/workflow:tests/test_wait per https://flakey-tests.ray.io/
2022-01-10 20:51:42 -08:00