The first migration of tests into k8s. We are taking a conservative approach (migrate slowly while keeping the existing test suites). Once things are confirmed to be stable, we will migrate faster.
This fixes the problems from the previous team column revert.
It has two additional changes:
- The alert handler now receives the team argument, which was the root cause of the breakage: https://github.com/ray-project/ray/pull/21289
- Previously, tests without a team column raised an exception; the condition is now weakened to a warning log. I will eventually change it back to raise an exception, but for a smoother transition we will log a warning instead for a short time.
Prior to this PR, sort, groupby, and aggregate each defined separate types for extracting values from Dataset records. This was confusing since the user had to understand the differences between the key types, which were essentially identical.
This PR defines a common key type, `KeyFn`, which is simply `Union[None, str, Callable[[T], Any]]`. It is used as `sort(KeyFn...)`, `aggregate(Agg(KeyFn)...)`, and `groupby(KeyFn).agg(Agg(KeyFn), ...)`.
It also unifies the error generation paths into a common `_validate_key_fn` utility. This also improves the errors generated when passing explicit `AggregateFn` classes, which previously failed in the workers if invalid.
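A minimal usage sketch of the unified key type (assuming the `ray.data.range_arrow` and `ray.data.aggregate.Mean` APIs of this era; the snippet is illustrative, not taken from the PR):

```python
import ray
from ray.data.aggregate import Mean

# KeyFn = Union[None, str, Callable[[T], Any]] is accepted everywhere.
ray.data.range(100).sort(lambda x: -x)        # callable key on a simple dataset

ds = ray.data.range_arrow(100)                # Arrow records like {"value": i}
ds.sort("value")                              # string (column name) key
ds.groupby("value").aggregate(Mean("value"))  # same key type for groupby/aggregate
```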
When a session startup times out because resources are not available, the session may still come up after that timeout. By then the control script (e2e.py) has already terminated, so the session runs until the autosuspend limit is hit, incurring unnecessary costs. Instead, we should always trigger session termination on session timeout.
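A hypothetical sketch of the intended control flow (function names are illustrative, not the actual e2e.py API):

```python
def run_release_test(start_session, wait_until_ready, run_test, terminate_session):
    session = start_session()
    try:
        # May raise a timeout error if resources never become available.
        wait_until_ready(session, timeout_seconds=1800)
        run_test(session)
    finally:
        # Always tear the session down, including on startup timeout, so a
        # late-starting session cannot keep running until autosuspend.
        terminate_session(session)
```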
See #21458. Currently, Tune keeps its own list of alive node IPs, but this information is only updated every 10 seconds and is usually stale when a new node is added. Because of this, the first trial scheduled on this node is usually marked as failed. This PR adds a test confirming this behavior and gets rid of the unneeded code path.
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
In test_client_reconnect.py, each test case starts a Ray cluster via the client server's default_connect_handler(). The Ray cluster shuts down implicitly when start_middleman_server() returns and Python GC's the client server. After turning on GCS pubsub, the time when the client server is GC'ed changes. Sometimes the Ray cluster from a previous test case stays alive after the next test case starts and only shuts down later, leading to test failures due to lost data or crashes (a race during worker shutdown, which will be investigated separately).
This PR makes sure each test case shuts down its Ray cluster.
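A minimal sketch of the idea (hypothetical fixture, not the actual test code): tear the cluster down explicitly after each test case instead of relying on when Python GC's the client server.

```python
import pytest
import ray

@pytest.fixture
def shutdown_ray_after_each_test():
    yield
    # Explicitly stop the Ray cluster started by this test case so it cannot
    # leak into, and later race with, the next test case.
    ray.shutdown()
```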
In Python or C++, we can specify the bundle index as -1 to use any available bundle in the placement group. We should also enable this in Java to keep the API consistent across all languages.
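For reference, a minimal Python sketch of the existing behavior that this PR brings to Java (assuming the `.options(placement_group=..., placement_group_bundle_index=...)` API of this era):

```python
import ray
from ray.util.placement_group import placement_group

ray.init(num_cpus=2)
pg = placement_group([{"CPU": 1}, {"CPU": 1}])
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def f():
    return "ran inside the placement group"

# bundle index -1 means "any available bundle in the placement group".
ray.get(f.options(placement_group=pg, placement_group_bundle_index=-1).remote())
```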
This PR changes the enum value `ActorLifetime.DEFAULT` to `ActorLifetime.NON_DETACHED`. `ActorLifetime` had not yet been introduced in our released versions (<= 1.9.2).
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
This PR introduces APIs for statically defining concurrency groups in Java.
We introduce two APIs:
1. The `@DefConcurrencyGroup` annotation on an actor class defines a concurrency group statically.
2. The `@UseConcurrencyGroup` annotation on an actor method declares which concurrency group the method runs in.
Examples are below:
```java
@DefConcurrencyGroup(name = "io", maxConcurrency = 2)
@DefConcurrencyGroup(name = "compute", maxConcurrency = 4)
private static class MyActor {
  @UseConcurrencyGroup(name = "io")
  public long f1() { return 1; }

  @UseConcurrencyGroup(name = "io")
  public long f2() { return 2; }

  @UseConcurrencyGroup(name = "compute")
  public long f3(int a, int b) { return a + b; }

  @UseConcurrencyGroup(name = "compute")
  public long f4() { return 4; }
}

ActorHandle<MyActor> myActor = Ray.actor(MyActor::new).remote();
myActor.task(MyActor::f1).remote();
myActor.task(MyActor::f2).remote();
myActor.task(MyActor::f3, 2, 3).remote();
myActor.task(MyActor::f4).remote();
```
`MyActor` has 3 concurrency groups: `io` with a max concurrency of 2, `compute` with a max concurrency of 4, and `default` with a max concurrency of 1.
`f1` and `f2` will be executed in `io`; `f3` and `f4` will be executed in `compute`.
This PR fixes and re-enables the following tests in HA mode:
- //python/ray/tests:test_healthcheck
- //python/ray/tests:test_autoscaler_drain_node_api
- //python/ray/tests:test_ray_debugger
Previously, ref args were handled incorrectly: the object ref itself was serialized into the args buffer passed to the user function, instead of the resolved `RayObject`.
That's because CoreWorker is the component responsible for ensuring that all ObjectReferences are resolved and serialized into `RayObject`s at the time of the `task_execution_callback` invocation, not any component downstream of the callback.
This resulted in the following error for large objects, which are not inlined into `TaskArg::value` because they are over 100KB:
```
C++ exception with description "Invalid: invalid arguments: std::bad_cast" thrown in the test body.
```
This was not caught due to lack of testing for large objects, which has now been added.
When cleaning up the function table, we use a key prefix to delete the data. But right now the prefix contains binary data, and it doesn't work well with Redis KEYS/SCAN, which use `*` in the pattern.
For example, when the job id increases to 41, cleaning it up also deletes the keys for job 1, which leads to new workers failing to import the function.
This PR uses the hex representation of the job id to avoid this.
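A hypothetical illustration of the idea (not the actual key schema): hex-encoding the job id keeps the prefix ASCII-only, so the glob pattern used for deletion cannot pick up stray metacharacters such as `*` from raw binary bytes.

```python
def function_key_prefix(job_id: bytes) -> str:
    # Hex keeps the prefix free of Redis glob metacharacters like "*", "?", "[".
    return "fun:" + job_id.hex() + ":"

pattern = function_key_prefix(b"\x29") + "*"  # e.g. "fun:29:*"
```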
Previous changes failed because of a) permission errors and b) `unzip` being unavailable on remote nodes. Instead, we now use tar gzip archives.
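A minimal sketch of the approach, assuming the packing and unpacking is done from Python; the stdlib `tarfile` module avoids depending on an `unzip` binary on the remote nodes.

```python
import tarfile

def pack(src_dir: str, archive_path: str) -> None:
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=".")

def unpack(archive_path: str, dest_dir: str) -> None:
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)
```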
This reverts commit 42bcab27e8.
This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`.
Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, and `aggregate_combined_blocks` in `PandasBlockAccessor` redirect to the arrow block format implementation for now. They'll be implemented in a later PR.
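A minimal usage sketch (assuming `ray.data.from_pandas` accepts in-memory DataFrames in this version); the `Pandas*` classes themselves are internal and not meant to be used directly:

```python
import pandas as pd
import ray

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
ds = ray.data.from_pandas([df])  # assumption: DataFrames can back Dataset blocks directly
print(ds.take(2))
```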
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Why are these changes needed?
Fix a dlmalloc allocation bug; details are in #21310.
* fix dlmalloc bug
* make lint happy
* make lint happy
* fix by comment
* use _check_spilled_mb
* add cpp UT
If a task is re-executed on failure, it will deterministically generate the same IDs for any ray.put or .remote task calls because it uses its own task ID as a seed. This can cause problems if those objects conflict with previous versions that still exist in the cluster.
This PR adds the execution attempt number to the current task ID seed. This avoids collisions with any ObjectIDs generated by the previous execution attempt of the task.
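A hypothetical Python sketch of the seeding scheme (the real implementation lives in the C++ core worker; the names and hash function here are illustrative only):

```python
import hashlib

def deterministic_object_id(task_id: bytes, put_index: int, attempt_number: int) -> bytes:
    # Mixing in the attempt number keeps IDs deterministic within one attempt,
    # but distinct from the IDs generated by the previous execution attempt.
    h = hashlib.sha1()
    h.update(task_id)
    h.update(put_index.to_bytes(4, "little"))
    h.update(attempt_number.to_bytes(4, "little"))
    return h.digest()
```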
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
In Ray, functions are exported to the function table at runtime, but they are not cleaned up after use. This PR garbage collects these entries when there is no job or detached actor referencing them.
Ideally, we should move the function table import/export feature to core. As a step toward that, a GCS function manager is introduced; currently, it is used for reference counting only.
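A hypothetical sketch of the reference-counting behavior (names are illustrative, not the actual GcsFunctionManager interface):

```python
class FunctionManagerSketch:
    def __init__(self, delete_keys_with_prefix):
        self._refs = {}  # job_id -> number of jobs/detached actors referencing it
        self._delete_keys_with_prefix = delete_keys_with_prefix

    def add_reference(self, job_id: str) -> None:
        self._refs[job_id] = self._refs.get(job_id, 0) + 1

    def remove_reference(self, job_id: str) -> None:
        self._refs[job_id] -= 1
        if self._refs[job_id] == 0:
            del self._refs[job_id]
            # No job or detached actor references this job's exported functions
            # anymore, so garbage collect them from the function table.
            self._delete_keys_with_prefix("fun:" + job_id)
```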
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic is removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.
Also, re-enable a few tests that restart GCS in GCS bootstrapping mode, by using external Redis for KV storage.