hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Dmitri Gekhtman	b2b442297e	[autoscaler] Fix initialization artifacts (#22570 ) This PR fixes initialization artifacts related to the load metric summary and autoscaler summary. Load metrics summaries are defined to be Falsey if the autoscaler has never received a resource message from the GCS. We skip most autoscaler actions if load metrics is Falsey, because it doesn't make sense to autoscale without load metrics. This also allows us to execute the TODO here: #22348 (comment) and remove the time.wait(). As for the autoscaler summary, it is possible for autoscaler.summary() to error outside of an autoscaler update in this scenario: The very first call to NodeProvider.non_terminated_nodes fails, self.non_terminated_nodes remains a None object, and autoscaler.summary() fails trying to get an attribute of this None object. The result is a confusing error message, as in #22515. This PR fixes that. Closes #22515	2022-02-24 20:05:44 -08:00
Simon Mo	bfb619a127	[xlang] Allow Python to call overloaded methods with differing number of parameters (#21410 )	2022-02-24 16:51:38 -08:00
Archit Kulkarni	1165f99b0b	[CI] disable Serve microbenchmark k8s (#22631 )	2022-02-24 16:50:06 -08:00
Yi Cheng	de76d86bcb	[nightly] Stop GCS HA related nightly test (#22636 ) Since we've already turned it on on master, we should stop these tests for now.	2022-02-24 16:40:08 -08:00
ZhuSenlin	5efeb6534b	[Core] Bug fix about FixedPoint (#22584 ) * Fix FixedPoint::operator-(double const d) * add unit test * remove FixedPoint(uint32_t i) Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>	2022-02-24 15:44:21 -08:00
Jiao	3c707f70cc	[2/X][Pipeline] Add python generation for ClassNode (#22617 ) - Added backbone of ray dag -> serve dag transformation and deployment extraction. - Added util functions for deployment unique name generation .. ray_actor_options, replacement of DeploymentNode with deployment handle, etc.	2022-02-24 16:01:35 -06:00
Jun Gong	a385c9b127	[RLlib] Update bandit_envs_recommender_system (#22421 )	2022-02-24 22:43:41 +01:00
Simon Mo	3d3218d153	[CI] Add K8s Builder Step (#22035 )	2022-02-24 13:11:38 -08:00
Sven Mika	526fd6b5fb	[RLlib] Issue 22444: KL-coeff not stored in persistent policy state. (#22590 )	2022-02-24 22:05:36 +01:00
Siyuan (Ryans) Zhuang	8f4f3cb79b	Make shellcheck optional	2022-02-24 12:04:05 -08:00
Eric Liang	533a0440a6	Improve actor pool support in Datasets (#22574 )	2022-02-24 12:01:36 -08:00
Amog Kamsetty	02cb974c6c	[Train] Fix fault tolerance for Tensorflow (#22508 ) Soft restarts don't work for tensorflow since there is still some leftover communication state in the actors which may lead to undefined behavior, such as causing training to hang. Instead, this PR changes the failure handling for tensorflow to match torch and horovod, and recreates all the workers in case of failure. Also adds a test to check if fault tolerance works correctly for an actual tensorflow example. When testing locally, the test failed before the change, but passes after.	2022-02-24 11:50:20 -08:00
Chen Shen	03f3bc302c	[Scheduler] Fix string id map bug (#22586 ) * preserve * fix bug	2022-02-24 09:55:21 -08:00
Siyuan (Ryans) Zhuang	ec23050df6	Error if shellcheck is not installed (#22556 )	2022-02-24 09:53:03 -08:00
Simon Mo	b8c28d1f2b	[Tune] Make sure tune.run can run inside worker thread (#22566 )	2022-02-24 09:50:42 -08:00
Jun Gong	99b7be5e22	[rllib] Fix impala long running test (#22619 ) fix impala long running test. Bandits is the first agent that requires torch import at registration time.	2022-02-24 09:03:55 -08:00
shrekris-anyscale	a9ede4e499	[serve] Add REST API (#22578 ) This change adds the GET, PUT, and DELETE commands for Serve’s REST API. The dashboard receives these commands and issues corresponding requests to the Serve controller.	2022-02-24 10:00:26 -06:00
Sven Mika	18c269c70e	[RLlib] Issue 22539: agent_key not deleted from 2 dicts in simple list collector. (#22587 )	2022-02-24 11:58:34 +01:00
Tao Wang	8906305ab8	[Tiny][Core] Save memory copies when getting data in GCS storage (#22582 ) When getting a batch of data from Redis, we first initialize local variables and then put them in a vector, which causes many copies from stack to heap or from local variables to the vector. This small change avoids those copies.	2022-02-24 14:15:27 +08:00
SangBin Cho	5e847f7e09	[Usage Stats] Usage stats only enabled on nightly test infra (#22591 ) This PR enables the usage stats only on the release test infrastructure (large scale tests Ray runs on a daily basis in a private infra). Note it is still disabled by default in Ray.	2022-02-23 22:11:48 -08:00
jon-chuang	11500dc12c	[docs] include ray status and ray monitor into ray command line api docs (#22614 ) Fixes: https://github.com/ray-project/ray/issues/18527	2022-02-23 20:09:45 -08:00
Amog Kamsetty	80e0d9cea4	[Train] Update docs for ray.train.torch import (#22555 ) Update more examples to include the ray.train.torch import line. Follow up to #21969	2022-02-23 19:22:27 -08:00
Liu Bao	6a9a28612c	[runtime env] Async pip runtime env (#22381 ) In order to initialize runtime env concurrently, this PR makes pip runtime env asynchronous. It includes, - [x] New `check_output_cmd` in runtime env utils. - [x] Async PipProcessor. - [x] The `asynccontextmanager` from `https://github.com/python-trio/async_generator` for Python 3.6 - [x] Remove pip runtime env lock. - [x] Disable pip cache. Co-authored-by: 刘宝 <po.lb@antfin.com>	2022-02-24 11:03:40 +08:00
Qing Wang	eb9960785b	[Core][Remove JVM FullGC 1/N] Add allocator to in-memory store. (#21250 ) Following the description in #21218 , this PR adds support for specifying a frontend-defined in-memory object allocator, so that we can specify an allocator that allocates the buffers from the JVM heap. This is the basic functionality for the next PR, #21441, in which the JVM becomes aware of the memory pressure of the in-memory store objects. Note that using a frontend-defined allocator may break the zero-copy ability. In Java, JVM buffers are on the heap and we should copy them to native memory if needed. Co-authored-by: Qing Wang <jovany.wq@antgroup.com>	2022-02-24 10:53:59 +08:00
Eric Liang	e15a419028	Enable stage fusion by default for dataset pipelines (#22476 ) This PR enables stage fusion for dataset pipelines. This also requires: 1. Removing the num_cpus=0.5 default for the read stage, to enable fusion of the read stage. 2. Removing spread_resource_prefix (not supported for now).	2022-02-23 17:34:05 -08:00
Eric Liang	a62a9c38fb	Fix [Bug] Splitting Dataset when shards < n hangs or errors (#22559 )	2022-02-23 15:54:25 -08:00
Edward Oakes	5a21289a34	[runtime_env] Remove get_current_runtime_env from docs (#22594 ) We should just encourage people to use the existing `get_runtime_context` API instead of introducing a new one here. Just removing the docs for now while we discuss this.	2022-02-23 16:53:52 -06:00
Archit Kulkarni	87f7bfe4cd	[doc] [job submission] Add k8s instructions and a comment about ports (#22598 )	2022-02-23 16:32:37 -06:00
Eric Liang	fc75d17701	Fix [Bug] DatasetPipeline .iter_epochs() can lead to infinite loops (#22572 )	2022-02-23 13:35:31 -08:00
Siyuan (Ryans) Zhuang	f6f0fea102	Symlink workflow for development (#22554 )	2022-02-23 13:14:05 -08:00
Siyuan (Ryans) Zhuang	2e0186a5b6	[workflow] Checkpoint API (#19406 ) checkpoint API * ensure commit_step only do checkpointing	2022-02-23 13:09:08 -08:00
Chris K. W	3371e78d2e	[client] Chunk PutRequests (#22327 ) Why are these changes needed? Data from PutRequests is chunked into 64MiB messages over the datastream, to avoid the 2GiB message size limit from gRPC. This will allow users to transfer objects larger than 2GiB over the network. Proto changes: Put requests now have fields for chunk_id to identify which chunk data belongs to, total_chunks to identify the total number of chunks in the object, and total_size for total size in bytes of the object (useful for raising warnings). PutObject is still unary-unary. The dataservicer handles reassembling the chunks before passing the result to the underlying RayletServicer. Dataclient changes: If a put request is inserted into the request queue, self._requests will chunk it lazily. Doing this lazily is important since inserting all of the chunks onto the request queue immediately would double the amount of memory needed to handle a large request. This also guarantees that the chunks of a given putrequest will be contiguous. Dataservicer changes: The dataservicer now maintains some state to track received chunks. Once all chunks for a putrequest are received, the combined chunks are passed to the raylet servicer.	2022-02-23 18:21:25 +02:00
Jiao	a20748f83a	[1/X][Pipeline] Add deployment nodes (#22549 ) Ray DAG Changes: - Restructured dag_node.py and resolved its circular imports. - Moved `__str__` to each DAGNode subclass level with centralized utils imports - Removed restrictions on binding `InputNode` to `FunctionNode` and `ClassMethodNode` - Moved `_contain_input_node` to only `ClassNode` and `DeploymentNode` Serve DAG Changes: - Added DeploymentNode - Cannot be directly constructed - Holds deployment func or class body as well as a handle that trivially maps to the `__call__` method (matches current behavior) - Upon accessing an attribute, it will spawn a DeploymentMethodNode with `other_args_to_resolve` passed in to differentiate sync handle type and others - Added DeploymentMethodNode - Holds arg and deployment handle - Executing it translates to a deployment handle call on the method.	2022-02-23 09:56:24 -06:00
Sven Mika	8e00537b65	[RLlib] SlateQ: framework=tf fixes and SlateQ documentation update (#22543 )	2022-02-23 13:03:45 +01:00
Qing Wang	bf5693e0b1	[Java] Remove GetGcsClient (#22542 ) This PR removes GetGcsClient from core worker and gets necessary data in Java worker.	2022-02-23 03:41:32 -08:00
Qing Wang	96924ecfc0	[Java] Add javax.activation dependency for java worker. (#22538 ) This PR adds `javax.activation` as a Java worker dependency to address the issue that some users need `JAXB` on >= JDK9.	2022-02-23 16:24:47 +08:00
Lingxuan Zuo	46cb246d75	[Symbols] Export OpenCensus for streaming outside (#22526 ) OpenCensus symbols have been exported in the Linux version of libcore_worker_library_java.so, but were deleted from ray_exported_symbols.lds, which made the streaming macOS test case fail. This PR adds the export record and renames raystreaming* to rayinternal*, which is a unified entry to ray cpp. Co-authored-by: 林濯 <lingxuan.zlx@antgroup.com>	2022-02-23 16:24:16 +08:00
Xuehai Pan	018ebbf4cb	[RLlib] Issue #21671 : Handle callbacks and model metrics for `TorchPolicy` while using multi-GPU optimizers (#21697 )	2022-02-23 08:30:38 +01:00
Jiajun Yao	82443aec63	Remove DEFAULT_SCHEDULING_STRATEGY and SPREAD_SCHEDULING_STRATEGY (#22558 )	2022-02-22 21:34:21 -08:00
Stephanie Wang	abf2a70a29	[core] Add task and object reconstruction status to ray memory (#22317 ) Improve observability for general objects and lineage reconstruction by adding a "Status" field to `ray memory`. The value of the field can be: ``` // The task is waiting for its dependencies to be created. WAITING_FOR_DEPENDENCIES = 1; // All dependencies have been created and the task is scheduled to execute. SCHEDULED = 2; // The task finished successfully. FINISHED = 3; ``` In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number. Example output: ``` IP Address \| PID \| Type \| Call Site \| Status \| Size \| Reference Type \| Object Ref 192.168.4.22 \| 279475 \| Driver \| (task call) ... \| Attempt #2: FINISHED \| 10000254.0 B \| LOCAL_REFERENCE \| c2668a65bda616c1ffffffffffffffffffffffff0100000001000000 ```	2022-02-22 21:26:21 -08:00
Eric Liang	9261428004	Drop level of spammy log message (#22576 )	2022-02-22 21:23:34 -08:00
shrekris-anyscale	40fa56f40c	[serve] Add JSON schemas for REST API (#22547 )	2022-02-22 21:36:42 -06:00
mwtian	9a157dfe82	[GCS-Ray] update doc and error message for GCS-Ray (#22528 ) Update documentation to reflect that Ray no longer starts Redis by default.	2022-02-22 17:56:30 -08:00
Eric Liang	12dcec8b38	Fix [Datasets] iter_epochs not iterating using native format	2022-02-22 15:47:16 -08:00
SangBin Cho	36a31cb6fd	[Usage Stats] Implement usage stats report "Turned off by default". (#22249 ) This is the second PR to implement usage stats on Ray. Please refer to the file usage_lib.py for more details. The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj. This adds a dashboard module to enable usage stats. The usage stats report is turned off by default after this PR. We can control the report (enablement, report period, and URL; note that the URL is strictly for testing) using env variables. ## NOTE This requires us to add `requests` to the default library. `requests` must be okay to be included because 1. it is extremely lightweight. It is implemented only with built-in libs. 2. It is really stable. The project basically claims they are "deprecated", meaning no new features will be added there. cc @edoakes @richardliaw for the approval For the HTTP request, I alternatively considered httpx, but it was not as lightweight as `requests`. So I decided to implement async requests using the thread pool.	2022-02-22 15:32:02 -08:00
Antoni Baum	a1230b9291	[tune] Note `TPESampler` performance issues in docs (#22545 )	2022-02-22 15:29:12 -08:00
Edward Oakes	58e5f0140d	[jobs] Rename JobData -> JobInfo (#22499 ) `JobData` could be confused with the actual output data of a job, `JobInfo` makes it more clear that this is status information + metadata.	2022-02-22 16:18:16 -06:00
Yi Cheng	e3051ebf67	[ci] Fix grpcio 1.44 break test_output (#22494 ) This PR limits grpc to <= 1.42, which fixes test_output.	2022-02-22 13:59:25 -08:00
Dmitri Gekhtman	a402e956a4	[KubeRay] Format autoscaling config based on RayCluster CR (#22348 ) Closes #21655. At the start of each autoscaler iteration, we read the Ray Cluster CR from K8s and use it to extract the autoscaling config.	2022-02-22 11:06:37 -08:00
Antoni Baum	4a15c6f8f3	[tune] Preparation for deadline schedulers (#22006 )	2022-02-22 11:05:28 -08:00
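
The lazy chunking described in the "[client] Chunk PutRequests" commit (3371e78d2e) can be illustrated with a small sketch. This is not the Ray client code: `PutChunk`, `chunk_put`, `ChunkCollector`, and the `req_id` field are hypothetical names chosen for illustration, and only `chunk_id`, `total_chunks`, `total_size`, and the 64MiB chunk size come from the commit message. The generator yields one chunk at a time, so a large payload is never duplicated in a queue all at once, and the collector assumes chunks of a request arrive contiguously and in order, as the commit states.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional

CHUNK_SIZE = 64 * 2**20  # 64 MiB, the chunk size named in the commit message


@dataclass
class PutChunk:
    """Hypothetical stand-in for a chunked put request message."""
    req_id: int        # assumed request identifier (not from the commit)
    chunk_id: int      # which chunk this data belongs to
    total_chunks: int  # total number of chunks in the object
    total_size: int    # total size of the object in bytes
    data: bytes


def chunk_put(req_id: int, data: bytes,
              chunk_size: int = CHUNK_SIZE) -> Iterator[PutChunk]:
    """Lazily yield chunks; only one chunk copy exists at a time."""
    total_size = len(data)
    # Ceiling division; an empty payload still produces one (empty) chunk.
    total_chunks = max(1, -(-total_size // chunk_size))
    view = memoryview(data)  # slicing a memoryview does not copy
    for chunk_id in range(total_chunks):
        start = chunk_id * chunk_size
        yield PutChunk(req_id, chunk_id, total_chunks, total_size,
                       bytes(view[start:start + chunk_size]))


class ChunkCollector:
    """Tracks received chunks and returns the payload once complete."""

    def __init__(self) -> None:
        self._parts: Dict[int, List[bytes]] = {}

    def add(self, chunk: PutChunk) -> Optional[bytes]:
        parts = self._parts.setdefault(chunk.req_id, [])
        if chunk.chunk_id != len(parts):
            raise ValueError("chunks must arrive in order")
        parts.append(chunk.data)
        if len(parts) < chunk.total_chunks:
            return None  # still waiting for more chunks
        del self._parts[chunk.req_id]
        return b"".join(parts)
```

A receiver would call `ChunkCollector.add` for each incoming chunk and forward the payload only when it returns a non-`None` value, which mirrors the commit's description of passing the combined chunks on once all are received.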

11418 commits