This PR fixes issues with loading ExperimentAnalysis from path or pickle if the trainable used in the trials is not registered. Chiefly, it ensures that the stub attribute set in load_trials_from_experiment_checkpoint doesn't get overridden by the state of the loaded trial, and that when pickling, all trials in ExperimentAnalysis are turned into stubs if they aren't already. A test has also been added.
Support specifying a default lifetime for actors whose lifetime is not specified at creation time. This is a job-level configuration item.
#### API Change
The Python API looks like:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```
The Java API looks like:
```java
System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
Ray.init();
```
One example usage is:
```python
@ray.remote
class A:
    pass

ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
a1 = A.options(lifetime="non_detached").remote()  # a1 is a non-detached actor.
a2 = A.remote()  # a2 does not specify a lifetime, so it uses the job default and is a detached actor.
```
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
By default, sync only ~/ray_results/exp_name/trial_name/checkpoint_name instead of the whole trial directory (~/ray_results/exp_name/trial_name/).
Files like progress.csv, result.json, params.pkl, params.json, and events.out come from the driver process, so they do not need to be synced from the workers.
This could also enable us to decouple sync-up and delete: they don't have to wait for each other to finish.
Currently, the `unzip_package` function relies on `extract_file_and_remove_top_level_dir` to unzip and remove the top-level directory from archive working directories. However, `extract_file_and_remove_top_level_dir` uses `os.rename()` to remove the tld by manually unzipping each file from a zip file and moving it to the tld's parent. When the tld contains directories or files with the same name as the tld, `os.rename()` fails to move these files to the tld's parent because of the name collision between the file and the tld.
This change replaces `extract_file_and_remove_top_level_dir` with `remove_dir_from_filepaths`. Now, `unzip_package` unzips the entire zip file before `remove_dir_from_filepaths` moves all the tld's children to the tld's parent using `os.rename()`.
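A minimal sketch of the new approach, assuming the archive has already been fully unzipped under `base_dir` with a single top-level directory `tld`; the staging-directory trick is illustrative and not necessarily the exact Ray implementation:

```python
import os

def remove_dir_from_filepaths(base_dir: str, tld: str) -> None:
    # Rename the top-level directory out of the way first, so that a child
    # named exactly like the tld cannot collide with it during the moves.
    staging = os.path.join(base_dir, ".remove_tld_staging")
    os.rename(os.path.join(base_dir, tld), staging)
    # Move every child of the former tld up into base_dir.
    for child in os.listdir(staging):
        os.rename(os.path.join(staging, child), os.path.join(base_dir, child))
    # The staging directory is now empty and can be removed.
    os.rmdir(staging)
```

With this ordering, a working directory like `pkg/pkg/file.txt` first becomes `.remove_tld_staging/pkg/file.txt` and then `pkg/file.txt`, avoiding the name collision described above.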
This edge case is tested in the new unit test `test_unzip_with_matching_subdirectory_names`. Additionally, `extract_file_and_remove_top_level_dir`'s unit test is replaced with `TestRemoveDirFromFilepaths`, which tests the new `remove_dir_from_filepaths` function.
`test_traceback.py` was taking ~55s to finish recently, and since today it has started to time out at 60s more frequently. All test cases do succeed, so increase its test timeout for now. We will look into whether there is any performance regression separately.
This is the second-to-last PR to improve the `ActorDiedError` exception.
It propagates actor death cause metadata to the Ray error object, so that we can raise a better actor-died error exception.
After this PR is merged, I will add more metadata to each error message and write documentation explaining when each error happens.
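A minimal illustration of where the richer error surfaces, assuming an actor whose process dies mid-call; `RayActorError` is the existing exception type raised today, and the exact message depends on the death cause metadata this PR propagates:

```python
import os
import ray
from ray.exceptions import RayActorError

@ray.remote
class Worker:
    def crash(self):
        # Simulate an unexpected actor process death.
        os._exit(1)

ray.init()
w = Worker.remote()
try:
    ray.get(w.crash.remote())
except RayActorError as e:
    # With this PR, the message can explain why the actor died.
    print(e)
```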
TODO
- [x] Fix test failures
- [x] Add unit tests
- [x] Fix Java/C++ cases
Follow-up PRs
- Disallow nullptr for `RayErrorInfo` input.
GCS pubsub uses long polling, so the subscriber waits instead of returning None when there is no buffered log. It needs a different heuristic to decide whether the driver is not keeping up with logs from the worker.
Making some minor fixes.
1. Update input `batch_size` to be the global batch size. Introduce `worker_batch_size` so each iteration trains the same global batch size (see the sketch after this list).
2. Update dataset `size` calculation to only refer to the fraction of the data that is trained on each worker. This allows calculations (e.g. training progress, accuracy) to be correct.
3. Add `model.train()` for generality.
4. Remove `smoke-test` flag since it's not really being used.
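A small sketch of the global-to-worker batch size relationship from item 1, assuming the global batch size divides evenly across workers; the function name is illustrative, not the actual training code:

```python
def compute_worker_batch_size(batch_size: int, num_workers: int) -> int:
    # `batch_size` is the global batch size; each worker processes an equal
    # shard so that one iteration across all workers consumes exactly
    # `batch_size` samples.
    assert batch_size % num_workers == 0, "global batch size must divide evenly"
    return batch_size // num_workers

# e.g. batch_size=128 with 4 workers -> each worker trains on 32 samples per step.
```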
- Tolerate gRPC deadline exceeded and transient failures in Python GCS AIO subscribers, making them consistent with Python GCS synchronous subscribers (see the sketch after this list).
- Tolerate any exception in the dashboard when subscribing to logs and error info, consistent with how the dashboard handles gRPC errors when obtaining node stats.
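A hedged sketch of the tolerance pattern from the first bullet, assuming a `grpc.aio` stub with a long-poll RPC; the method and variable names are illustrative, not Ray's actual subscriber code:

```python
import grpc

async def poll_once(stub, request):
    try:
        return await stub.GcsSubscriberPoll(request, timeout=30)
    except grpc.RpcError as e:
        if e.code() in (grpc.StatusCode.DEADLINE_EXCEEDED,
                        grpc.StatusCode.UNAVAILABLE):
            # Deadline exceeded / transient failure: treat it as an empty
            # poll and let the caller retry instead of raising.
            return None
        raise
```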
When a node is dead, the reference table should remove the locations of objects on that node. Otherwise, locality-aware scheduling will schedule tasks to the dead node.
This is a minimum viable product for Ray Autoscaler integration with Kuberay. It is not ready for prime time/general use, but should be enough for interested parties to get started (see the documentation in kuberay.md).
When GCS runs as an individual component, its crashes can cause other components to fail.
Here are the two main cases covered in this patch:
1. monitor.py raises an exception when disconnected from GCS.
2. When GCS becomes available later than other components, the missing GCS address KV entry can cause other components to fail to start.
This patch fixes these two issues and also increases the Redis connection timeout, which was too small.
Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
Prior to this PR, sort, groupby, and aggregate defined separate types for extracting values from Dataset records. This was confusing, since the user had to understand the differences between the key types (which were basically identical).
This PR defines a common key type, `KeyFn`, which is simply `Union[None, str, Callable[[T], Any]]`. It is used as `sort(KeyFn...)`, `aggregate(Agg(KeyFn)...)`, and `groupby(KeyFn).agg(Agg(KeyFn), ...)`.
It also unifies the error generation paths into a common `_validate_key_fn` utility. This improves the errors generated when passing explicit `AggregateFn` classes, which previously failed in the workers if invalid.
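A brief illustration of the unified key type in use, assuming a small pandas-backed Dataset and the `Mean` aggregate from `ray.data.aggregate`; this is a sketch of the API shape, not code from the PR:

```python
import pandas as pd
import ray
from ray.data.aggregate import Mean

ds = ray.data.from_pandas(pd.DataFrame({"group": [0, 1] * 5, "value": range(10)}))

sorted_ds = ds.sort("value")                            # KeyFn given as a column name
grouped = ds.groupby("group").aggregate(Mean("value"))  # the same KeyFn type keys groupby and the aggregate
```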
See #21458. Currently, Tune keeps its own list of alive node IPs, but this information is only updated every 10 seconds and is usually stale when a new node is added. Because of this, the first trial scheduled on this node is usually marked as failed. This PR adds a test confirming this behavior and gets rid of the unneeded code path.
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
In test_client_reconnect.py, each test case starts a Ray cluster via the client server's default_connect_handler(). The Ray cluster shuts down implicitly when start_middleman_server() ends and Python GCs the client server. After turning on GCS pubsub, the time when the client server is GC'ed changes. Sometimes the Ray cluster from a previous test case stays alive after the next test case starts and shuts down later, leading to test failures due to lost data or crashes (a race during worker shutdown, to be investigated separately).
This PR makes sure each test case shuts down its Ray cluster.
This PR fixes and re-enables the following tests in HA mode:
- //python/ray/tests:test_healthcheck
- //python/ray/tests:test_autoscaler_drain_node_api
- //python/ray/tests:test_ray_debugger
When cleaning up the function table, we use a key prefix to delete the data. But right now the prefix contains binary data, which doesn't work well with Redis key scans that use `*` in the match pattern.
For example, when the job id increases to 41, the cleanup deletes the keys for job 1, which leads to the new worker failing to import the function.
This PR uses the hex representation of the job id to avoid this.
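An illustrative sketch of the idea; the key prefix below is hypothetical and not the actual GCS key layout. Hex-encoding keeps the scan pattern free of raw binary bytes (which may happen to contain glob metacharacters such as `*`, `?`, or `[`), so one job's pattern can no longer match another job's keys:

```python
def function_table_prefix(job_id: bytes) -> str:
    # Hypothetical key prefix built from the hex form of the binary job id.
    return f"RemoteFunction:{job_id.hex()}:"

# Scanning with f"{function_table_prefix(job_id)}*" now only matches keys that
# literally start with this job's hex id.
```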
This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`.
Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, and `aggregate_combined_blocks` in `PandasBlockAccessor` redirect to the Arrow block format implementation for now. They'll be implemented in a later PR.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Fix a dlmalloc allocation bug; details in #21310. The change adds a C++ unit test and uses `_check_spilled_mb` in the tests.
If a task is re-executed on failure, it will deterministically generate the same IDs for any ray.put or .remote task calls because it uses its own task ID as a seed. This can cause problems if those objects conflict with previous versions that still exist in the cluster.
This PR adds the execution attempt number to the current task ID seed. This avoids collisions with any ObjectIDs generated by the previous execution attempt of the task.
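A conceptual sketch of the seeding change; the real logic lives in the C++ core, and the names and hash choice here are illustrative only:

```python
import hashlib

def deterministic_object_id(task_id: bytes, attempt_number: int, put_index: int) -> bytes:
    # Same task and same put/return index, but a different attempt number,
    # yields a different id, so a retry cannot collide with objects created
    # by the previous execution attempt.
    seed = task_id + attempt_number.to_bytes(4, "little") + put_index.to_bytes(4, "little")
    return hashlib.sha256(seed).digest()[:20]
```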
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
In Ray, functions are exported to the function table at runtime, but they are not cleaned up after use. This PR garbage-collects these entries when no job or detached actor references them.
Ideally, we should move the function table import/export feature to the core, so a GCS function manager is introduced; currently it is used for reference counting only.
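A conceptual sketch of the reference-counting behavior described above; this is not the actual GCS function manager, and the class, method, and key-prefix names are illustrative:

```python
from collections import defaultdict

class GcsFunctionManager:
    def __init__(self, kv_store):
        self._kv = kv_store                 # assumed internal KV interface
        self._refs = defaultdict(int)       # job_id -> live jobs / detached actors

    def add_job_reference(self, job_id: str) -> None:
        self._refs[job_id] += 1

    def remove_job_reference(self, job_id: str) -> None:
        self._refs[job_id] -= 1
        if self._refs[job_id] <= 0:
            # Nothing references this job's exported functions any more,
            # so its function-table entries can be garbage collected.
            self._kv.delete_prefix(f"RemoteFunction:{job_id}:")
            del self._refs[job_id]
```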