hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Stephanie Wang	1323e1753d	[core] When reconstruction is enabled, pin objects created by ray.put() (#8021 ) * Unit test and pin ray.put objects until they have no more lineage references * c++ tests * lint * Mark ray.put objects as pinned	2020-04-20 13:09:54 -07:00
Eric Liang	17e3c545d9	[rllib] Fix truncate episodes mode in central critic example (#8073 )	2020-04-20 12:58:01 -07:00
Sven Mika	3812bfedda	[RLlib] PyTorch version of ES (Evolution Strategies). (#8104 ) PyTorch version of Evolution Strategies (ES) Algo.	2020-04-20 21:47:28 +02:00
Richard Liaw	9f3e9e7e9f	[tune] Add more intensive tests (#7667 ) * make_heavier_tests * help	2020-04-20 11:14:44 -07:00
Edward Oakes	793e616a2d	Fix job table parsing (#8070 )	2020-04-20 12:56:43 -05:00
Bill Chambers	77655749fb	[RayServe] RayServe Introduction and Overview (#8038 )	2020-04-20 12:05:59 -05:00
Sven Mika	d6cb7d865e	[RLlib] Torch DQN (APEX) TD-Error/prio. replay fixes. (#8082 ) PyTorch APEX_DQN with Prioritized Replay enabled would not work properly due to the td_error not being retrievable by the AsyncReplayOptimizer.	2020-04-20 10:03:25 +02:00
mehrdadn	c8b9a357f2	Try to fix dependency issue (#8065 ) Co-authored-by: Mehrdad <noreply@github.com>	2020-04-19 16:09:29 -07:00
ZhuSenlin	3f28a8a229	[GCS] reply to the owner only after the actor has been successfully created. (#8079 ) * reply to the owner only after the actor is successfully created. * reply immediately if the actor is already created * fix comment * add test_actor_creation_task provided by @Stephanie Wang Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>	2020-04-19 09:53:02 -07:00
Edward Oakes	da296bf8c5	[serve] Router fault tolerance (#8008 )	2020-04-19 11:04:06 -05:00
Sven Mika	165a86f1ab	[RLlib] SAC MuJoCo instability issues (tf and torch versions). (#8063 ) SAC (both torch and tf versions) are showing issues (crashes) due to numeric instabilities in the SquashedGaussian distribution (sampling + logp after extreme NN outputs). This PR fixes these. Stable MuJoCo learning (HalfCheetah) has been confirmed on both tf and torch versions. A Distribution stability test (using extreme NN outputs) has been added for SquashedGaussian (can be used for any other type of distribution as well).	2020-04-19 10:20:23 +02:00
Sumanth Ratna	bdb03a0544	[tune] Update dragonfly installation instructions (#8086 ) Closes #8084	2020-04-18 20:25:38 -07:00
Dean Wampler	5d2885c609	Minor Ray API doc refinements (#8060 ) * Added small section on installation when using Anaconda. Also fixed an obsolete link to Anaconda. * Delete more temporary directories when running the doc "make clean". * Fine-tuning the core Ray API documentation * Fix doc lines that were too long Co-authored-by: Dean Wampler <dean@concurrentthought.com>	2020-04-18 15:19:35 -07:00
Eric Liang	d92c5f1a9e	[rllib] Add init file for exec module	2020-04-17 17:24:28 -07:00
Richard Liaw	857e4dba2f	[sgd] HuggingFace GLUE Fine-tuning Example (#7792 ) * Init fp16 * fp16 and schedulers * scheduler linking and fp16 * to fp16 * loss scaling and documentation * more documentation * add tests, refactor config * moredocs * more docs * fix logo, add test mode, add fp16 flag * fix tests * fix scheduler * fix apex * improve safety * fix tests * fix tests * remove pin memory default * rm * fix * Update doc/examples/doc_code/raysgd_torch_signatures.py * fix * migrate changes from other PR * ok thanks * pass * signatures * lint' * Update python/ray/experimental/sgd/pytorch/utils.py * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * should address most comments * comments * fix this ci * first_pass * add overrides * override * fixing up operators * format * sgd * constants * rm * revert * save * failures * fixes * trainer * run test * operator * code * op * ok done * operator * sgd test fixes * ok * trainer * format * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * Update doc/source/raysgd/raysgd_pytorch.rst * docstring * dcgan * doc * commits * nit * testing * revert * Start renaming pytorch to torch * Rename PyTorchTrainer to TorchTrainer * Rename PyTorch runners to Torch runners * Finish renaming API * Rename to torch in tests * Finish renaming docs + tests * Run format + fix DeprecationWarning * fix * move tests up * benchmarks * rename * remove some args * better metrics output * fix up the benchmark * benchmark-yaml * horovod-benchmark * benchmarks * Remove benchmark code for cleanups * benchmark-code * nits * benchmark yamls * benchmark yaml * ok * ok * ok * benchmark * nit * finish_bench * makedatacreator * relax * metrics * autosetsampler * profile * movements * OK * smoothen * fix * nitdocs * loss * envflag * comments * nit * format * visible * images * move_images * fix * rernder * rrender * rest * multgpu * fix * nit * finish * extrra * setup * experimental * as_trainable * fix * ok * format * create_torch_pbt * setup_pbt * ok * format * ok * format * docs * ok * Draft head-is-worker * Fix missing concurrency between local and remote workers * Fix tqdm to work with head-is-worker * Cleanup * Implement state_dict and load_state_dict * Reserve resources on the head node for the local worker * Update the development cluster setup * Add spot block reservation to the development yaml * ok * Draft the fault tolerance fix * Small fixes to local-remote concurrency * Cleanup + fix typo * fixes * worker_counts * some formatting and asha * fix * okme * fixactorkill * unify * Revert the cluster mounts * Cut the handler-reporter API * Fix most tests * Rm tqdm_handler.py * Re-add tune test * Automatically force-shutdown on actor errors on shutdown * Formatting * fix_tune_test * Add timeout error verification * Rename tqdm to use_tqdm * fixtests * ok * remove_redundant * deprecated * deactivated * ok_try_this * lint * nice * done * retries * fixes * kill * retry * init_transformer * init * deployit * improve_example * trans * rename * formats * format-to-py37 * time_to_test * more_changes * ok * update_args_and_script * fp16_epoch * huggingface * training stats * distributed * Apply suggestions from code review * transformer Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Maksim Smolin <maximsmol@gmail.com>	2020-04-17 15:17:30 -07:00
Maksim Smolin	d6f4e5b3e1	[SGD] Imagenet example (basic) (#8020 ) * Checkpoint the image-models example * Update cluster definition * Fix copyright info * Use original args * Checkpoint fixes * Add README * Add some missing features * Format * Get rid of the unused Namespace class * Address comments * Link the imagenet example in docs * Cleanup * Fix lint	2020-04-17 13:33:55 -07:00
Edward Oakes	90ef585fd5	Revert "Add ability to specify worker and driver ports (#7833 )" (#8069 ) This reverts commit `9f751ff8c4`.	2020-04-17 12:32:22 -05:00
mehrdadn	8cf37726d2	Fix missing Java dependency (#8067 ) Co-authored-by: Mehrdad <noreply@github.com>	2020-04-17 10:43:02 -05:00
mehrdadn	f15618033d	Remove --no-transfer-progress as it appears to be unsupported (#8066 ) Co-authored-by: Mehrdad <noreply@github.com>	2020-04-17 16:30:37 +02:00
Sven Mika	f7e4dae852	[RLlib] DQN and SAC Atari benchmark fixes. (#7962 ) * Add Atari SAC-discrete (learning MsPacman in 40k ts up to 780 rewards). * SAC loss function test case fix.	2020-04-17 08:49:15 +02:00
Richard Liaw	a9ea139317	[sgd] Make serialization of data creation optional (#8027 ) * pytest * Update python/ray/util/sgd/torch/torch_trainer.py Co-Authored-By: Ujval Misra <misraujval@gmail.com> Co-authored-by: Ujval Misra <misraujval@gmail.com>	2020-04-16 20:27:51 -07:00
Richard Liaw	de1787e5e5	[tune] Check actor start -> test_cluster (#8056 ) * test * info * ok * hard_stop * codefix	2020-04-16 20:00:45 -07:00
Mitchell Stern	d0c6f013c3	Fix command config portion of project schema (#8057 )	2020-04-16 18:08:17 -07:00
Richard Liaw	6545534805	[tune/sgd] DCGAN example self-contained, turn example into modu… (#8012 ) * ok * done * run_benchmarks * should_make_examples_usable	2020-04-16 17:55:27 -07:00
Eric Liang	0c80efa2a3	[rllib] Disable explicit free, which is no longer needed and causes memory leaks	2020-04-16 16:06:58 -07:00
roireshef	dbcad35022	[RLlib] Added DefaultCallbacks which replaces old callbacks dict interface (#6972 )	2020-04-16 16:06:42 -07:00
mehrdadn	35ae7f0e68	[CI] Preload Test to Skip Env Var to All Travis Job (#8061 ) Co-authored-by: Mehrdad <noreply@github.com>	2020-04-16 15:37:25 -07:00
Karthikeyan Singaravelan	f95e18dfeb	[tune/sgd] Import ABC from collections.abc instead of collectio… (#7982 ) * Import ABC from collections.abc instead of collections for Python 3 compatibility. * Fix linter errors.	2020-04-16 15:26:49 -07:00
mehrdadn	42f88ecf9d	Hotfix CI Export Tests to Skip (#8058 ) Co-authored-by: Mehrdad <noreply@github.com>	2020-04-16 15:23:00 -07:00
Richard Liaw	118d960e1c	[hotfix] Java Lint Broken (#8048 )	2020-04-16 13:58:33 -07:00
Richard Liaw	2cb3355495	[docs] Move css to right location (#8053 )	2020-04-16 13:46:50 -07:00
Eric Liang	55ce2bba10	Record num plasma errs in map (#8034 )	2020-04-16 13:16:40 -07:00
Edward Oakes	9f751ff8c4	Add ability to specify worker and driver ports (#7833 )	2020-04-16 13:49:25 -05:00
Richard Liaw	d5f517b2f5	[docs] Hotfix for missing css files. (#8051 )	2020-04-16 11:44:55 -07:00
Richard Liaw	4d8bf5635d	[hotfix] Lint formatting for new Tune optimizer ZOOpt (#8040 ) * formatting * removedill * lint	2020-04-16 09:24:30 -07:00
Clark Zinzow	d4cae5f632	[Core] Added ability to specify different IP addresses for a core worker and its raylet. (#7985 )	2020-04-16 10:32:24 -05:00
Sven Mika	d0fab84e4d	[RLlib] DDPG PyTorch version. (#7953 ) The DDPG/TD3 algorithms currently do not have a PyTorch implementation. This PR adds PyTorch support for DDPG/TD3 to RLlib. This PR: - Depends on the re-factor PR for DDPG (Functional Algorithm API). - Adds learning regression tests for the PyTorch version of DDPG and a DDPG (torch) - Updates the documentation to reflect that DDPG and TD3 now support PyTorch. * Learning Pendulum-v0 on torch version (same config as tf). Wall time a little slower (~20% than tf). * Fix GPU target model problem.	2020-04-16 10:20:01 +02:00
Xianyang Liu	e1d3f7eba6	[rllib]Add config for rllib to support set python environments (#8026 ) * support set extra python environments * wrap value with str * Apply suggestions from code review Co-Authored-By: Eric Liang <ekhliang@gmail.com> * addresses comments * fix lint errors * remove unrelated changes due to format.sh * remove unrelated changes due to format.sh Co-authored-by: Eric Liang <ekhliang@gmail.com>	2020-04-16 01:13:45 -07:00
wanxing	9345d03ffb	[Streaming] Streaming data transfer supports cross language. (#7961 ) * add init parameters for java * fix bug * cython * fix compile * fix test_direct_tranfer * comment * ChannelCreationParameter * fix comment * builder * lint and fix tests * fix single process test * fix checkstyle and lint * checkstyle * lint python Co-authored-by: wanxing <wanxing@B-458DMD6M-1753.local>	2020-04-16 15:16:48 +08:00
fangfengbin	5a7882bb44	Fix gcs_server get invalid local address (#7842 )	2020-04-16 14:58:19 +08:00
JianZhangYang	7b0518b993	[streaming] Async changes for resourcemanager part (#7955 )	2020-04-16 14:15:45 +08:00
Servon	5c274fe631	[Tune] Add ZOOpt search algorithm (#7960 ) * add zoopt * add zoopt search algo * add zoopt * fix zoopt * add zoopt requirements * fix zoopt * remove generated guides * Apply suggestions from code review Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-04-15 21:13:29 -07:00
mehrdadn	956ea7c944	Hotfix CI determine_tests_to_run (#8039 )	2020-04-15 17:00:38 -07:00
Simon Mo	7455610d5a	Serve Doc: Quickstart (#7940 )	2020-04-15 12:25:37 -07:00
mehrdadn	ba00c29b67	Factor out Travis 'install' sections for use with GitHub Actions (#7988 )	2020-04-15 08:10:22 -07:00
Sven Mika	428516056a	[RLlib] SAC Torch (incl. Atari learning) (#7984 ) * Policy-classes cleanup and torch/tf unification. - Make Policy abstract. - Add `action_dist` to call to `extra_action_out_fn` (necessary for PPO torch). - Move some methods and vars to base Policy (from TFPolicy): num_state_tensors, ACTION_PROB, ACTION_LOGP and some more. * Fix `clip_action` import from Policy (should probably be moved into utils altogether). * - Move `is_recurrent()` and `num_state_tensors()` into TFPolicy (from DynamicTFPolicy). - Add config to all Policy c'tor calls (as 3rd arg after obs and action spaces). * Add `config` to c'tor call to TFPolicy. * Add missing `config` to c'tor call to TFPolicy in marvil_policy.py. * Fix test_rollout_worker.py::MockPolicy and BadPolicy classes (Policy base class is now abstract). * Fix LINT errors in Policy classes. * Implement StatefulPolicy abstract methods in test cases: test_multi_agent_env.py. * policy.py LINT errors. * Create a simple TestPolicy to sub-class from when testing Policies (reduces code in some test cases). * policy.py - Remove abstractmethod from `apply_gradients` and `compute_gradients` (these are not required iff `learn_on_batch` implemented). - Fix docstring of `num_state_tensors`. * Make QMIX torch Policy a child of TorchPolicy (instead of Policy). * QMixPolicy add empty implementations of abstract Policy methods. * Store Policy's config in self.config in base Policy c'tor. * - Make only compute_actions in base Policy's an abstractmethod and provide pass implementation to all other methods if not defined. - Fix state_batches=None (most Policies don't have internal states). * Cartpole tf learning. * Cartpole tf AND torch learning (in ~ same ts). * Cartpole tf AND torch learning (in ~ same ts). 2 * Cartpole tf (torch syntax-broken) learning (in ~ same ts). 3 * Cartpole tf AND torch learning (in ~ same ts). 4 * Cartpole tf AND torch learning (in ~ same ts). 5 * Cartpole tf AND torch learning (in ~ same ts). 6 * Cartpole tf AND torch learning (in ~ same ts). Pendulum tf learning. * WIP. * WIP. * SAC torch learning Pendulum. * WIP. * SAC torch and tf learning Pendulum and Cartpole after cleanup. * WIP. * LINT. * LINT. * SAC: Move policy.target_model to policy.device as well. * Fixes and cleanup. * Fix data-format of tf keras Conv2d layers (broken for some tf-versions which have data_format="channels_first" as default). * Fixes and LINT. * Fixes and LINT. * Fix and LINT. * WIP. * Test fixes and LINT. * Fixes and LINT. Co-authored-by: Sven Mika <sven@Svens-MacBook-Pro.local>	2020-04-15 13:25:16 +02:00
fangfengbin	efbaf155b2	[GCS]Add publish and subscribe function of gcs table (#7909 )	2020-04-15 04:24:52 -07:00
Qing Wang	dfb0ad0d3e	[Java] Fix Java CI exit code issue (#8028 )	2020-04-15 15:28:52 +08:00
Jan Blumenkamp	8e439688fc	Torch sequence_mask now works for tensors on different devices (#7980 )	2020-04-15 07:21:51 +02:00
fangfengbin	c17404918c	[GCS]Add gcs table storage interface (#7949 )	2020-04-15 10:48:12 +08:00

4701 commits