hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Zongheng Yang	5a50e80b63	Make Monitor remove dead Redis entries from exiting drivers. (#994 ) * WIP: removing OL, OI, TT on client exit; no saving yet. * ray_redis_module.cc: update header comment. * Cleanup: just the removal. * Reformat via yapf: use pep8 style instead of google. * Checkpoint addressing comments (partially) * Add 'b' marker before strings (py3 compat) * Add MonitorTest. * Use `isort` to sort imports. * Remove some loggings * Fix flake8 noqa marker runtest.py * Try to separate tests out to monitor_test.py * Rework cleanup algorithm: correct logic * Extend tests to cover multi-shard cases * Add some small comments and formatting changes.	2017-09-26 00:11:38 -07:00
gycn	a432285e77	Disable parallelization for Actors and ray.wait for debugging (#961 ) Support actors and ray.wait in PYTHON_MODE.	2017-09-17 00:12:50 -07:00
Stephanie Wang	74ac80631b	Local scheduler sends a null heartbeat to global scheduler (#962 ) * Local scheduler sends a null heartbeat to global scheduler to notify death * Add whitespace. * Speed up component failures test * Free local scheduler state upon plasma manager disconnection	2017-09-12 10:45:21 -07:00
Eric Liang	e17412a72b	fix free log std param (#964 )	2017-09-11 18:52:48 -07:00
Stephanie Wang	99c8b1f38c	Actor fault tolerance using object lineage reconstruction (#902 ) * Revert Python actor reconstruction * Actor reconstruction using object lineage * Add dummy arguments and return values for actor tasks * Pin dummy outputs for actor tasks * Skip checkpointing test for now * TODOs * minor edits * Generate dummy object dependencies in Python, not C * Fix linting. * Move actor counter and dummy objects inside of the actor handle * Refactor Worker._process_task, suppress exception propagation for sequential actor tasks	2017-09-10 19:29:28 -07:00
Eric Liang	d8aa826e63	[webui] Scalability fixes for the task timeline and visualizations (#935 ) * fixes * comments * fix test * Update ui.py * upd * Fix linting.	2017-09-10 15:47:44 -07:00
Eric Liang	1ebfe9608f	[rllib] Add downscale and frameskip options for Montezumas (#908 ) * up * update * fix * update * update * update * api break * Update run_multi_node_tests.sh * fix	2017-09-02 17:20:56 -07:00
Stephanie Wang	7496c98010	Fault tolerance race (#894 ) * Remove race between local scheduler disconnecting and global scheduler assigning a task * Fix number of workers started in component failures test * Fix race between global scheduler retrying a task assignment and monitor cleaning up task table. The global scheduler should only retry the task assignment if the local scheduler is still alive. * Clean up task_table_update callback if failure * Look up current local scheduler mapping when retrying actor task submission * Log warning if no subscribers received a task table update * Clean up database handle memory in local scheduler	2017-08-30 22:20:50 -07:00
Philipp Moritz	164a8f368e	[rllib] Rename algorithms (#890 ) * rename algorithms * fix * fix jenkins test * fix documentation * fix	2017-08-29 16:56:42 -07:00
Robert Nishihara	e1831792f8	For PPO, rename num_agents -> num_workers. (#882 )	2017-08-28 23:11:06 -07:00
Philipp Moritz	791bee343f	[rllib] Implement GAE for PPO (#849 ) * make information available for GAE * buggy version of GAE estimator * fix * add more logging and reweight losses * fix logging * fix loss * adapt advantage calculation * update gae * standardize returns * don't normalize td lambda ret * fix * don't standardize advantages * do standardization earlier * different standardization * initializer * drop into the debugger * fix tensorflow broadcasting bug * vf clipping * don't standardize tdlambdaret * different standardization * use huber loss for value function * refactor -- first half * it runs * fix * update * documentation * linting and tests * fix linting * naming * fix * linting * fix * remove prefix madness * fixes * fix * add value function example * fix linting * remove newline	2017-08-23 20:35:47 -07:00
Alexey Tumanov	fc885bd918	Adding basic support for a user-interpretable resource label (#761 ) * adding support for the user-interpretable label(UIR) * more plumbing for num_uirs further upstream; set to infty when specified on cmd line * pass default num_uirs for actors; update GlobalStateAPI * support num_uirs in ray.init() * local scheduler resource accounting: support num_uirs; prep for vectorized resource accounting * global scheduler test updated * Fix bug introduced by rebase. * Rename UIR -> CustomResource and add test. * Small changes and use constexpr instead of macros. * Linting and some renaming. * Reorder some code. * Remove cpus_in_use and fix bug. * Add another test and make a small change. * Rephrase documentation about feature stability.	2017-08-08 02:53:59 -07:00
Philipp Moritz	862e56000b	[rllib] Unify RLLib examples and add jenkins test for policy gradients (#815 ) * add jenkins test * correct handling of the number of iterations * convert policy gradient and evolution strategies script * convert DQN * fix A3C * fix * fix * fixes * remove redundant A3C example	2017-08-07 19:05:48 -07:00
Robert Nishihara	dbe3d9351c	Prototype actor checkpointing. (#814 ) * Initial testing of checkpointing functions. * Save checkpoints in Redis. * Pipe checkpoint_interval through remote decorator. * Add a test. * Small cleanups. * Submit dummy tasks when reconstructing tasks before the most recent tasks so that we don't end up reconstructing the arguments for those tasks. * Remove old checkpoints to save space. * Fix linting.	2017-08-07 17:52:39 -07:00
Robert Nishihara	d7b10a84b6	Fallback to custom serializer for very long python ints. (#821 ) * Fallback to custom serializer for very long python ints. * Fix linting. * Fix naming convention and add RETURN_NOT_OK.	2017-08-07 17:21:06 -07:00
Robert Nishihara	cb84972f6b	Recreate actors when local schedulers die. (#804 ) * Reconstruct actor state when local schedulers fail. * Simplify construction of arguments to pass into default_worker.py from local scheduler. * Remove deprecated ray.actor. * Simplify actor reconstruction method. * Fix linting. * Small fixes.	2017-08-02 18:02:52 -07:00
Robert Nishihara	52a27be364	Better logging in tests. (#790 )	2017-07-31 22:30:46 -07:00
Philipp Moritz	c3b39b4d86	Pull Plasma from Apache Arrow and remove Plasma store from Ray. (#692 ) * Rebase Ray on top of Plasma in Apache Arrow * add thirdparty building scripts * use rebased arrow * fix * fix build * fix python visibility * comment out C tests for now * fix multithreading * fix * reduce logging * fix plasma manager multithreading * make sure old and new object IDs can coexist peacefully * more rebasing * update * fixes * fix * install pyarrow * install cython * fix * install newer cmake * fix * rebase on top of latest arrow * getting runtest.py run locally (needed to comment out a test for that to work) * work on plasma tests * more fixes * fix local scheduler tests * fix global scheduler test * more fixes * fix python 3 bytes vs string * fix manager tests valgrind * fix documentation building * fix linting * fix c++ linting * fix linting * add tests back in * Install without sudo. * Set PKG_CONFIG_PATH in build.sh so that Ray can find plasma. * Install pkg-config * Link -lpthread, note that find_package(Threads) doesn't seem to work reliably. * Comment in testGPUIDs in runtest.py. * Set PKG_CONFIG_PATH when building pyarrow. * Pull apache/arrow and not pcmoritz/arrow. * Fix installation in docker image. * adapt to changes of the plasma api * Fix installation of pyarrow module. * Fix linting. * Use correct python executable to build pyarrow.	2017-07-31 21:04:15 -07:00
Robert Nishihara	37dafa4d14	Simplify put test and move it to failure tests. (#788 )	2017-07-31 17:57:48 -07:00
Robert Nishihara	1fe49d7676	Simplify testMultipleLocalSchedulers by having it start only one worker. (#789 )	2017-07-31 17:44:45 -07:00
Eric Liang	b6a18cb39b	[rllib] Also refactor DQN to use shared RLlib models (#730 ) * wip * works with cartpole * lint * fix pg * comment * action dist rename * preprocessor * fix test * typo * fix the action[0] nonsense * revert * satisfy the lint * wip * works with cartpole * lint * fix pg * comment * action dist rename * preprocessor * fix test * typo * fix the action[0] nonsense * revert * satisfy the lint * Minor indentation changes. * fix merge * add humanoid * initial dqn refactor * remove tfutil * fix calls * fix tf errors 1 * closer * runs now * lint * tensorboard graph * fix linting * more 4 space * fix * fix linT * more lint * oops * es parity * remove example.py * fix training bug * add cartpole demo * try fixing cartpole * allow model options, configure cartpole * debug * simplify * no dueling * avoid out of file handles * Test dqn in jenkins. * Minor formatting. * fix issue * fix another * Fix problem in which we log to a directory that hasn't been created.	2017-07-26 12:29:00 -07:00
Robert Nishihara	13000b7503	Start processes using the same version of Python that was used to start Ray. (#760 ) * Make local scheduler start workers using the same version of Python that was used to start the local scheduler. * Use current version of python to start new processes instead of hardcoded python executable. * Fix linting.	2017-07-21 00:05:10 +00:00
alanamarzoev	2b3190ad13	Chrome trace timeline with sliders. (#731 ) * Trace timeline with sliders. * Trace. * Switched ujson to json. * Fixed tests. * linting fixes * Fixed bug. * Cleaned up code. * Fixes according to comments. * removed checkpoints. * Undid accidental delete. * Fixed linting error. * Added documentation to notebook. * Undid accidental deletes. * Add comments and small formatting fixes. * Small fix.	2017-07-17 19:59:49 -07:00
Robert Nishihara	80e8426b5e	Test example applications and rllib in jenkins tests. (#707 ) * Test example applications in Jenkins. * Fix default upload_dir argument for Algorithm class. * Fix evolution strategies. * Comment out policy gradient example which doesn't seem to work. * Set --env-name for evolution strategies.	2017-07-16 18:51:33 +00:00
Robert Nishihara	e0867c8845	Switch Python indentation from 2 spaces to 4 spaces. (#726 ) * 4 space indentation for actor.py. * 4 space indentation for worker.py. * 4 space indentation for more files. * 4 space indentation for some test files. * Check indentation in Travis. * 4 space indentation for some rl files. * Fix failure test. * Fix multi_node_test. * 4 space indentation for more files. * 4 space indentation for remaining files. * Fixes.	2017-07-13 21:53:57 +00:00
alanamarzoev	8464d77c76	Change event logs to store one Redis ZSET per worker. (#705 ) * Changing to zset * Fixed bug. * Fixed another bug. * Modified task_profiles. * Removed extra file. * Modified task_profiles test. * WIP * WIP * Undid changes * Updated * WIP * Made changes according to comments. * Removed unneeded print. * Removed ujson usage. * failing test * tests passing * Fixed linting errors and modified style. * Fixed bug. * Fixed linting * Fixed according to comments. * Redis crashing? * Fixed linting * Fixed linting	2017-07-09 01:42:29 +02:00
alanamarzoev	716469160e	Enable dumping profiling information to timeline format viewable by chrome tracing. (#703 ) * Chrome tracing timeline. * Modified decode statement. * Some cleanups and add test. * Remove example. * Fix test.	2017-06-30 12:14:11 -04:00
alanamarzoev	e16df6da9a	Updated task_profiles function to avoid future repetitive parsing. (#691 ) * Updated task_profiles function to avoid future repetitive parsing. * Fix indentation. * Fixed according to comments. * Included updated test for task_profiles function. * Simplify test. * Fix indentation. * Fix.	2017-06-22 19:21:18 -07:00
Robert Nishihara	5ebc2f3f2e	Do resource bookkeeping for actor methods. (#682 ) * Dispatch regular and actor tasks when resources become available. * Make actor methods do resource bookkeeping and add test. * Remove unnecessary field. * Fix linting. * Fix actor test. * Maintain set of actors with pending tasks to speed up task dispatch. * Exit early from task dispatch if there are no resources available. * Fix linting. * Fix error. * Fix bug related to iterator invalidation. * When an actor is removed, remove it from the set of actors with pending tasks.	2017-06-21 05:52:45 +00:00
Robert Nishihara	019ba07e9c	Correct actor class name and module. (#675 ) * Correct actor class name and module. * Add test. * Fix linting.	2017-06-17 05:44:42 +00:00
Philipp Moritz	8798f4e690	fix flaky mac os x plasma store component_failure_test (#673 ) Fix flaky mac os x plasma store component_failure_test.	2017-06-15 00:31:50 -07:00
alanamarzoev	cc4990b543	Task profiles function and test (#647 ) Expose some task profiling information through global state API.	2017-06-13 17:53:34 -07:00
Philipp Moritz	54925996ca	Allow remote functions to specify max executions and kill worker once limit is reached. (#660 ) * implement restarting workers after certain number of task executions * Clean up python code. * Don't start new worker when an actor disconnects. * Move wait_for_pid_to_exit to test_utils.py. * Add test. * Fix linting errors. * Fix linting. * Fix typo.	2017-06-13 00:34:58 -07:00
Eric Liang	d4d2c03ac5	Remove timeout for Redis commands. (#649 ) * update * Remove interaction between callback data identifier and event loop. * Remove tests that no longer apply.	2017-06-09 15:55:36 -07:00
alanamarzoev	f0339f3386	Expose log files through global state API. (#641 ) * added log_table function and a test * fixed log_files and added task_profiles * fixed formatting * fixed linting errors * fixes * removed file * more fixes * hopefully fixed * Small changes. * Fix linting. * Fix bug in log monitor. * Small changes. * Fix bug in travis.	2017-06-08 00:08:10 -07:00
Philipp Moritz	6adf39959c	put back large python object tests (commented out) (#636 )	2017-06-02 20:36:10 -07:00
Robert Nishihara	2694337c0f	Fix large memory tests. (#632 ) * Log the driver ID in hex instead of binary. * Fix large memory test and add more tests to it. * Remove tests that are too stressful.	2017-06-03 01:12:56 +00:00
Robert Nishihara	1a682e2807	Enable starting and stopping ray with "ray start" and "ray stop". (#628 ) * Install start_ray and stop_ray scripts in setup.py. * Update documentation. * Fix docker tests. * Implement stop_ray script in python. * Fix linting.	2017-06-02 20:17:48 +00:00
Robert Nishihara	d0bfc0a849	Clean up actor workers when actor handle goes out of scope. (#617 )	2017-06-01 07:02:43 +00:00
Robert Nishihara	bcaab78908	Add script for building MacOS wheels. (#601 ) * Add script for building MacOS wheels. * Small cleanups to script. * Fix setting of PATH before building wheel. * Create symbolic link to correct Python executable so Ray installation finds the right Python. * Address comments. * Rename readme.	2017-06-01 00:30:46 +00:00
Philipp Moritz	b94b4a35e0	Make the Plasma store ready for Arrow integration (#579 ) * port plasma to arrow * fixes * refactor plasma client * more modernization * fix plasma manager tests * everything compiles * fix plasma client tests * update plasma serialization tests * fix plasma manager tests * fix bug * updates * fix bug * fix tests * fix rebase * address comments * fix travis valgrind build * fix linting * fix include order again * fix linting * address comments	2017-05-31 16:24:23 -07:00
Philipp Moritz	647e1d9fc3	Fix runtest.py on the ubuntu system python 3 (#599 ) * fix runtest.py on the ubuntu system python 3 * less strict version of the test	2017-05-26 15:22:36 -07:00
Robert Nishihara	c5bc76193f	Remove Ray environment variables from codebase. (#590 )	2017-05-24 18:29:40 -07:00
Robert Nishihara	c647dd5f6c	Make it possible to use actor definitions within remote functions and other actors. (#587 ) * Enable remote function and actor definitions to close over actor definitions. * Give better error message if actor objects are pickled. * Add tests for closing over actor definitions. * Fix linting.	2017-05-24 15:43:32 -07:00
Robert Nishihara	07b21e057c	Print the driver stdout/stderr if we fail to decode it in jenkins. (#567 ) * Print the driver stdout/stderr if we fail to decode it in jenkins. * Fix whitespace. * Add explanation.	2017-05-20 23:11:19 -07:00
Stephanie Wang	ee08c8274b	Shard Redis. (#539 ) * Implement sharding in the Ray core * Single node Python modifications to do sharding * Do the sharding in redis.cc * Pipe num_redis_shards through start_ray.py and worker.py. * Use multiple redis shards in multinode tests. * first steps for sharding ray.global_state * Fix problem in multinode docker test. * fix runtest.py * fix some tests * fix redis shard startup * fix redis sharding * fix * fix bug introduced by the map-iterator being consumed * fix sharding bug * shard event table * update number of Redis clients to be 64K * Fix object table tests by flushing shards in between unit tests * Fix local scheduler tests * Documentation * Register shard locations in the primary shard * Add plasma unit tests back to build * lint * lint and fix build * Fix * Address Robert's comments * Refactor start_ray_processes to start Redis shard * lint * Fix global scheduler python tests * Fix redis module test * Fix plasma test * Fix component failure test * Fix local scheduler test * Fix runtest.py * Fix global scheduler test for python3 * Fix task_table_test_and_update bug, from actor task table submission race * Fix jenkins tests. * Retry Redis shard connections * Fix test cases * Convert database clients to DBClient struct * Fix race condition when subscribing to db client table * Remove unused lines, add APITest for sharded Ray * Fix * Fix memory leak * Suppress ReconstructionTests output * Suppress output for APITestSharded * Reissue task table add/update commands if initial command does not publish to any subscribers. * fix * Fix linting. * fix tests * fix linting * fix python test * fix linting	2017-05-18 17:40:41 -07:00
shane	0a4304725f	adding -x for clearer output in build console log (#565 )	2017-05-18 17:04:56 -07:00
Philipp Moritz	28f0882387	Expose function table to python global control state API (#542 ) * expose function table to python global control state API * fix * fix linting * add test for function table	2017-05-16 20:06:13 -07:00
Robert Nishihara	ec2534422b	Remove register_class from API. (#550 ) * Perform ray.register_class under the hood. * Fix bug. * Release worker lock when waiting for imports to arrive in get. * Remove calls to register_class from examples and tests. * Clear serialization state between tests. * Fix bug and add test for multiple custom classes with same name. * Fix failure test. * Fix linting and cleanups to python code. * Fixes to documentation. * Implement recursion depth for recursively registering classes. * Fix linting. * Push warning to user if waiting for class for too long. * Fix typos. * Don't export FunctionToRun if pickling the function fails. * Don't broadcast class definition when pickling class.	2017-05-16 18:38:52 -07:00
Eric Liang	e2e9e4ce6f	Fix segmentation fault when calling ray.put on a dictionary with object keys (#548 ) * fix segfault when serializing dict key * fix style * fix test * Fix linting.	2017-05-15 01:09:13 -07:00

249 commits