Commit graph

227 commits

Author SHA1 Message Date
alanamarzoev
2b3190ad13 Chrome trace timeline with sliders. (#731)
* Trace timeline with sliders.

* Trace.

* Switched ujson to json.

* Fixed tests.

* linting fixes

* Fixed bug.

* Cleaned up code.

* Fixes according to comments.

* removed checkpoints.

* Undid accidental delete.

* Fixed linting error.

* Added documentation to notebook.

* Undid accidental deletes.

* Add comments and small formatting fixes.

* Small fix.
2017-07-17 19:59:49 -07:00
Robert Nishihara
80e8426b5e Test example applications and rllib in jenkins tests. (#707)
* Test example applications in Jenkins.

* Fix default upload_dir argument for Algorithm class.

* Fix evolution strategies.

* Comment out policy gradient example which doesn't seem to work.

* Set --env-name for evolution strategies.
2017-07-16 18:51:33 +00:00
Robert Nishihara
e0867c8845 Switch Python indentation from 2 spaces to 4 spaces. (#726)
* 4 space indentation for actor.py.

* 4 space indentation for worker.py.

* 4 space indentation for more files.

* 4 space indentation for some test files.

* Check indentation in Travis.

* 4 space indentation for some rl files.

* Fix failure test.

* Fix multi_node_test.

* 4 space indentation for more files.

* 4 space indentation for remaining files.

* Fixes.
2017-07-13 21:53:57 +00:00
alanamarzoev
8464d77c76 Change event logs to store one Redis ZSET per worker. (#705)
* Changing to zset

* Fixed bug.

* Fixed another bug.

* Modified task_profiles.

* Removed extra file.

* Modified task_profiles test.

* WIP

* WIP

* Undid changes

* Updated

* WIP

* Made changes according to comments.

* Removed unneeded print.

* Removed ujson usage.

* failing test

* tests passing

* Fixed linting errors and modified style.

* Fixed bug.

* Fixed linting

* Fixed according to comments.

* Redis crashing?

* Fixed linting

* Fixed linting
2017-07-09 01:42:29 +02:00
alanamarzoev
716469160e Enable dumping profiling information to timeline format viewable by chrome tracing. (#703)
* Chrome tracing timeline.

* Modified decode statement.

* Some cleanups and add test.

* Remove example.

* Fix test.
2017-06-30 12:14:11 -04:00
alanamarzoev
e16df6da9a Updated task_profiles function to avoid future repetitive parsing. (#691)
* Updated task_profiles function to avoid future repetitive parsing.

* Fix indentation.

* Fixed according to comments.

* Included updated test for task_profiles function.

* Simplify test.

* Fix indentation.

* Fix.
2017-06-22 19:21:18 -07:00
Robert Nishihara
5ebc2f3f2e Do resource bookkeeping for actor methods. (#682)
* Dispatch regular and actor tasks when resources become available.

* Make actor methods do resource bookkeeping and add test.

* Remove unnecessary field.

* Fix linting.

* Fix actor test.

* Maintain set of actors with pending tasks to speed up task dispatch.

* Exit early from task dispatch if there are no resources available.

* Fix linting.

* Fix error.

* Fix bug related to iterator invalidation.

* When an actor is removed, remove it from the set of actors with pending tasks.
2017-06-21 05:52:45 +00:00
Robert Nishihara
019ba07e9c Correct actor class name and module. (#675)
* Correct actor class name and module.

* Add test.

* Fix linting.
2017-06-17 05:44:42 +00:00
Philipp Moritz
8798f4e690 fix flaky mac os x plasma store component_failure_test (#673)
Fix flaky mac os x plasma store component_failure_test.
2017-06-15 00:31:50 -07:00
alanamarzoev
cc4990b543 Task profiles function and test (#647)
Expose some task profiling information through global state API.
2017-06-13 17:53:34 -07:00
Philipp Moritz
54925996ca Allow remote functions to specify max executions and kill worker once limit is reached. (#660)
* implement restarting workers after certain number of task executions

* Clean up python code.

* Don't start new worker when an actor disconnects.

* Move wait_for_pid_to_exit to test_utils.py.

* Add test.

* Fix linting errors.

* Fix linting.

* Fix typo.
2017-06-13 00:34:58 -07:00
Eric Liang
d4d2c03ac5 Remove timeout for Redis commands. (#649)
* update

* Remove interaction between callback data identifier and event loop.

* Remove tests that no longer apply.
2017-06-09 15:55:36 -07:00
alanamarzoev
f0339f3386 Expose log files through global state API. (#641)
* added log_table function and a test

* fixed log_files and added task_profiles

* fixed formatting

* fixed linting errors

* fixes

* removed file

* more fixes

* hopefully fixed

* Small changes.

* Fix linting.

* Fix bug in log monitor.

* Small changes.

* Fix bug in travis.
2017-06-08 00:08:10 -07:00
Philipp Moritz
6adf39959c put back large python object tests (commented out) (#636) 2017-06-02 20:36:10 -07:00
Robert Nishihara
2694337c0f Fix large memory tests. (#632)
* Log the driver ID in hex instead of binary.

* Fix large memory test and add more tests to it.

* Remove tests that are too stressful.
2017-06-03 01:12:56 +00:00
Robert Nishihara
1a682e2807 Enable starting and stopping ray with "ray start" and "ray stop". (#628)
* Install start_ray and stop_ray scripts in setup.py.

* Update documentation.

* Fix docker tests.

* Implement stop_ray script in python.

* Fix linting.
2017-06-02 20:17:48 +00:00
Robert Nishihara
d0bfc0a849 Clean up actor workers when actor handle goes out of scope. (#617) 2017-06-01 07:02:43 +00:00
Robert Nishihara
bcaab78908 Add script for building MacOS wheels. (#601)
* Add script for building MacOS wheels.

* Small cleanups to script.

* Fix setting of PATH before building wheel.

* Create symbolic link to correct Python executable so Ray installation finds the right Python.

* Address comments.

* Rename readme.
2017-06-01 00:30:46 +00:00
Philipp Moritz
b94b4a35e0 Make the Plasma store ready for Arrow integration (#579)
* port plasma to arrow

* fixes

* refactor plasma client

* more modernization

* fix plasma manager tests

* everything compiles

* fix plasma client tests

* update plasma serialization tests

* fix plasma manager tests

* fix bug

* updates

* fix bug

* fix tests

* fix rebase

* address comments

* fix travis valgrind build

* fix linting

* fix include order again

* fix linting

* address comments
2017-05-31 16:24:23 -07:00
Philipp Moritz
647e1d9fc3 Fix runtest.py on the ubuntu system python 3 (#599)
* fix runtest.py on the ubuntu system python 3

* less strict version of the test
2017-05-26 15:22:36 -07:00
Robert Nishihara
c5bc76193f Remove Ray environment variables from codebase. (#590) 2017-05-24 18:29:40 -07:00
Robert Nishihara
c647dd5f6c Make it possible to use actor definitions within remote functions and other actors. (#587)
* Enable remote function and actor definitions to close over actor definitions.

* Give better error message if actor objects are pickled.

* Add tests for closing over actor definitions.

* Fix linting.
2017-05-24 15:43:32 -07:00
Robert Nishihara
07b21e057c Print the driver stdout/stderr if we fail to decode it in jenkins. (#567)
* Print the driver stdout/stderr if we fail to decode it in jenkins.

* Fix whitespace.

* Add explanation.
2017-05-20 23:11:19 -07:00
Stephanie Wang
ee08c8274b Shard Redis. (#539)
* Implement sharding in the Ray core

* Single node Python modifications to do sharding

* Do the sharding in redis.cc

* Pipe num_redis_shards through start_ray.py and worker.py.

* Use multiple redis shards in multinode tests.

* first steps for sharding ray.global_state

* Fix problem in multinode docker test.

* fix runtest.py

* fix some tests

* fix redis shard startup

* fix redis sharding

* fix

* fix bug introduced by the map-iterator being consumed

* fix sharding bug

* shard event table

* update number of Redis clients to be 64K

* Fix object table tests by flushing shards in between unit tests

* Fix local scheduler tests

* Documentation

* Register shard locations in the primary shard

* Add plasma unit tests back to build

* lint

* lint and fix build

* Fix

* Address Robert's comments

* Refactor start_ray_processes to start Redis shard

* lint

* Fix global scheduler python tests

* Fix redis module test

* Fix plasma test

* Fix component failure test

* Fix local scheduler test

* Fix runtest.py

* Fix global scheduler test for python3

* Fix task_table_test_and_update bug, from actor task table submission race

* Fix jenkins tests.

* Retry Redis shard connections

* Fix test cases

* Convert database clients to DBClient struct

* Fix race condition when subscribing to db client table

* Remove unused lines, add APITest for sharded Ray

* Fix

* Fix memory leak

* Suppress ReconstructionTests output

* Suppress output for APITestSharded

* Reissue task table add/update commands if initial command does not publish to any subscribers.

* fix

* Fix linting.

* fix tests

* fix linting

* fix python test

* fix linting
2017-05-18 17:40:41 -07:00
shane
0a4304725f adding -x for clearer output in build console log (#565) 2017-05-18 17:04:56 -07:00
Philipp Moritz
28f0882387 Expose function table to python global control state API (#542)
* expose function table to python global control state API

* fix

* fix linting

* add test for function table
2017-05-16 20:06:13 -07:00
Robert Nishihara
ec2534422b Remove register_class from API. (#550)
* Perform ray.register_class under the hood.

* Fix bug.

* Release worker lock when waiting for imports to arrive in get.

* Remove calls to register_class from examples and tests.

* Clear serialization state between tests.

* Fix bug and add test for multiple custom classes with same name.

* Fix failure test.

* Fix linting and cleanups to python code.

* Fixes to documentation.

* Implement recursion depth for recursively registering classes.

* Fix linting.

* Push warning to user if waiting for class for too long.

* Fix typos.

* Don't export FunctionToRun if pickling the function fails.

* Don't broadcast class definition when pickling class.
2017-05-16 18:38:52 -07:00
Eric Liang
e2e9e4ce6f Fix segmentation fault when calling ray.put on a dictionary with object keys (#548)
* fix segfault when serializing dict key

* fix style

* fix test

* Fix linting.
2017-05-15 01:09:13 -07:00
Robert Nishihara
9f91eb8c91 Change API for remote function declaration, actor instantiation, and actor method invocation. (#541)
* Direction substitution of @ray.remote -> @ray.task.

* Changes to make '@ray.task' work.

* Instantiate actors with Class.remote() instead of Class().

* Convert actor instantiation in tests and examples from Class() to Class.remote().

* Change actor method invocation from object.method() to object.method.remote().

* Update tests and examples to invoke actor methods with .remote().

* Fix bugs in jenkins tests.

* Fix example applications.

* Change @ray.task back to @ray.remote.

* Changes to make @ray.actor -> @ray.remote work.

* Direct substitution of @ray.actor -> @ray.remote.

* Fixes.

* Raise exception if @ray.actor decorator is used.

* Simplify ActorMethod class.
2017-05-14 00:01:20 -07:00
Robert Nishihara
f32368bcbe Prevent actors from being placed on removed nodes or nodes with no CPUs. (#527)
* Make note about bug in which actor creation notification message is not received.

* Prevent actors from being created on removed nodes.

* Prevent actors from being created on nodes with no CPUs.

* Fix linting.

* Add test for scheduling actors on local schedulers with no CPUs.

* Improve error message when actors created before ray.init called.
2017-05-08 20:39:43 -07:00
Robert Nishihara
c688a64235 Expose GPU IDs to remote functions. (#496)
* Change local scheduler bookkeeping to use GPU IDs.

* Update actor test.

* Add tests for actors and tasks simultaneously using GPUs.

* Add additional task GPU ID test.

* Fix linting.

* Make redis GPU assignment ignore GPU IDs.

* Small fix.
2017-05-07 13:03:49 -07:00
Robert Nishihara
8532ba4272 Serialize lambdas, sets, and types with pickle by default. (#511)
* Serialize lambdas with pickle by default.

* Serialize sets with pickle by default.

* Serialize types with pickle by default.

* Small update to documentation.

* Update tests.
2017-05-04 00:16:35 -07:00
Robert Nishihara
245c8ab888 Make sure user seeding does not affect actor ID generation. (#506)
* Make sure user seeding does not affect actor ID generation.

* Fix linting.

* Add test.
2017-05-03 16:29:55 -07:00
Robert Nishihara
1627f89945 Fix problem in which actors and workers running tasks are not killed by driver exit. (#490)
* Augment test to verify that relevant workers and actors are killed during driver cleanup.

* Fix bug in which we were only killing one worker when a driver exited.

* Fix remove driver test.

* Fix and augment test.
2017-04-26 15:13:39 -07:00
Robert Nishihara
0ac125e9b2 Clean up when a driver disconnects. (#462)
* Clean up state when drivers exit.

* Remove unnecessary field in ActorMapEntry struct.

* Have monitor release GPU resources in Redis when driver exits.

* Enable multiple drivers in multi-node tests and test driver cleanup.

* Make redis GPU allocation a redis transaction and small cleanups.

* Fix multi-node test.

* Small cleanups.

* Make global scheduler take node_ip_address so it appears in the right place in the client table.

* Cleanups.

* Fix linting and cleanups in local scheduler.

* Fix removed_driver_test.

* Fix bug related to vector -> list.

* Fix linting.

* Cleanup.

* Fix multi node tests.

* Fix jenkins tests.

* Add another multi node test with many drivers.

* Fix linting.

* Make the actor creation notification a flatbuffer message.

* Revert "Make the actor creation notification a flatbuffer message."

This reverts commit af99099c8084dbf9177fb4e34c0c9b1a12c78f39.

* Add comment explaining flatbuffer problems.
2017-04-24 18:10:21 -07:00
Philipp Moritz
8ac6c59931 Remove n^2 algorithm in plasma get (#466)
Remove n^2 algorithm in plasma get.
2017-04-17 23:37:33 -07:00
Robert Nishihara
c802e51d36 Re-enable recursive remote functions in a limited form. (#453)
* Re-enable recursive remote functions in a limited form.

* Fix linting.
2017-04-13 01:47:33 -07:00
Richard Liaw
c3a2505ffd Loadbalancing Test issue (#452)
* Limiting number of CPUs in loadbalancing test

* fixes as requested
2017-04-11 22:33:58 -07:00
Robert Nishihara
f4c1adae17 Unify function signature handling between remote functions and actor … (#441)
* Unify function signature handling between remote functions and actor methods.

* Fixes.

* Fix tests.
2017-04-08 21:34:13 -07:00
Robert Nishihara
7cd00741b1 Suppress irrelevant Redis connection errors. (#434)
* Suppress error messages in worker import thread when Redis terminates.

* Suppress some warnings from one of the tests.
2017-04-07 23:19:24 -07:00
Robert Nishihara
0eac3ccdd0 Reduce verbosity of component_failures_test.py. (#440) 2017-04-07 23:05:29 -07:00
Robert Nishihara
7af6f462fb Add API for querying global control state. (#431)
* Add API for querying global control state.

* Fix linting.

* Fix errors in Python 2.

* Fix bug in test.

* Fix bug in test.
2017-04-06 23:51:12 -07:00
Robert Nishihara
320109a5bd By default, start a number of workers equal to the number of CPUs. (#430)
* By default, start a number of workers equal to the number of CPUs.

* Fix stress tests.
2017-04-06 00:02:58 -07:00
Robert Nishihara
fa363a5a3a Notify driver when a worker dies while executing a task. (#419)
* Notify driver when a worker dies while executing a task.

* Fix linting.

* Don't push error when local scheduler is cleaning up.
2017-04-06 00:02:39 -07:00
Philipp Moritz
4043769ba2 Make putting large objects work. (#411)
* putting large objects

* add more checks

* support large objects

* fix test

* fix linting

* upgrade to latest arrow version

* check malloc return code

* print mmap file sizes

* printing

* revert to dlmalloc

* add prints

* more prints

* add printing

* printing

* fix

* update

* fix

* update

* print

* initialization

* temp

* fix

* update

* fix linting

* comment out object_store_full tests

* fix test

* fix test

* evict objects if dlmalloc fails

* fix stresstests

* Fix linting.

* Uncomment large-memory tests.

* Increase memory for docker image for jenkins tests.

* Reduce large memory tests.

* Further reduce large memory tests.
2017-04-05 01:04:05 -07:00
Robert Nishihara
ba02fc0eb0 Run flake8 in Travis and make code PEP8 compliant. (#387) 2017-03-21 12:57:54 -07:00
Stephanie Wang
083e7a28ad Push an error to the driver when the workload hangs on ray.put reconstruction (#382)
* Fix worker blocked bug

* tmp

* Push an error to the driver on ray.put for non-driver tasks

* Fix result table tests

* Fix test, logging

* Address comments

* Fix suppression bug

* Fix redis module test

* Edit error message

* Get values in chunks during reconstruction

* Test case for driver ray.put errors

* Error for evicting ray.put objects from the driver

* Fix tests

* Reduce verbosity

* Documentation
2017-03-21 00:16:48 -07:00
Johann Schleier-Smith
29c8471fd4 Add multinode tests by simulating multiple nodes using Docker. (#378)
* run test workloads for a Docker cluster

* better manage docker image versions

* Changes to make multinode docker tests work with Python 3.

* option to mount local test directory on head node to speed development

* Attempt to simplify multinode test setup.

* Small change.

* Add in development-mode to run multinode docker tests more easily during development.

* add jenkins test script that links to Docker hash

* Read docker SHA from build_docker.sh and add test that should fail.

* Consolidate implementations and remove duplicate files.

* Allow test to retry if it fails to schedule on all nodes.

* Remove sleep when in docker multinode tests.
2017-03-18 23:44:54 -07:00
Stephanie Wang
12c9618c0c Plasma and worker node failure. (#373)
* Failing test case

* Local scheduler exits cleanly after plasma store dies

* Tolerate one plasma store failure

* Tolerate plasma store failures on all nodes except head node

* Plasma manager heartbeats

* Component failure tests

* Don't run the helper for Python testing

* Fix C test

* Fix hanging plasma transfer test

* Fix python3

* Consolidate ClientConnection code

* Fix valgrind test

* fix c test

* We can restart worker nodes!

* Fix flatbuffers bug

* Address comments

* Only register actual workers with the local scheduler

* Fix bug

* Fix segfaults

* Add test case that tests for driver liveness, fix local scheduler bug

* Clean up after tests

* Allocate retry info on the stack

* Send SIGKILL before waiting

* Relax unit test conditions

* Driver liveness test case and documentation
2017-03-17 17:03:58 -07:00
Robert Nishihara
6b1e8caf2d Reduce stress_test verbosity. (#377) 2017-03-16 20:10:56 -07:00