#clojure log - Apr 30 2008

13:55 Chouser: Here's a fun challenge (I'll try to do it myself if I get some time later)...

13:56 Implement MapReduce (and of course the standard wordcount example) for SMP (shared memory, not a cluster) in Clojure.

13:57 I'm betting the MapReduce "framework" on top of agents is no more than 10 lines of code, and the wordcount example would be another 3.

13:57 rhickey: Chouser: link?

13:58 Chouser: to MapReduce?

13:59 rhickey: I've read the MapReduce papers, wondering if there was a benchmark thingy

13:59 or implementation in another language...

13:59 Chouser: oh, no. I just made up the challenge because a friend was being impressed by this: http://labs.trolltech.com/blogs/category/labs/threads/qt-concurrent/

14:11 albino: Is that close to the widefinder description?

15:16 Chouser: is there some reason pmap manages its own thread pool instead of using agents?

15:17 I'm not smart enough to know *how* it would use agents, I'm just assuming that's possible... somehow...

15:47 rhickey: Chouser: pmap may have predated agents

15:47 Chouser: ah, ok.

16:25 nsinghal: I have a question

16:25 http://paste.lisp.org/display/59989

16:28 Chouser: your initial example works for me, replacing (new Socket) with (throw (new Exception))

16:32 nsinghal: That's the question - that it works that way. See the example at the end. Why doesn't it work with Socket?

16:34 (import '(java.net Socket))
16:34 (defn connect [host port eatex]
16:34   (try
16:34     (new Socket host port)
16:34     (catch Throwable ex
16:34       (if eatex nil (throw ex)))))
16:34 (connect "127.0.0.1" 3432 true)

16:38 Chouser: oh, sorry, I missed that that was the point of the final example. I thought it had to do with the number of arguments.

16:54 well, I'm stumped.

16:55 rhickey: probably a bug with local clearing prior to tail call - looking at it now

17:18 rhickey: nsinghal: fix is up

17:19 nsinghal: thx

17:31 MarkJP: how would I do a sleep in the first binding of a let...recur loop?

17:31 don't ask

17:32 I could have a flag I guess

17:42 Chouser: inside the body of the loop, or up at the top?

17:48 MarkJP: I wanted to do it at the top. For now I just added a new sleep flag to the let, so the recur passes in false

17:48 the let binds it to true the first time

17:49 Chouser: (loop [_ (Thread.sleep 3) x [:a :b :c]] (prn x) (when x (recur _ (rest x))))

17:49 (loop [_ (Thread.sleep 3) x [:a :b :c]] (prn x) (when x (recur _ (rest x))))

17:49 oops, sorry for the double-post

17:50 MarkJP: cool, is that what _ does?

17:50 Chouser: ...and for the fact that it doesn't work.

17:50 oh, no, it does work. (Thread.sleep 3000) is a bit more noticeable, though.

17:51 MarkJP: cool, thanks

17:51 Chouser: No, _ is just a symbol. You could call it "blank" or "dummy" instead.

17:52 then again, I don't see how that's better than just (do (Thread.sleep 3000) (loop ...))
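
A minimal sketch of the simpler form Chouser mentions, in current Clojure syntax (the static call is written Thread/sleep here, and seq is used for the emptiness check):

    (do
      (Thread/sleep 3000)            ; sleep once, before entering the loop
      (loop [x [:a :b :c]]
        (prn x)
        (when (seq x)                ; stop once the collection is exhausted
          (recur (rest x)))))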

22:25 so here's my attempt at map-reduce: http://paste.lisp.org/display/60005

22:26 Also my first real attempt at using agents.

22:31 I wonder if ref-toggle will be useful for solving other problems.

22:31 * rhickey is still trying to understand it

22:34 Chouser: ref-toggle, or the whole ugly mess?

22:35 I don't really know how to comment code like this.

22:35 rhickey: Chouser: isn't there a partitioning between map and reduce, that makes each reduce job work on related data?

22:38 I think the agents doing mapping and interacting with reducer is too coupled

22:39 Chouser: in map-reduce3?

22:39 rhickey: and 2

22:40 number of agents should be independent of number of docs

22:40 Chouser: cool. Help me understand why. The number of agents is already independent of the number of processors -- so what should determine the number of agents?

22:41 rhickey: make a cycle of agents, distribute docs to each round robin...

22:41 await them

22:41 concat results, send to partitioner

22:41 distribute partitioned results to agent with reduce function

22:41 Chouser: won't the thread pool cycle through the agents for me? Why should I write that logic again?

22:41 rhickey: await

22:41 merge results

22:42 more agents means more independent results to merge, should be a setting
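
A rough sketch of the shape rhickey describes, in current Clojure (docs, tokenize, and the agent count are illustrative, not from the paste): a fixed pool of mapping agents, documents dealt out round robin, the pool awaited, and the per-agent results concatenated for the partition step.

    (def docs ["the cat sat" "the dog sat" "the cat ran"])   ; sample input

    (defn tokenize [s] (re-seq #"\w+" s))                    ; stand-in map step

    (def num-agents 4)                                        ; "should be a setting"
    (def mappers (vec (repeatedly num-agents #(agent []))))

    ;; deal documents out to the mapping agents round robin
    (doseq [[i doc] (map-indexed vector docs)]
      (send (mappers (rem i num-agents))
            (fn [acc d] (conj acc (frequencies (tokenize d))))
            doc))

    (apply await mappers)

    ;; concatenated intermediate results; the next step would partition these by key
    (def intermediate (apply concat (map deref mappers)))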

22:44 Chouser: every hash produced by a map run has to be merged, regardless of whether the agent producing it has produced other hashes already.

22:44 rhickey: would you have an agent per word?

22:45 Chouser: No, because there's no work to be done for a single word. I have one agent per parallelizable map run.

22:46 The drawback of my solution is what, too much contention for the final-hash?

22:47 rhickey: I don't think you are capturing the mapreduce model

22:48 Chouser: hm, it's been a while since I've read the paper. Perhaps I should look at that again.

22:48 rhickey: partition input, divide among compute resources, partition intermediate results, divide reduction among compute resources, merge final result

22:48 Chouser: I see, and it's the partitioning at each step that I'm punting on.

22:49 rhickey: having a settable number of agents allows you to model compute resources, independent of how Clojure distributes to processors

22:50 but it's likely numagents == numprocessors is optimal for this, since even in map they are doing combining (of counts)

22:53 (assoc hash word (inc (get (hash word) 0))) more idiomatic

22:53 oops: (assoc hash word (inc (get hash word 0)))

22:54 Chouser: I don't like inc. Another keyword to learn, when (+ 1 ...) is just as succinct (usually) and more obvious.

22:54 (get hash word 0) is nice.
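
For reference, that corrected idiom in a tiny word-count reduce (the (+ 1 ...) spelling Chouser prefers behaves identically):

    (reduce (fn [counts word] (assoc counts word (inc (get counts word 0))))
            {}
            ["the" "cat" "the"])
    ;=> {"the" 2, "cat" 1}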

22:54 rhickey: the first partition is mechanical, the second logical

22:55 inc may be faster

22:55 Chouser: hm.

22:59 rhickey: second partition can be made mechanical, use rem of hash of key to index reducing agents
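
A sketch of that mechanical second partition, assuming clojure.core/hash over the key (mod keeps the index non-negative):

    (defn reducer-index [word num-reducers]
      (mod (hash word) num-reducers))

    (reducer-index "the" 3)   ; a given key always lands on the same reducer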

22:59 Chouser: well, it was a lot of fun to write. I'll have to re-read the paper to understand the purpose of the partitioning.

23:00 rhickey: you need to end up with one result per key, so you need to send all of the data for a particular key to the same reducing agent

23:01 otherwise the final merger is doing reducing

23:01 Chouser: Aaaaa...

23:02 should the passed-in reducer then not be operating on two hashes, but on two values (ints in this case)?

23:03 the reducer should just be +

23:03 rhickey: if you look at it as data flows...

23:03 docs flow into maps...

23:04 key/val maps flow out of maps into partition...

23:04 key/val pairs flow from partition to reducers

23:05 maps flow out of reducers and get concatenated as final result

23:05 Chouser: I think that Qt API I posted misled me. The reduce function in Google's paper takes a key and a seq of values. Qt's takes two hashes.

23:08 rhickey: key and seq of values means partition is doing more work, so batch is all together, makes sense for the distributed case

23:08 the main thing is to keep the stages/steps separate

23:10 Chouser: if the seq only had 2 items, that would match your "key/val pairs flow ... to reducers" description, right?

23:11 rhickey: let's say there were 3 mapping agents

23:12 the first found "the" 1 time, the second 2 times and the third 3 times

23:12 one partitioning model determines that the "the" reducer is reducer #2

23:13 when it gets the result from mapper one, it sends ["the" 1] to reducer #2

23:13 later when it gets the results from mapper two, it sends ["the" 2] to reducer #2

23:13 etc

23:14 another model has it gather all the results from the mappers and send ["the" 1 2 3] to reducer #2
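
A sketch of the first of those models, with the numbers from rhickey's example (the reducer agents and the choice of #2 are illustrative): each ["the" n] pair is sent to the same reducing agent, which keeps a running map of key to summed count.

    (def reducers (vec (repeatedly 3 #(agent {}))))

    (defn accept-pair [counts [word n]]
      (assoc counts word (+ n (get counts word 0))))

    ;; all three pairs for "the" go to the same reducer, here #2
    (doseq [pair [["the" 1] ["the" 2] ["the" 3]]]
      (send (reducers 2) accept-pair pair))

    (apply await reducers)
    @(reducers 2)   ;=> {"the" 6}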

23:14 foones: Hi!

23:14 I'm trying to understand Clojure's support for concurrency

23:14 I think I understand what agents and refs are

23:14 Chouser: Ok. And that's when I think I should have one agent for each word, until I read your Socratic question from 30 minutes ago.

23:15 foones: but I'm not sure as to how I can execute two pieces of code in parallel

23:16 rhickey: foones: (send agent1 ajob) (send agent2 ajob) (await agent1 agent2)
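
A minimal, self-contained version of that (the agent names and the job are illustrative):

    (def agent1 (agent nil))
    (def agent2 (agent nil))

    (defn ajob [_]
      (Thread/sleep 100)     ; stand-in for real work
      :done)

    (send agent1 ajob)
    (send agent2 ajob)
    (await agent1 agent2)    ; blocks until both jobs have run
    [@agent1 @agent2]        ;=> [:done :done]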

23:17 Chouser: rhickey: bedtime for me. Thanks for walking through this with me, I really appreciate it.

23:17 foones: Aha, and if in one of the jobs I need to read input (e.g. user input)

23:17 rhickey: sure

23:17 foones: it won't wait for it to end?

23:18 rhickey: foones: for blocking jobs use send-off, and don't await it. It can put its result in a ref when it arrives
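
A sketch of that pattern (names are illustrative): the blocking job runs via send-off and drops its result into a ref, so nothing has to await it.

    (def user-input (ref nil))
    (def reader-agent (agent nil))

    (send-off reader-agent
              (fn [_]
                (let [line (read-line)]           ; blocks until input arrives
                  (dosync (ref-set user-input line))
                  line)))

    ;; later, read the ref whenever it's convenient
    @user-input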

23:19 foones: Ahh, I understand

23:19 Thanks for your help

23:24 gomer-nyc: hi, does anybody have experience with returning proxies from a macro?

23:25 or rather, *a* proxy from a macro?

23:35 Chouser: gomer-nyc: you missed rhickey by 5 minutes.

23:35 And I'm going to bed. :-/ Maybe try the google discussion group?

23:36 gomer-nyc: yeah, good idea; I've searched, but there's not much on proxies & macros like what I'm looking for. maybe I'll post a q there. good night :-)
