#clojure log - Apr 17 2008

8:48 Chouser: Hm, perhaps slurp should use FileChannel.map

8:49 rhickey: That could be cool - never used the NIO stuff

8:51 can you get it into a string without it being in memory twice?

8:54 Chouser: I dunno, but it might be possible.

8:56 Docs say read is faster for files smaller than "a few tens of kilobytes"

8:57 You get a subclass of ByteBuffer, not a string.

9:02 rhickey: are you proposing slurp return a ByteBuffer?

9:03 Chouser: did you log yesterday's irc? I was offline

9:04 Chouser: you didn't miss much: http://n01se.net/chouser/clojure-log/2008-04-16.html

9:04 should I just let google loose on those logs? We can always move them later.

9:05 rhickey: sure

9:05 thanks

9:05 Chouser: ok

9:13 * drewr wonders why the logs are blank in his browser

9:14 drewr: Works in Safari, but not FF3.

9:17 jteo: true

9:25 Chouser: I've had people complain about other of my sites in FF3. Works fine for me, though.

9:26 drewr: You on OS X?

9:26 Chouser: nope, Linux.

9:26 I can try OS X. That's where it's failing for you?

9:27 drewr: Yup. Just get a blank gradient background.

9:27 cgrand: Chouser: I think FF3 chokes on the empty script tag (FF3/win xp) (if I edit it it works)

9:27 jteo: same here. FF3.

9:30 Chouser: hm!

9:31 FF2 on OS X works fine. I don't have FF3 on that machine yet.

9:32 The script tag has a src="". Without that you're missing the navigation links, right?

9:33 drewr: I'm experimenting with this tracing code that rhickey posted to the list. I've evaled all the forms at the REPL, but when I get to (trace fact), I get "no such var: clojure/traced." What's the deal?

9:33 cgrand: chouser: use <script type="text/javascript" src="irc.js"></script> instead of <script type="text/javascript" src="irc.js"/> (<script> should not be "collapsed")

9:33 drewr: It should be in the user ns.

9:34 Chouser: cgrand: oh, of course. thanks.

9:39 there, how's that?

9:39 drewr: Chouser: :-)

9:40 cgrand: chouser: works

9:40 Chouser: great, thanks for your help.

9:44 ByteBuffer seems pretty hard to deal with, compared to String.

9:45 cgrand: chouser: will you regenerate logs anterior to 2008-04-13?

9:47 Chouser: cgrand: I don't have them.

9:48 My IRC client is generating the raw logs that I'm using, so I've got nothing from before I joined the channel.

9:48 oh, wait.

9:48 sorry, misunderstood

9:48 cgrand: :-)

9:49 Chouser: hm, those should have been done already.

9:56 cgrand: chouser: last modification time says 13-Apr-2008 12:52 :-(

10:03 Chouser: there, try that.

10:10 cgrand: chouser: perfect

10:58 rhickey: drewr: trace working now?

11:08 drewr: rhickey: No, not yet.

11:09 Can't figure out why it wants traced to be in clojure's namespace.

11:13 Chouser: How do I write a type hint for a Java byte[]? #^byte[] doesn't work. ;-)

11:15 rhickey: you can type hint arrays with the java.lang.Class.getName format as a String: #^"[B"
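
The "[B" string in that hint is the name the JVM itself reports for the byte[] class. A minimal Java sketch of where such names come from (the class and method names here are hypothetical, chosen only for illustration):

```java
// Hypothetical helper: exposes the JVM's own name for a class.
final class ArrayClassNames {
    // Returns Class.getName(), which for array classes is a descriptor-style
    // string: one '[' per dimension followed by an element type code.
    static String jvmName(Class<?> c) {
        return c.getName();
    }
}
```

For example, `ArrayClassNames.jvmName(byte[].class)` gives "[B" and `ArrayClassNames.jvmName(String[].class)` gives "[Ljava.lang.String;".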

11:16 Chouser: yum! ok.

11:18 hehe. this syntax highlighter totally wigs out on that.

11:19 rhickey: drewr: want to try the latest (819) with clean/build?

11:19 drewr: Hm, I'm at 818. Let me try that.

11:21 Chouser: re-seq on a 3MB file: with slurp 1016 msecs, with map-slurp 375 msecs

11:23 drewr: rhickey: Didn't help.

11:24 ...for the Repl. It works now in Script.

11:24 Chouser: re-seq on a 13MB file: with map-slurp 1285 msecs, with slurp OutOfMemoryError: Java heap space

11:25 rhickey: are you using asCharBuffer?

11:27 Chouser: no, I couldn't get that to work for me.

11:27 rhickey: drewr: just did the same thing here, works in Repl fine, hmm...

11:27 Chouser: It looks like maybe asCharBuffer is interpreting the bytes as UTF-16 or something.

11:27 rhickey: Chouser: so what does map-slurp do?

11:27 Chouser: I wrote my own CharSequence proxy
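
The idea of a CharSequence proxy over a mapped file can be sketched in Java roughly as follows. This is not Chouser's paste, just an illustration under the assumption that every byte is exactly one character (ASCII or Latin-1); the class name ByteBufferChars and the mapFile helper are made up for this sketch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical class: a read-only CharSequence view over single-byte text.
// Assumes one byte per character; it is wrong for UTF-8 or other
// variable-width encodings.
final class ByteBufferChars implements CharSequence {
    private final ByteBuffer buf;
    private final int start; // absolute index of the first byte in view
    private final int end;   // absolute index one past the last byte in view

    ByteBufferChars(ByteBuffer buf) {
        this(buf, 0, buf.limit());
    }

    private ByteBufferChars(ByteBuffer buf, int start, int end) {
        this.buf = buf;
        this.start = start;
        this.end = end;
    }

    // Maps the whole file read-only. The mapping stays valid after the
    // channel is closed; the OS typically pages data in on first access.
    static ByteBufferChars mapFile(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return new ByteBufferChars(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }

    @Override public int length() {
        return end - start;
    }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return (char) (buf.get(start + index) & 0xFF);
    }

    // Returns another view over the same bytes; nothing is copied.
    @Override public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        return new ByteBufferChars(buf, start + from, start + to);
    }

    // Copies the viewed bytes into a String, giving up the memory benefit.
    @Override public String toString() {
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

Because java.util.regex accepts any CharSequence, a view like this can be handed to `Pattern.matcher` without first building a String of the whole file.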

11:28 drewr: Another related question. TRACE and UNTRACE both RESOLVE the function that gets passed in, however, when I do that directly, I get a ClassCastException. What's the difference between (resolve fact) at the REPL and that inside the macro?

11:28 rhickey: (resolve 'fact)

11:29 drewr: Yes, but TRACE doesn't call (resolve (quote f)), it calls (resolve f).

11:29 vincenz: drewr: but it's a macro, so it's passed in the symbol

11:29 drewr: that had me scratching my head a long time too, why is trace a macro

11:29 drewr: vincenz: Ah, thanks.

11:30 vincenz: it's to get the name of the function in there, not the value

11:30 drewr: Of course.

11:30 rhickey: like doc, trace is a macro because it's really a repl-user convenience thing

11:31 I don't recommend doing that generally for things that take symbols

11:31 drewr: rhickey: BTW, I did a C-c C-k to compile and load the file, and it worked doing that. I'm not sure why C-M-x on the forms didn't work originally.

11:32 rhickey: I used C-M-x on each form

11:32 drewr: Interesting.

11:33 rhickey: but get same error as you when I C-M-x on all the forms!

11:33 drewr: Hm. What's the difference between your last two comments?

11:34 rhickey: one-at-a-time vs block

11:34 vincenz: rather odd

11:34 drewr: Can you do C-M-x on a region? What do you mean by block?

11:35 rhickey: region, I don't know if it is supposed to work

11:36 Chouser: There's probably a better way to do this, but here's my map-slurp: http://n01se.net/paste/F0I

11:37 drewr: I generally do this: M-< to get to the top of the buffer, and then C-M-x, C-M-e all the way down. That's what failed me the first time with the trace stuff. Not sure why that's different.

11:37 I had a clean JVM because I restarted after I rebuilt clojure.jar.

11:40 rhickey: Chouser: interesting. I'm not sure the length times matter, but the re-seq time diff is something. Are you running -server, multiple tries?

11:41 I've found generally that laziness has provided a whole additional set of benefits in performance due to reduced heap pressure, in spite of the ephemeral garbage it generates

11:42 Chouser: rhickey: no -server, and "multiple" only on the order of 4 or 5 times, but the results seem stable.

11:42 rhickey: -server rocks

11:43 some parts of Clojure can be 4-10x faster

11:43 cgrand: chouser: .length is unfair: with map-slurp it returns the byte-size and slurp the character size

11:44 Chouser: cgrand: good point! I hadn't thought of that.

11:44 rhickey: yeah, running through is all that matters

11:45 Chouser: I included the length example mainly to show slurp just falls over at that size.

11:47 I also assume there are things you might want to do with slurp where you really want a String, where a CharSequence won't cut it.

11:47 rhickey: that's the diff between eager and lazy

11:48 Chouser: Using the toString method there would presumably destroy the benefit of map-slurp

11:48 rhickey: for the memory usage related benefits

11:48 Chouser: rhickey: yeah, I guess that's true. I hadn't thought of it that way, but this is lazy right through the OS down to the disk.

11:49 rhickey: I think it is really interesting, need to look more at CharSequence

11:53 a lot of the bridging Clojure does to String in API funcs could be done at CharSequence level

11:53 seq/nth/get/count

11:56 Chouser: I wonder if there's a better way to do toString there, too. I'm copying from the mapped buffer into an array, and then I think String makes another copy.

11:56 can I proxy byte[]?

11:58 cgrand: chouser: I don't think so

12:03 chouser: whatever (except another String) you pass to a String constructor will get copied because it's mutable (char[] or byte[] or StringBuilder/StringBuffer)...

12:08 Chouser: sure, but I'd like to copy once instead of twice

12:08 I'd like to hand a CharSequence directly to String, for example, instead of having to copy into an array first.

12:13 cgrand: the best you can do is to build a char[] from the ByteBuffer and pass it to String :-( (If you pass a byte[], this array is copied before decoding (!) and then a char[] is allocated...)

12:15 Chouser: :-(

12:23 cgrand: have you tried Charset.forName("UTF-8").newDecoder().decode(bytebuffer).subSequence... to get chouser: a CharSequence with correct charAt and length?

12:24 (oops "chouser:" should have been at the start of the message...)
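
A hedged Java sketch of that suggestion: map the file, then run the whole buffer through a UTF-8 CharsetDecoder. The class and method names are hypothetical, and StandardCharsets.UTF_8 is used as the modern equivalent of Charset.forName("UTF-8").

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class for this sketch.
final class DecodedSlurp {
    // Eagerly decodes all remaining bytes as UTF-8. The returned CharBuffer
    // is a CharSequence whose length() and charAt() count chars, not bytes.
    static CharBuffer decodeUtf8(ByteBuffer bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder().decode(bytes);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid UTF-8", e);
        }
    }

    // Maps the file read-only and decodes it in one go (not lazy).
    static CharBuffer slurpChars(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return decodeUtf8(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }
}
```

The decoded characters live in a freshly allocated buffer, so the file contents are not held on the Java heap as bytes and chars at the same time, but the whole file is still decoded up front.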

12:43 Chouser: heh. no, I didn't. What I've got already really stretched my Java (and JavaDoc) abilities.

12:43 Let me try it...

12:51 in that expression, decode() is eager, isn't it?

12:52 cgrand: I think it works by chunks, let me check

13:00 er.. You're right it's eager... if you want to process the input lazily you'll have to split it into multiple bytebuffers and write a charsequence proxy which delegates to the subsequences etc. :-<

13:02 the good news is that CharsetDecoder is stateful and hence should work even if you split the input in the middle of a multibyte character... pfff... no pain no gain I guess
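
Chunked decoding across a split multibyte character might look like the Java sketch below. It uses the three-argument CharsetDecoder.decode and keeps any undecoded tail bytes in the input buffer with compact(), so they are retried together with the next chunk. The class name, method name, and chunking scheme are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for this sketch.
final class ChunkedDecode {
    // Decodes UTF-8 bytes chunkSize bytes at a time, even when a chunk
    // boundary falls in the middle of a multibyte character.
    static String decodeInChunks(byte[] utf8, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        if (utf8.length == 0) {
            return "";
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // Extra room for up to 3 bytes carried over from the previous chunk.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 4);
        CharBuffer out = CharBuffer.allocate(chunkSize + 4);
        StringBuilder result = new StringBuilder();
        int offset = 0;
        while (offset < utf8.length) {
            int n = Math.min(chunkSize, utf8.length - offset);
            in.put(utf8, offset, n);
            offset += n;
            in.flip();
            boolean last = offset == utf8.length;
            CoderResult r = decoder.decode(in, out, last);
            if (r.isError()) {
                throw new IllegalArgumentException("malformed input: " + r);
            }
            in.compact(); // keep bytes of an incomplete character for next round
            out.flip();
            result.append(out);
            out.clear();
        }
        decoder.flush(out);
        out.flip();
        result.append(out);
        return result.toString();
    }
}
```

A lazy reader would hand out each decoded chunk on demand instead of appending to a StringBuilder; the carry-over logic would stay the same.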

13:03 Chouser: I just realized that charAt in my proxy isn't right for multibyte encodings anyway.

13:04 For the same reason you already pointed out for length.

13:04 rhickey: asCharBuffer doesn't do the right things?

13:04 Chouser: I can't figure out how to tell asCharBuffer to use a specific encoding, and its default appears to be incorrect for an ASCII file.

13:08 it's weird to me that the docs for CharBuffer make no mention of encoding at all.

13:08 cgrand: just looked at the source for asCharBuffer: it's not pretty: UTF-16 is hardcoded (so to bypass all the encoding stuff)
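
The effect is easy to reproduce in a small Java sketch: asCharBuffer pairs up bytes into 16-bit chars using the buffer's byte order, while a charset decoder honors the actual encoding. The class and method names below are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for this sketch.
final class AsCharBufferDemo {
    // Views ASCII-encoded bytes through asCharBuffer(), which treats every
    // two bytes as one char (big-endian by default) and ignores encodings.
    static String viaAsCharBuffer(String asciiText) {
        ByteBuffer bytes = ByteBuffer.wrap(asciiText.getBytes(StandardCharsets.US_ASCII));
        return bytes.asCharBuffer().toString();
    }

    // Decodes the same bytes with an explicit charset instead.
    static String viaDecoder(String asciiText) {
        ByteBuffer bytes = ByteBuffer.wrap(asciiText.getBytes(StandardCharsets.US_ASCII));
        return StandardCharsets.US_ASCII.decode(bytes).toString();
    }
}
```

With the input "hi" (bytes 0x68 0x69), viaAsCharBuffer yields a single char with value 0x6869, whereas viaDecoder returns "hi".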

13:09 Chouser: ok, that's the impression I was starting to get from the docs.

13:10 cgrand: The only way to go from a ByteBuffer to a CharBuffer with a specific encoding is through CharsetDecoder...

13:11 Chouser: there's really a fundamental problem for lazy file reading here. For fixed-width encodings, it's easy (UTF-16 for asCharBuffer, ASCII for my proxy)

13:12 rhickey: variable byte chars stink and always have

13:12 Chouser: for variable-width (like UTF-8), what do you do when someone asks for charAt 55? To do it correctly, you must scan to that point.

13:13 I bet re-seq is scanning the input in order anyway, though, so there may still be speed improvements over slurp available here.

13:15 cgrand: true random access in strings is not that common (you always scan in one sense or the other)

13:16 (interesting post on that stuff http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html)

13:16 (string representation, not java.nio)

13:27 Chouser: nice post, thanks.

13:37 rhickey: it's a shame you can't piggyback on CharsetDecoder lazily, but it's hardwired to decode to CharBuffer, which is not an interface, but a large abstract class I imagine decode uses very little of. The java.io guys really need to learn about interfaces from the java.util guys

13:39 Chouser: I imagine it won't be too terribly hard to decode chunks of an input ByteBuffer into CharBuffers lazily, and make available as a CharSequence.

13:39 Some clever use of lazy-cons may even make it somewhat attractive.

13:41 rhickey: relying on the consuming code not calling charAt very far ahead, or length?

13:41 Chouser: ooh, I hadn't thought of length.

13:41 But yes, O(n) access for charAt.

13:41 cgrand: or very far behind unless you retain the head...

13:42 Chouser: I guess it would have to be O(n) for length too.

13:42 rhickey: but charAt is all you've got in CharSequence

13:42 it's not much of a sequence

13:43 Chouser: bah, re-seq uses the results of length.

13:44 rhickey: I think all of the coolness of FileChannel.map disappears for variable-byte char files

13:44 Chouser: Hmph. mmap is still generally more efficient than buffered reads.

13:45 But that doesn't mean you're wrong.

13:45 cgrand: what about doing a first decoding pass (by chunks) to remember some (char-offset, byte-offset) pairs (eg every 1k bytes) and then using this info to make charAt O(1)?
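
One way to picture the checkpoint idea is the Java sketch below. It makes several assumptions beyond what was said in the channel: the input is valid UTF-8 starting at buffer index 0, positions are counted in code points rather than Java chars (which sidesteps surrogate pairs), and checkpoints are taken every `stride` code points rather than every 1k bytes. Lookup cost is then bounded by the stride instead of the file size. All names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical class: code-point index over UTF-8 bytes with checkpoints.
final class Utf8Index {
    private final ByteBuffer bytes;
    private final int stride;        // code points between checkpoints
    private final int[] checkpoints; // byte offset of code point k * stride
    private final int count;         // total number of code points

    // First pass: scan all bytes once, recording a byte offset every
    // `stride` code points. Assumes valid UTF-8.
    Utf8Index(ByteBuffer utf8, int stride) {
        if (stride < 1) {
            throw new IllegalArgumentException("stride must be positive");
        }
        this.bytes = utf8;
        this.stride = stride;
        int[] marks = new int[16];
        int seen = 0;
        for (int pos = 0; pos < utf8.limit(); pos++) {
            if (isContinuation(utf8.get(pos))) {
                continue; // only lead bytes start a new code point
            }
            if (seen % stride == 0) {
                int slot = seen / stride;
                if (slot == marks.length) {
                    marks = Arrays.copyOf(marks, slot * 2);
                }
                marks[slot] = pos;
            }
            seen++;
        }
        this.checkpoints = marks;
        this.count = seen;
    }

    int length() {
        return count;
    }

    // Jumps to the nearest checkpoint at or before index, then scans
    // forward over at most stride - 1 code points.
    int codePointAt(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int pos = checkpoints[index / stride];
        for (int skip = index % stride; skip > 0; skip--) {
            pos++;
            while (isContinuation(bytes.get(pos))) {
                pos++;
            }
        }
        return decodeAt(pos);
    }

    private static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Decodes the code point whose lead byte is at pos.
    private int decodeAt(int pos) {
        int b0 = bytes.get(pos) & 0xFF;
        if (b0 < 0x80) {
            return b0;
        }
        int extra = b0 >= 0xF0 ? 3 : b0 >= 0xE0 ? 2 : 1;
        int cp = b0 & (0x3F >> extra);
        for (int i = 1; i <= extra; i++) {
            cp = (cp << 6) | (bytes.get(pos + i) & 0x3F);
        }
        return cp;
    }
}
```

The index costs one int per stride code points, so a larger stride trades memory for a longer forward scan on each lookup.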

13:45 rhickey: is 2 passes still more efficient than buffered reads?

13:46 Chouser: What about re-writing Java regex so it doesn't need length?

13:46 rhickey: there's an amazing lack of connection between java.io and java.nio

13:47 cgrand: chouser: or better composing the encoding and the regex to get a regex on bytes... :-)

13:47 rhickey: can't strap readers onto channels?

13:48 Chouser: rhickey: what would that buy you?

13:48 rhickey: they must have the decoder logic built in

13:51 Chouser: io.InputStreamReader seems to, yes.

13:55 rhickey: could extend InputStream for ByteBuffer, to test your mmap vs buffered io theory

13:56 Chouser: yep, halfway done. ;-)

13:56 can I proxy methods with the same name and different arities?

13:56 rhickey: one function handles all arities, just use the normal Clojure arity overloading

13:58 Chouser: ok

14:01 rhickey: not sure if this is useful: http://www.exampledepot.com/egs/java.nio/Buffer2Stream.html

14:09 Chouser: yep, I didn't realize I could provide such a small subset of InputStream methods.
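
A small InputStream over a ByteBuffer, in the spirit of the linked example, could be sketched in Java like this. Only read() and the three-argument read are overridden, plus available(); the class name is hypothetical and the code is an illustration, not the linked page's source.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical class: exposes a ByteBuffer (for instance a mapped file)
// as an InputStream so java.io readers can be layered on top of it.
final class ByteBufferInputStream extends InputStream {
    private final ByteBuffer buf;

    ByteBufferInputStream(ByteBuffer buf) {
        // duplicate() shares the content but keeps an independent position,
        // so reading the stream does not disturb the caller's buffer.
        this.buf = buf.duplicate();
    }

    // Returns the next byte as 0..255, or -1 at end of buffer.
    @Override public int read() {
        return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
    }

    // Bulk read: copies up to len bytes, or returns -1 at end of buffer.
    @Override public int read(byte[] dst, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (!buf.hasRemaining()) {
            return -1;
        }
        int n = Math.min(len, buf.remaining());
        buf.get(dst, off, n);
        return n;
    }

    @Override public int available() {
        return buf.remaining();
    }
}
```

Wrapping such a stream in an InputStreamReader with an explicit charset gives the reader-side decoding logic mentioned above, at the cost of going back through buffered, sequential reads.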

14:42 http://n01se.net/paste/8Oq -- cgrand's suggestion

14:42 It's not lazy, but it's apparently pretty efficient.

16:20 Issues with mmap in Java: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4724038

16:24 rhickey: ouch

16:25 Chouser: but that's only on Windows. Does anyone still use that?

16:26 rhickey: is it only Windows?

16:27 or just where reported?

16:29 Chouser: I'm not sure, but I think the reason it's a problem to keep the file mapped is that Windows locks access to it while open.

16:30 rhickey: I guess I shouldn't change slurp just yet :)

16:31 Chouser: Linux would probably allow you to unlink the file, for example, and keep the blocks on disk until the map is GC'ed, while you can go ahead and reuse the filename.

20:34 if I have a list of pairs [[1 2] [3 4] ...], it's very natural to consume them lazily using (for [[a b] lst] ...)

20:35 if instead I have a list that I want to consume as pairs [1 2 3 4 ...], I can't think of any natural way to consume them lazily.

20:36 (for [[a b] (apply array-map lst)] ...) ; convenient, but eager

20:39 rhickey: (map vector (take-nth 2 x) (take-nth 2 (rest x)))

20:41 Chouser: hm. better than the recursive lazy-cons thing I was growing...

20:42 I don't suppose that's something that could be shimmed into destructuring?

20:42 rhickey: I wanted to write a take-ns that would do that...

20:43 thinking about destructuring now...

20:44 Chouser: I can't even think of how the syntax would work for destructuring, let alone how to implement it.

20:44 rhickey: it's not really a good fit for destructuring

20:44 Chouser: generally destructuring takes what it wants and throws away the rest.

20:44 rhickey: right

20:44 Chouser: take-ns sounds nice, though. And you already wrote it. ;-)

20:46 rhickey: (defn take-ns [n xs]

20:46 (when (seq xs)

20:46 (lazy-cons (take n xs) (take-ns n (drop n xs)))))
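
For comparison on the Java side, the same lazy grouping can be sketched as an Iterator that builds each group only when it is requested. This is a loose analogue of take-ns rather than a translation (it yields lists instead of lazy seqs), and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical class: lazy grouping of an iterator into chunks of n.
final class TakeNs {
    // Returns an iterator of lists holding up to n items each. Items are
    // pulled from source only when a group is requested, so an unbounded
    // source is fine as long as only finitely many groups are consumed.
    static <T> Iterator<List<T>> takeNs(final int n, final Iterator<T> source) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        return new Iterator<List<T>>() {
            @Override public boolean hasNext() {
                return source.hasNext();
            }

            @Override public List<T> next() {
                if (!source.hasNext()) {
                    throw new NoSuchElementException();
                }
                List<T> group = new ArrayList<>(n);
                while (group.size() < n && source.hasNext()) {
                    group.add(source.next());
                }
                return group;
            }
        };
    }
}
```

As with the Clojure version, a final group may be shorter than n when the input length is not a multiple of n.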

20:58 Chouser: beautiful

20:59 * Chouser uses it.

21:03 jonathan_: ncie

21:03 nice

Logging service provided by n01se.net