curl-library
Cache in curl... (fwd)
From: Daniel Stenberg <daniel_at_haxx.se>
Date: Mon, 16 Sep 2002 16:57:07 +0200 (MET DST)
This mail from Lorenzo Pastrana is forwarded to the list with permission for
the entire crowd to read and digest. Feel free to comment and elaborate.
-- Daniel Stenberg -- curl related mails on curl related mailing lists please

---------- Forwarded message ----------
Date: Sat, 14 Sep 2002 15:53:41 +0200
From: Lorenzo Pastrana <pastrana_at_ultraflat.net>

I didn't edit the discussion history much, so feel free to forward this
(edited?) response to the list, in case you think it's a convenient starting
point for the "what cache do you want in curl" debate ;p or whatever.

Quick poll:

[ ] None, thanks.
[X] Strictly validated
[ ] Full-blown HTTP 1.1 fuzzy logic
[ ] Don't care...

Read the argument at the bottom...

> > What I have for the moment is a working C++ wrapper around curl. It
> > currently supports the very basic operations (HTTP GET/PUT/POST, FTP
> > UL/DL; no cookies or SSL for the moment, but this won't last) and does
> > not reflect (I'd even say hides) all of curl's bells & whistles (KISS!
> > model :).
> >
> > The interface is: give a URL and get the data back (BTW, if the cached
> > file is OK, don't bother downloading). Period.
>
> Sounds pretty straight-forward. :)
>
> > > I don't think it needs to be added to the curl package but can be
> > > a separate package without much loss. Or what do you think?
> >
> > The fact is, I realize that the cache management is about 5% of the
> > code; the rest is there to give the client the simplest interface
> > possible. So I was thinking that since the simplest local cache is just
> > a matter of saving the incoming data to a given location (say
> > CURLOPT_CACHEFOLDER) and checking it on later requests for
> > up-to-dateness, it could be done in a completely transparent manner.
> > This would mainly avoid data replication for keeping track of active
> > tasks, handles time and so on..
>
> Well, I could be convinced that having the cache functionality is a good
> thing,

Can I read that as you not disliking the idea?

> but then you need to take this to the mailing list and get some
> responses from other libcurl users.
> These days I try to not add things
> to the library without getting a feel for what people actually want.

Yeah, I guess that's quite a fair position. Anyway, I'm not trying to
'convince' you to put a cache mechanism in curl; I'm just saying that the
only reason I see for placing it there is that it will surely reduce the
memory footprint of such a feature.

> Also, libcurl is plain C. There will be no C++ within libcurl so you might
> need to reconsider things if this is meant for inclusion and you have the
> code in C++...

Of course I didn't mean for this code to be crudely plugged into curl.. ;)
But again, what I have is -mainly- a wrapper and, as I said, the cache code
portion is rather small, so plain C is fine...

> Caching is clearly very detailed in RFC2616. How to do it, which
> headers to use and how to behave in all situations. I'm not sure
> CURLOPT_TIMECONDITION is enough. I just never studied the cache
> sections very closely so I can't tell you any more without further
> studying.

Well, I've been looking into RFC2616 these last few days, and what is meant
there by 'cache' is mainly web infrastructure machinery (proxy caching,
intermediate caches, etc.), which is far beyond my concern. What I mean by
cache is a very restricted portion of that mechanism, as done by most web
browsers: a simple client facility. Of course, delivering correct content to
the client is still a MUST, and the RFC covers all of that.

For the purpose of correctness, from what I read in the RFC, TIMECONDITION is
a perfectly legal cache validator (a 'weak' one though, if we can call it
that, due to the 1 second resolution of time data :). ETag management
(through the If-Match / If-None-Match request headers) is also required to be
correct in HTTP 1.1 (this one is a strong validation); I'm working on it too.
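As an illustration of the two validators discussed above, here is a minimal
sketch in plain C of how a client-side cache might format the conditional
request headers from what it stored alongside a cached file. The function
names are hypothetical and not part of libcurl. The resulting strings could be
handed to libcurl through curl_slist_append() and CURLOPT_HTTPHEADER; for the
time-based case, libcurl's existing CURLOPT_TIMECONDITION and
CURLOPT_TIMEVALUE options are an alternative.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: format the cached file's modification time as an
 * "If-Modified-Since" header (the weak, 1-second-resolution validator).
 * Returns 0 on success, -1 if the time is invalid or the buffer too small.
 * Assumes the default "C" locale so day and month names are in English. */
int format_if_modified_since(time_t mtime, char *buf, size_t len)
{
    struct tm *utc = gmtime(&mtime);   /* HTTP dates are always GMT */
    if (utc == NULL)
        return -1;
    if (strftime(buf, len,
                 "If-Modified-Since: %a, %d %b %Y %H:%M:%S GMT", utc) == 0)
        return -1;
    return 0;
}

/* Hypothetical helper: format a stored ETag (kept exactly as the server
 * sent it, quotes included) as an "If-None-Match" header (the strong
 * validator). Returns 0 on success, -1 if the buffer is too small. */
int format_if_none_match(const char *etag, char *buf, size_t len)
{
    int n = snprintf(buf, len, "If-None-Match: %s", etag);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

If the server answers such a request with 304, the body is not transferred
and the cached copy can be delivered instead.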
But it seems to me that it will be OK with those two:

<cache behaviour proposal>

Of course, supporting the full-blown cache-related bells & whistles included
in HTTP 1.1 is necessary for completely optimal caching (which is meant to
prevent systematic validation requests to the origin server). But here I
believe that if I can just avoid the download of the body part I'll be happy,
and best of all: quite sure that what I get is valid content.

All other methods (based on expiration) imply assumptions and heuristics
that can lead to very annoying situations, especially when you're a
developer ;) Don't you ask your browser to validate each request? I guess
this is the same reason why I see the "Pragma: no-cache" request header
(which produces an end-to-end reload) by default in curl.

This being a matter of taste, I think we can consider adding some "Expires:"
response header management as an option. But delivering a non-validated cache
file would preferably be some sort of fall-back solution in case of off-line
operation or network error...

</cache behaviour proposal>

OK, I think I'll end on this.. you'll tell me how you feel about it. What I
can say is that I'm OK to get my hands on this stuff as long as the debate
doesn't end up in 'we want a full HTTP 1.1 cache mechanism!', and I don't
have to learn 'all' curl internals to do it, which I can't afford: I still
haven't got to check curl's source and I'll surely need some guidance...

I see some points in the code where the feature would need to be plugged in:

1) The place where you begin to process a handle, before you assign a
   connection to it: verify (optionally) whether the cache file would be
   fresh enough for immediate delivery in case an expiry date was set by the
   server, or (better) get the cache file's time and ETag string, if any, to
   forge an appropriate conditional request.
2) The place where you call the header callback, in order to track "no-cache"
   directives (to avoid caching) and "Expires: " and "ETag: " headers for
   storing in the cache file header.

3) The place where you call the write callback, for saving the cache file if
   not prevented by a "no-cache" directive.

4) The place where you get the final HTTP response code, to deliver the cache
   file in case of a 304 (Not Modified).

.... okok this time I'll stop... ;)

curl_easy_setopt((CURL*)me,CURLOPT_VERBOSE,false);

Cheers.
Lo.

Received on 2002-09-16
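The header tracking, store decision and 304 handling listed as points 2 to 4
in the message could look roughly like the following plain C sketch. Every
name here (struct cache_meta, cache_track_header, cache_decide) is
hypothetical and nothing in it is libcurl API; it only assumes that something
feeds it one response header line at a time, as a header callback would, and
later the final response code. Following the message, a "no-cache" directive
is treated as "do not store".

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical per-transfer cache bookkeeping. Zero it before use. */
struct cache_meta {
    char etag[128];     /* value of the ETag: header, quotes included */
    char expires[64];   /* value of the Expires: header, unparsed */
    int  no_cache;      /* set when a no-cache directive was seen */
};

enum cache_action { CACHE_SERVE_CACHED, CACHE_STORE_BODY, CACHE_PASS_THROUGH };

/* Case-insensitive check that `line` begins with `name`. */
static int has_prefix(const char *line, const char *name)
{
    for (; *name; line++, name++)
        if (tolower((unsigned char)*line) != tolower((unsigned char)*name))
            return 0;
    return 1;
}

/* Copy a header value, skipping leading blanks and dropping the CRLF. */
static void copy_value(char *dst, size_t len, const char *src)
{
    size_t n;
    while (*src == ' ' || *src == '\t')
        src++;
    n = strcspn(src, "\r\n");
    if (n >= len)
        n = len - 1;            /* truncate rather than overflow */
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* Point 2: inspect one response header line and record what matters.
 * The directive match is case-sensitive, which is a simplification. */
void cache_track_header(struct cache_meta *m, const char *line)
{
    if (has_prefix(line, "ETag:"))
        copy_value(m->etag, sizeof m->etag, line + 5);
    else if (has_prefix(line, "Expires:"))
        copy_value(m->expires, sizeof m->expires, line + 8);
    else if ((has_prefix(line, "Cache-Control:") ||
              has_prefix(line, "Pragma:")) && strstr(line, "no-cache"))
        m->no_cache = 1;
}

/* Points 3 and 4: decide what to do once the response code is known. */
enum cache_action cache_decide(long http_code, const struct cache_meta *m)
{
    if (http_code == 304)
        return CACHE_SERVE_CACHED;   /* validated: deliver the cache file */
    if (http_code == 200 && !m->no_cache)
        return CACHE_STORE_BODY;     /* fresh body: save it while writing */
    return CACHE_PASS_THROUGH;       /* anything else bypasses the cache */
}
```

Keeping the decision in one small function means the write path only has to
ask a single question before saving data, which fits the "strictly validated"
option from the poll: nothing is served from the cache without a 304.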