curl-library

Cache in curl... (fwd)

From: Daniel Stenberg <daniel_at_haxx.se>
Date: Mon, 16 Sep 2002 16:57:07 +0200 (MET DST)

This mail from Lorenzo Pastrana is forwarded to the list with permission for
the entire crowd to read and digest. Feel free to comment and elaborate.

-- 
 Daniel Stenberg -- curl related mails on curl related mailing lists please
---------- Forwarded message ----------
Date: Sat, 14 Sep 2002 15:53:41 +0200
From: Lorenzo Pastrana <pastrana_at_ultraflat.net>
I didn't edit the discussion history much, so feel free to forward
this (edited?) response to the list, in case you think it's a convenient
starting point for the "what cache do you want in curl" debate ;p or
a quick poll:
	[ ] None thanks.
	[X] Strictly validated
	[ ] Full blown HTTP1.1 fuzzy logic
	[ ] Don't care...
Read the argument at the bottom...
> > What I have for the moment is a working C++ wrapper around curl. It is by
> > now supporting the very basic operations (HTTP GET/PUT/POST - FTP UL/DL,
> > no cookies nor SSL for the moment but this won't last) and does not reflect
> > (I'd even say hides) all curl's bells & whistles (KISS! model :).
> >
> > The interface is : Give An Url and Get The Data back (BTW if the cached
> > file is ok, don't bother downloading). period
>
> Sounds pretty straight-forward.
:)
> > > I don't think it needs to be added to the curl package but can be
> > > a separate package without much loss. Or what do you think?
> >
> > The fact is I realize that the cache management is about 5% of the code,
> > the rest is there to give the client the simplest interface possible. So I
> > was thinking that since the simplest local-cache is just a matter of saving
> > the incoming data to a given location (say CURLOPT_CACHEFOLDER) and
> > checking it back on further request for up-to-dateness, it could be done
> > in a completely transparent manner, this would mainly avoid data
> > replication for keeping track of active tasks, handles time and so on..
>
> Well, I could be convinced that having the cache functionality is a good
Can I read this as: you don't dislike the idea?
> thing, but then you need to take this to the mailing list and get some
> responses from other libcurl users. These days I try to not add things
> to the library without getting a feel for what people actually want.
Yeah, I guess that's quite a fair position..
Anyway I'm not trying to 'convince' you to put a cache mechanism in curl, I
just say that the only reason I see for placing it there is that it will
surely reduce the memory footprint of such a feature.
> Also, libcurl is plain C. There will be no C++ within libcurl so you might
> need to reconsider things if this is meant for inclusion and you have the
> code in C++...
Of course I didn't mean for this code to be crudely plugged into curl.. ;)
But again, what I have is -mainly- a wrapper and as I said the cache code
portion is rather small, so plain C is fine...
> Caching is clearly very detailed in RFC2616. How to do it, which
> headers to use and how to behave in all situations. I'm not sure
> CURLOPT_TIMECONDITION is enough. I just never studied the cache
> sections very closely so I can't tell you any more without further
> studying.
Well, I've been looking into RFC2616 these last few days, and what is meant
there by 'cache' is mainly web-infrastructure machinery (proxy caching,
intermediate caches etc..) which is far beyond what I think is
my concern.
Actually what I mean by cache is a very restricted portion of that mechanism
as done by most web browsers -> a simple client facility. Of course,
delivering correct content to the client is still a MUST and the RFC covers
all of that.
For the purpose of correctness, from what I read in the RFC, TIMECONDITION is a
perfectly legal cache validator (a 'weak' one though, if we can call it so,
due to the 1 second resolution of time data :).
ETag management (through the If-Match / If-None-Match request headers) is also
required to be correct in HTTP1.1 (this one is a strong validation), I'm
working on it too. But it seems to me that those two will be enough.
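As a rough illustration of the two validators discussed here, the sketch below formats the conditional request headers from the cached file's modification time and its stored ETag. The helper names are hypothetical and not part of libcurl; the time-based condition could equally be requested through the existing CURLOPT_TIMECONDITION and CURLOPT_TIMEVALUE options, and custom header lines can be passed with CURLOPT_HTTPHEADER.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not a libcurl function): format an
 * If-Modified-Since header line from the cache file's modification time.
 * HTTP dates are in GMT with 1-second resolution, which is why this
 * validator is only a 'weak' one. Returns the number of characters
 * written, or 0 on failure. Assumes the default "C" locale so that day
 * and month names come out in English. */
static size_t format_if_modified_since(time_t mtime, char *buf, size_t size)
{
    struct tm *gmt;

    assert(buf != NULL);
    gmt = gmtime(&mtime);
    if (gmt == NULL)
        return 0;
    return strftime(buf, size,
                    "If-Modified-Since: %a, %d %b %Y %H:%M:%S GMT", gmt);
}

/* Hypothetical helper (not a libcurl function): format an If-None-Match
 * header line from an ETag stored with the cache file. The ETag is
 * expected to be kept exactly as the server sent it, quotes included.
 * Returns the number of characters written, or 0 if there is no ETag or
 * the buffer is too small. */
static int format_if_none_match(const char *etag, char *buf, size_t size)
{
    int n;

    assert(buf != NULL);
    if (etag == NULL || etag[0] == '\0')
        return 0;
    n = snprintf(buf, size, "If-None-Match: %s", etag);
    return (n > 0 && (size_t)n < size) ? n : 0;
}
```

Sending both headers lets the server answer 304 when either validator still matches its copy, so the body download is skipped.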
<cache behaviour proposal>
Of course supporting the full-blown cache-related bells & whistles included
in HTTP1.1 is necessary for completely optimal caching (which is meant to
prevent systematic validation requests to the origin server).
But here I believe that if I can just avoid the download of the body part
I'll be happy, and best of all : quite sure that what I get is valid
content.
All other methods (based on expiration) imply assumptions and heuristics
that can lead to very annoying situations, especially when you're a
developer ;) Don't you ask your browser to validate each request? I guess
this is the same reason why I see the "Pragma: no-cache" request header (which
produces an end-to-end reload) by default in curl. This being a matter of
taste, I think we can consider adding some "Expires:" response header
management as an option.
But delivering a non-validated cache file would preferably be some sort of
fall-back solution in case of off-line operation/network-error ...
</cache behaviour proposal>
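The proposal above boils down to a small decision once a validation request has been attempted. The following is only a sketch of that policy under stated assumptions: the enum, the function name and its parameters are hypothetical, and honouring a "no-cache" directive before storing is left to the caller.

```c
#include <assert.h>

/* Hypothetical outcomes of one "strictly validated" cached request. */
enum cache_action {
    CACHE_DELIVER_VALIDATED,   /* 304: cached body confirmed by the server  */
    CACHE_STORE_NEW_BODY,      /* 200: save the fresh body and deliver it   */
    CACHE_DELIVER_UNVALIDATED, /* no answer: fall back to the cached file   */
    CACHE_PASS_THROUGH         /* anything else: hand the result on as-is   */
};

/* transfer_ok:     nonzero if an HTTP response was received at all
 * http_code:       final response code (ignored when !transfer_ok)
 * have_cache_file: nonzero if a cache file exists for this URL */
static enum cache_action decide(int transfer_ok, long http_code,
                                int have_cache_file)
{
    assert(http_code >= 0);

    /* Off-line operation or network error: the only case in which a
     * non-validated cache file is delivered. */
    if (!transfer_ok)
        return have_cache_file ? CACHE_DELIVER_UNVALIDATED
                               : CACHE_PASS_THROUGH;

    /* The server confirmed our validator, so the body was not resent. */
    if (http_code == 304 && have_cache_file)
        return CACHE_DELIVER_VALIDATED;

    /* The content changed, or was never cached: a full body arrived. */
    if (http_code == 200)
        return CACHE_STORE_NEW_BODY;

    return CACHE_PASS_THROUGH;
}
```

Expiration-based freshness is deliberately absent here, since the proposal treats "Expires:" handling as an optional extra.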
Ok I think I'll end on this.. you'll tell me how you feel about it.
What I can say is that I'm ok to get my hands on this stuff as long as the
debate doesn't end up in 'we want a full HTTP1.1 cache mechanism!', and I
don't have to learn 'all' curl internals to do it, which I can't afford: I
still haven't checked curl's source and I'll surely need some
guidance...
I see some points in the code where the feature would need to be plugged in:
1) : the place where you begin to process a handle, before you assign a
connection to it: verify (optionally) whether the cache-file is fresh enough
for immediate delivery in case an expiry date was set by the server or
(better) get the cache-file time & ETag string, if any, to forge the
appropriate conditional request.
2) : the place where you call the header callback, in order to track
"no-cache" directives to avoid caching, and "Expires: " and "ETag: " headers
for storing into the cache-file header.
3) : the place where you call the write callback, for saving the cache file if
not prevented by a "no-cache" directive.
4) : the place where you get the final HTTP response code, to deliver the
cache-file in case of a 304 (Not Modified).
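Point 2 in the list above could look roughly like this. It is a sketch, not curl code: the struct and function names are hypothetical, and the function is written so it could be called from a header callback, which receives one header line at a time as a buffer plus length that is not zero-terminated.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical per-transfer record of what the cache needs to remember. */
struct cache_meta {
    int  no_cache;     /* set once a "no-cache" directive has been seen */
    char etag[128];    /* ETag value, quotes included                   */
    char expires[64];  /* Expires value as sent by the server           */
};

/* If `line` starts with `name` (case-insensitively) followed by ':',
 * return a pointer to the text after the colon, else NULL. */
static const char *header_value(const char *line, size_t len,
                                const char *name)
{
    size_t n = strlen(name);
    size_t i;

    if (len < n + 1)
        return NULL;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)line[i]) !=
            tolower((unsigned char)name[i]))
            return NULL;
    return line[n] == ':' ? line + n + 1 : NULL;
}

/* Copy [begin, end) into dst without surrounding whitespace or CRLF,
 * truncating if needed, and zero-terminate it. */
static void copy_trimmed(char *dst, size_t dstsize,
                         const char *begin, const char *end)
{
    size_t n;

    while (begin < end && isspace((unsigned char)*begin))
        begin++;
    while (end > begin && isspace((unsigned char)end[-1]))
        end--;
    n = (size_t)(end - begin);
    if (n >= dstsize)
        n = dstsize - 1;
    memcpy(dst, begin, n);
    dst[n] = '\0';
}

/* Case-insensitive search for the token "no-cache" in [begin, end). */
static int contains_no_cache(const char *begin, const char *end)
{
    static const char token[] = "no-cache";
    const size_t n = sizeof token - 1;
    size_t i;

    for (; (size_t)(end - begin) >= n; begin++) {
        for (i = 0; i < n; i++)
            if (tolower((unsigned char)begin[i]) != token[i])
                break;
        if (i == n)
            return 1;
    }
    return 0;
}

/* Inspect one response header line and update the cache record. */
static void track_header(struct cache_meta *m, const char *line, size_t len)
{
    const char *end = line + len;
    const char *v;

    assert(m != NULL && line != NULL);
    if ((v = header_value(line, len, "ETag")) != NULL)
        copy_trimmed(m->etag, sizeof m->etag, v, end);
    else if ((v = header_value(line, len, "Expires")) != NULL)
        copy_trimmed(m->expires, sizeof m->expires, v, end);
    else if ((v = header_value(line, len, "Cache-Control")) != NULL ||
             (v = header_value(line, len, "Pragma")) != NULL) {
        if (contains_no_cache(v, end))
            m->no_cache = 1;
    }
}
```

The write callback of point 3 would then check `no_cache` before saving anything, and the stored `etag` would feed the conditional request of point 1 on the next fetch.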
.... okok this time I'll stop... ;)
curl_easy_setopt((CURL*)me,CURLOPT_VERBOSE,false);
Cheers.
Lo.
Received on 2002-09-16