curl-library
Re: Function to return the body of a page?
Date: Thu, 14 Mar 2002 09:35:03 -0600 (CST)
FWIW, we have written a crawler that uses cURL and the write memory
callback functions to retrieve both the contents and the header
information. We have added a bandwidth limiting function to the
callback function to limit bandwidth usage on the crawls. I think
our code might be a good example for you of how to retrieve data
with the cURL library and push it around once you have it in
memory.
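To give a rough idea of what a throttled write callback can look like, here is a minimal sketch. It is not the grub code: the ThrottledBuffer struct, the throttle_delay() helper and the whole-second granularity are assumptions made for illustration. The callback has the signature libcurl expects for CURLOPT_WRITEFUNCTION, and it simply sleeps when the transfer is running ahead of the cap, which stalls the download.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>   /* sleep(), POSIX */

/* Hypothetical buffer-plus-throttle state; not taken from the grub source. */
struct ThrottledBuffer {
    char  *memory;        /* downloaded bytes, kept NUL-terminated */
    size_t size;          /* bytes stored so far */
    size_t max_bytes_sec; /* bandwidth cap; 0 means unlimited */
    time_t started;       /* time of the first callback, 0 until then */
};

/* Seconds to wait so that `total` bytes received over `elapsed`
   seconds stays at or under the cap. */
long throttle_delay(size_t total, size_t max_bytes_sec, long elapsed)
{
    long needed;

    if (max_bytes_sec == 0)
        return 0;
    needed = (long)(total / max_bytes_sec); /* minimum seconds for this much data */
    return needed > elapsed ? needed - elapsed : 0;
}

/* Append the received data to the buffer, then sleep if we are too fast. */
size_t throttled_write(void *ptr, size_t size, size_t nmemb, void *data)
{
    size_t realsize = size * nmemb;
    struct ThrottledBuffer *buf = (struct ThrottledBuffer *)data;
    char *grown = realloc(buf->memory, buf->size + realsize + 1);
    long delay;

    if (grown == NULL)
        return 0; /* a short count makes libcurl abort the transfer */
    buf->memory = grown;
    memcpy(buf->memory + buf->size, ptr, realsize);
    buf->size += realsize;
    buf->memory[buf->size] = '\0';

    if (buf->started == 0)
        buf->started = time(NULL);
    delay = throttle_delay(buf->size, buf->max_bytes_sec,
                           (long)(time(NULL) - buf->started));
    if (delay > 0)
        sleep((unsigned int)delay);
    return realsize;
}
```

You would register throttled_write with CURLOPT_WRITEFUNCTION and pass a pointer to the struct as the callback's data argument, the same way the quoted code below passes its chunk.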
The crawler is in a class, and with a couple of modifications, you
should be able to hack it up to do just about anything with the data
once it is crawled. There is a test program in the /crawler directory
that will allow you to test the crawler class without actually running
the client. The source is located on SourceForge, and I'd be happy to
help you or anyone else in your efforts to utilize the code.
http://sourceforge.net/projects/grub/
If you don't use CVS, or need a recent tarball, let me know.
Just to be clear, our crawler utilizes curl in a multi-threaded process,
starting up separate threads for each crawler, and manages the input/
output of URLs and data to and from the MetaKit library, which stores
the URLs and data in separate files on the hard drive, much like a
database. In order to compile the client you need the latest MetaKit
release and the 7.9.4 or 7.9.5 release of cURL!
Later,
Kord
On Wed, 13 Mar 2002, VanL wrote:
> Hello,
>
> You said:
>
>
> >
> >You should look at getinmemory.c instead. That example does exactly
> >what you describe.
> >
> Like this?
>
> #include <stdio.h>
> #include <stdlib.h>   /* realloc */
> #include <string.h>   /* memcpy */
> #include <curl/curl.h>
> #include <curl/types.h>
> #include <curl/easy.h>
>
> struct MemoryStruct {
>   char *memory;
>   size_t size;
> };
>
> size_t
> WriteMemoryCallback(void *ptr, size_t size, size_t nmemb, void *data)
> {
>   size_t realsize = size * nmemb;
>   struct MemoryStruct *mem = (struct MemoryStruct *)data;
>   /* grow via a temporary so the old buffer is not lost if realloc fails */
>   char *grown = (char *)realloc(mem->memory, mem->size + realsize + 1);
>
>   if (grown == NULL)
>     return 0; /* a short count tells libcurl to abort the transfer */
>   mem->memory = grown;
>   memcpy(&(mem->memory[mem->size]), ptr, realsize);
>   mem->size += realsize;
>   mem->memory[mem->size] = 0; /* keep the buffer NUL-terminated */
>   return realsize;
> }
>
>
> struct MemoryStruct get_data(const char *URL, CURL *curl_handle)
> {
>   struct MemoryStruct chunk;
>
>   chunk.memory = NULL; /* we expect realloc(NULL, size) to work */
>   chunk.size = 0;      /* no data at this point */
>
>   /* specify URL to get */
>   curl_easy_setopt(curl_handle, CURLOPT_URL, URL);
>
>   /* send all data to this function */
>   curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION,
>                    WriteMemoryCallback);
>
>   /* we pass our 'chunk' struct to the callback function */
>   curl_easy_setopt(curl_handle, CURLOPT_FILE, (void *)&chunk);
>
>   /* get it! */
>   curl_easy_perform(curl_handle);
>
>   /* the caller owns chunk.memory and must free() it */
>   return chunk;
> }
>
>
> int main(int argc, char **argv)
> {
>   CURL *curl_handle;
>
>   /* init the curl session */
>   curl_handle = curl_easy_init();
>
>   /* process_data() stands for your own routine; it is not defined here */
>   process_data(get_data("http://127.0.0.1:50002", curl_handle));
>
>   /* cleanup curl stuff */
>   curl_easy_cleanup(curl_handle);
>
>   return 0;
> }
--
--------------------------------------------------------------
Kord Campbell           Grub.Org Inc.
President               6051 N. Brookline #118
                        Oklahoma City, OK 73112
kord_at_grub.org        Voice: (405) 843-6336
http://www.grub.org     Fax: (405) 848-5477
--------------------------------------------------------------
Received on 2002-03-14