Re: cURL starting questions

From: Ralph Mitchell <ralphmitchell_at_gmail.com>
Date: Sun, 19 Apr 2009 01:28:00 -0500

On Sat, Apr 18, 2009 at 11:25 PM, Jason Todd Slack-Moehrle <mailinglists_at_mailnewsrss.com> wrote:

> Hi All,
>
> I have some starting cURL questions that I am hoping to gain insight about.
>
> I want to start at Dmoz.org and follow links for entertainment (like
> concerts, art gallery events, etc) and examine the link to see if I should
> get data back about it and from it.
>
> My questions:
>
> 1. Can cURL start at a given URL and examine every link (based upon my
> criteria)?
>
> 2. If I find a link that has certain keywords that I find of interest, can
> I hit that link of interest and get information from that page?
>
> 3. How do I get the information about the link of interest and its content
> of interest into a MySQL database? (I know ColdFusion and MySQL and PHP). I
> think what I am asking is how do I get back to my database from a crawler?
>
> 4. I bought "Webbots, Spiders, and Screen Scrapers" in PHP and so far it is
> interesting, but I am wondering what the best practices are.
>
> Am I making any sense?
>
> -Jason

You should probably start here:

   http://curl.haxx.se/docs/httpscripting.html

Curl will only fetch a web page for you; it won't attempt to interpret the
page. It won't even download images or script files unless you extract the
relevant URLs from the page yourself and perform subsequent fetches.
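
A minimal sketch of that extract-then-fetch step, assuming Python and only
its standard library rather than the curl command line. The names
LinkExtractor, extract_links and fetch, and the keyword filter, are
hypothetical and chosen purely for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects (absolute_url, link_text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url, keywords=()):
    """Return links from html; if keywords are given, keep only matches."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    if not keywords:
        return parser.links
    wanted = [k.lower() for k in keywords]
    return [(url, text) for url, text in parser.links
            if any(k in (url + " " + text).lower() for k in wanted)]


def fetch(url):
    """Download one page, much as `curl URL` would; nothing is interpreted."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

A crawl would then be a loop: fetch a start page, pass the result to
extract_links with keywords of interest, and call fetch again on each URL
that comes back. A real crawler would also need to remember visited URLs
and pace its requests, which this sketch leaves out.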

Having said that, I've written some quite complicated scripts that can log in
to certain airline and travel sites to extract information from several
levels down. In fact, one script used to go all the way through booking a
flight, only stopping at the point where a credit card was required.

I used Bourne shell for all my scripts, but there are plenty of other options,
such as C, Perl, PHP, Python, etc. Once you pick the language you want to
use, that'll determine how you access the database and what other options
are available to you.
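
As an illustration of the database side, here is a small sketch, again
assuming Python. It uses the standard-library sqlite3 module purely as a
stand-in so that it runs without a server; it is not MySQL. The table
layout and the names open_store and save_page are hypothetical:

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a database with one table for scraped pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " title TEXT,"
        " body TEXT)"
    )
    return conn


def save_page(conn, url, title, body):
    """Store one page, overwriting any earlier row for the same URL."""
    # Parameter placeholders keep scraped text from being treated as SQL.
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    conn.commit()
```

With MySQL the overall shape would be similar, but the connection call
comes from whichever MySQL driver you install, and both the placeholder
syntax and the INSERT OR REPLACE statement (which is SQLite-specific)
would need adjusting for that driver and server.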

Ralph Mitchell

Received on 2009-04-19