curl-library

Re: Save as text, lynx -dump

From: Jeff Pohlmeyer <yetanothergeek_at_yahoo.com>
Date: Wed, 7 May 2003 23:56:03 -0700 (PDT)

> I want to save a webpage as text using an API,
> rather than using a system call to "lynx".

You could probably make a cleaner (and lighter) system call
using html2text -nobs:
  http://userpage.fu-berlin.de/~mbayer/tools/html2text.html

As for APIs, there are a few around,
but they all have some drawbacks...

libxml has an HTML parser, but it is not very forgiving of malformed markup.

The gtkhtml widget has a class called "html_tokenizer" that
is reasonably simple, but the dependencies and overhead
are probably much heavier than making a system call.

el-kabong is a very lightweight parser, but it is a bit weak:
  http://ekhtml.sourceforge.net/

The most robust parser I have tried is libtidy, but you can
expect to spend some time figuring out the interface:
  http://tidy.sourceforge.net/
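
To give a rough idea of what the tokenizer-style approach involves, here is
a minimal sketch. It does not use any of the libraries above: it assumes
Python's standard html.parser module as a stand-in, and the names TextDumper
and html_to_text are made up for illustration. It collects text, skips
script and style content, and breaks lines at a few block-level tags. Real
pages will need more care (tables, links, character sets).

```python
from html.parser import HTMLParser


class TextDumper(HTMLParser):
    """Hypothetical helper: collect visible text from HTML events."""

    SKIP = {"script", "style"}  # content of these tags is dropped
    BLOCK = {"p", "br", "div", "li", "tr",
             "h1", "h2", "h3", "h4", "h5", "h6"}  # tags that break lines

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string, one non-empty line per block."""
    parser = TextDumper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    lines = (" ".join(line.split())
             for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Calling html_to_text() on the body fetched by curl gives a plain string you
can write to a file. The event-driven design means unclosed tags do not
stop the parse, which is the main thing you want from a "forgiving" parser.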

Maybe somebody else has a better idea?

 - Jeff

Received on 2003-05-08