curl-library
Re: Save as text, lynx -dump
Date: Wed, 7 May 2003 23:56:03 -0700 (PDT)
> I want to save a webpage as text using an API,
> rather than using a system call to "lynx".
You could probably make a cleaner (and lighter) system call
using "html2text -nobs":
http://userpage.fu-berlin.de/~mbayer/tools/html2text.html
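For what it's worth, here is a rough sketch of how that call might look from C without going through the shell. It assumes a POSIX system and that html2text is on the PATH and accepts a file name as its last argument. The function names run_capture and html_file_to_text are made up for illustration and are not part of any library.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with its arguments and copy the
 * child's stdout into out (always NUL-terminated). Output beyond
 * outsz - 1 bytes is read and discarded. Returns the number of bytes
 * stored, or -1 if the command could not be run or exited non-zero. */
long run_capture(char *const argv[], char *out, size_t outsz)
{
    int fds[2];
    char chunk[4096];
    size_t len = 0;
    ssize_t n;
    int status;
    pid_t pid;

    if (outsz == 0 || pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: send stdout into the pipe, then exec the command. */
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    /* Parent: drain the pipe, keeping only what fits in out. */
    close(fds[1]);
    while ((n = read(fds[0], chunk, sizeof chunk)) > 0) {
        size_t room = outsz - 1 - len;
        size_t take = (size_t)n < room ? (size_t)n : room;
        memcpy(out + len, chunk, take);
        len += take;
    }
    close(fds[0]);
    out[len] = '\0';

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return (long)len;
}

/* Assumption: "html2text -nobs FILE" writes the plain text to stdout. */
long html_file_to_text(const char *path, char *out, size_t outsz)
{
    char *const argv[] = { "html2text", "-nobs", (char *)path, NULL };
    return run_capture(argv, out, outsz);
}
```

You would first save the page to a file (with libcurl, for example) and then pass that file name to html_file_to_text. Using fork/execvp instead of system() means the file name never has to be quoted for a shell.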
As far as APIs go, there are a few around,
but they all have some drawbacks...
libxml has an HTML parser, but it is not very forgiving.
The gtkhtml widget has a class called "html_tokenizer" that
is reasonably simple, but the dependencies and overhead
are probably much heavier than making a system call.
el-kabong is a very lightweight parser, but it is a bit weak:
http://ekhtml.sourceforge.net/
The most robust parser I have tried is libtidy, but you can
expect to spend some time figuring out the interface:
http://tidy.sourceforge.net/
Maybe somebody else has a better idea?
- Jeff
Received on 2003-05-08