Re: I need help getting a web page
From: Hans Henrik Bergan via curl-users <curl-users_at_lists.haxx.se>
Date: Tue, 12 Oct 2021 12:04:28 +0200
try digging this company name out of the HTML:
<span>AT&amp;T</span>
the correct translation, as a proper HTML parser will get you: AT&T
what a regex extraction will get you: AT&amp;T
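
A minimal sketch of that difference, using Python's standard `re` and `html.parser` modules (Python is not the language discussed in this thread; it is used here only for illustration). It assumes the snippet encodes the ampersand as the entity `&amp;`, and the names `TextCollector` and `parsed_text` are hypothetical helpers, not part of any library:

```python
import re
from html.parser import HTMLParser

# Assumption: the ampersand in the snippet is written as an HTML entity.
SNIPPET = "<span>AT&amp;T</span>"

# Naive regex extraction: returns the raw characters between the tags,
# so the entity is left undecoded.
regex_result = re.search(r"<span>(.*?)</span>", SNIPPET).group(1)


class TextCollector(HTMLParser):
    """Hypothetical helper that collects the text content of a document."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode character references
        # before handing text to handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parsed_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


parser_result = parsed_text(SNIPPET)

print("regex: ", regex_result)
print("parser:", parser_result)
```

The regex can of course be followed by a separate unescaping step, but that is one more HTML rule the extraction code has to remember to reimplement by hand.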
try digging the title out of this link:
<a href="foo" title="5>3"> Mathematical proof that 5 is greater than 3! </a>
a regex extraction is very likely to fail here, and extract 3">
Mathematical(...)
while a proper HTML parser will have no problem, and correctly parse out
"Mathematical proof that 5 is greater than 3!"
but it's only broken code, not life and death.
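
The link example can be sketched the same way in Python (again only for illustration, with the standard `re` and `html.parser` modules). The regex below is one typical naive pattern, not the only one someone might write, and `LinkParser` is a hypothetical helper class:

```python
import re
from html.parser import HTMLParser

LINK = '<a href="foo" title="5>3"> Mathematical proof that 5 is greater than 3! </a>'

# Naive pattern: assumes the opening tag ends at the first ">" it sees,
# which here is the ">" inside the quoted title attribute.
naive = re.search(r"<a[^>]*>(.*?)</a>", LINK).group(1)


class LinkParser(HTMLParser):
    """Hypothetical helper: records the title attribute and text of <a> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_link = False
        self.title = None
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            # attrs is a list of (name, value) pairs with quotes removed.
            self.title = dict(attrs).get("title")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.text.append(data)


parser = LinkParser()
parser.feed(LINK)
parser.close()
link_text = "".join(parser.text).strip()

print("regex:      ", repr(naive))
print("link text:  ", repr(link_text))
print("title attr: ", repr(parser.title))
```

A more careful regex can handle quoted attribute values, but each such fix is a step toward writing an HTML tokenizer by hand.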
On Tue, 12 Oct 2021 at 11:16, ToddAndMargo via curl-users <
curl-users_at_lists.haxx.se> wrote:
> On 10/12/21 00:02, Hans Henrik Bergan via curl-users wrote:
> > https://stackoverflow.com/a/1732454/1067003
>
> "You can't parse [X]HTML with regex. Because HTML
> can't be parsed by regex. Regex is not a tool that
> can be used to correctly parse HTML"
>
> Just watch me! I dig things out of html code all
> the time. Probably not "parsing" though. Raku's
> regex eats html alive!
>
> --
> Unsubscribe: https://lists.haxx.se/listinfo/curl-users
> Etiquette: https://curl.haxx.se/mail/etiquette.html
>
Received on 2021-10-12