Re: I need help getting a web page

From: Hans Henrik Bergan via curl-users
Date: Tue, 12 Oct 2021 15:51:35 +0200

On Tue, 12 Oct 2021 at 13:17, ToddAndMargo via curl-users wrote:

> On 10/12/21 03:04, Hans Henrik Bergan via curl-users wrote:
> > try digging this company name out of the HTML:
> > <span>AT&amp;T</span>
> >
> > the correct translation, as a proper HTML parser will get you: AT&T
> > what a regex extraction will get you: AT&amp;T
> > try digging the title out of this link:
> > <a href="foo" title="5>3"> Mathematical proof that 5 is greater than 3! </a>
> >
> > a regex extraction is very likely to fail here, and extract 3">
> > Mathematical(...)
> > while a proper HTML parser will have no problem, and correctly parse out
> > "Mathematical proof that 5 is greater than 3!"
> >
> > but it's only broken code, not life and death.
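
To make the parser point above concrete, here is a minimal sketch using Python's standard-library html.parser. The names LinkCollector and extract_links are made up for illustration; nothing in this thread prescribes this particular parser, it is just one that ships with Python.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href, title and visible text for every <a> element.

    Hypothetical helper for illustration; not part of curl or any tool
    mentioned in this thread.
    """

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &amp; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self._current = {
                "href": attrs.get("href"),
                "title": attrs.get("title"),
                "text": "",
            }

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def extract_links(html_text):
    """Return a list of dicts, one per <a> element in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return parser.links
```

Calling extract_links() on the link quoted above should give the title attribute and the link text as separate fields, with the ">" inside the quoted attribute left intact, which is the case a naive regex tends to get wrong.
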
> I am basically looking for links and revisions.
> But if I had to deal with
> <body>
> AT&amp;T
> </body>
> I'd probably do a
> $ raku
> Welcome to 𝐑𝐚𝐤𝐮𝐝𝐨™ v2021.07.
> Implementing the 𝐑𝐚𝐤𝐮™ programming language v6.d.
> Built on MoarVM version 2021.07.
> To exit type 'exit' or '^D'
> > my $x = Q[AT&amp;T];
> AT&amp;T
> > $x~~s/ ('AT&amp;T') /AT&T/;
> 「AT&amp;T」
> 0 => 「AT&amp;T」
> > say $x
> AT&T
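
The substitution above handles exactly one hard-coded string. As a hedged sketch of the more general case, Python's standard library has html.unescape(), which decodes named and numeric character references; the wrapper name decode_entities is only for illustration.

```python
import html


def decode_entities(text):
    """Decode HTML character references in text.

    Thin illustrative wrapper around the standard-library html.unescape();
    it covers named references such as &amp; and numeric ones such as &#169;.
    """
    return html.unescape(text)
```

This only decodes entities in a string that has already been cut out of the page; it does not help with locating the right span in the first place.
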
> Revisions and links never have odd characters in them.
> It is far easier for me to just go straight to the
> code itself than to try translating it to text. Keep
> in mind that I know the pattern I am looking for and
> the rest of the page is just noise to be discarded.
> My biggest difficulty is having to go into
> hexedit to find unprintable characters, but I
> have gotten pretty good at figuring out when
> that is happening and working around them. This
> usually happens when a web designer mixes UTF-8
> and UTF-16 together by accident. I am in UTF-8.
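
As an alternative to hunting through hexedit by hand, here is a rough sketch that lists suspicious characters in a downloaded page. It assumes the page was saved as raw bytes and is meant to be UTF-8; the function name find_suspect_chars and the choice of what counts as "suspect" are my own assumptions. ASCII text encoded as UTF-16 shows up as NUL characters when read as UTF-8, which is one thing this catches.

```python
import unicodedata


def find_suspect_chars(raw, encoding="utf-8"):
    """Return (index, codepoint, name) for characters that do not print.

    Illustrative helper, not from any tool in this thread. 'index' is the
    character position in the decoded text, not a byte offset in the file.
    """
    # Bytes that are invalid in the chosen encoding become U+FFFD
    # instead of raising an exception.
    text = raw.decode(encoding, errors="replace")
    suspects = []
    for index, ch in enumerate(text):
        if ch in "\n\r\t":
            continue  # ordinary whitespace controls are expected
        category = unicodedata.category(ch)
        # Cc = control characters, Cf = format characters (for example a BOM)
        if category in ("Cc", "Cf") or ch == "\ufffd":
            name = unicodedata.name(ch, "<unnamed>")
            suspects.append((index, f"U+{ord(ch):04X}", name))
    return suspects
```

Feeding it the bytes that curl wrote to a file (for example open(path, "rb").read(), with path being whatever file you saved to) gives a list of positions to look at, which may be quicker than scanning a hex dump.
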

Received on 2021-10-12