
Re: I need help getting a web page

From: Hans Henrik Bergan via curl-users <curl-users_at_lists.haxx.se>
Date: Tue, 12 Oct 2021 15:51:35 +0200

> I'd probably do a
>
> $ raku
> Welcome to 𝐑𝐚𝐤𝐮𝐝𝐨™ v2021.07.
> Implementing the 𝐑𝐚𝐤𝐮™ programming language v6.d.
> Built on MoarVM version 2021.07.
>
> To exit type 'exit' or '^D'
>
> > my $x = Q[AT&amp;T];
> AT&amp;T


yeah..
also keep in mind that &quot; must be translated to " and &#039; must be
translated to ' and &lt; must be translated to < and &gt; must be
translated to > and &nbsp; must be translated to a no-break space and &iexcl; must be
translated to ¡ and &cent; must be translated to ¢ and &pound; must be
translated to £ and &curren; must be translated to ¤ and &yen; must be
translated to ¥ and &brvbar; must be translated to ¦ and &sect; must be
translated to § and &uml; must be translated to ¨ and &copy; must be
translated to © and &ordf; must be translated to ª and &laquo; must be
translated to « and &not; must be translated to ¬ and &shy; must be
translated to ­ and &reg; must be translated to ® and &macr; must be
translated to ¯ and &deg; must be translated to ° and &plusmn; must be
translated to ± and &sup2; must be translated to ² and &sup3; must be
translated to ³ and &acute; must be translated to ´ and &micro; must be
translated to µ and &para; must be translated to ¶ and &middot; must be
translated to · and &cedil; must be translated to ¸ and &sup1; must be
translated to ¹ and &ordm; must be translated to º and &raquo; must be
translated to » and &frac14; must be translated to ¼ and &frac12; must be
translated to ½ and &frac34; must be translated to ¾ and &iquest; must be
translated to ¿ and &Agrave; must be translated to À and &Aacute; must be
translated to Á and &Acirc; must be translated to Â and &Atilde; must be
translated to Ã and &Auml; must be translated to Ä and &Aring; must be
translated to Å and &AElig; must be translated to Æ and &Ccedil; must be
translated to Ç and &Egrave; must be translated to È and &Eacute; must be
translated to É and &Ecirc; must be translated to Ê and &Euml; must be
translated to Ë and &Igrave; must be translated to Ì and &Iacute; must be
translated to Í and &Icirc; must be translated to Î and &Iuml; must be
translated to Ï and &ETH; must be translated to Ð and &Ntilde; must be
translated to Ñ and &Ograve; must be translated to Ò and &Oacute; must be
translated to Ó and &Ocirc; must be translated to Ô and &Otilde; must be
translated to Õ and &Ouml; must be translated to Ö and &times; must be
translated to × and &Oslash; must be translated to Ø and &Ugrave; must be
translated to Ù and &Uacute; must be translated to Ú and &Ucirc; must be
translated to Û and &Uuml; must be translated to Ü and &Yacute; must be
translated to Ý and &THORN; must be translated to Þ and &szlig; must be
translated to ß and &agrave; must be translated to à and &aacute; must be
translated to á and &acirc; must be translated to â and &atilde; must be
translated to ã and &auml; must be translated to ä and &aring; must be
translated to å and &aelig; must be translated to æ and &ccedil; must be
translated to ç and &egrave; must be translated to è and &eacute; must be
translated to é and &ecirc; must be translated to ê and &euml; must be
translated to ë and &igrave; must be translated to ì and &iacute; must be
translated to í and &icirc; must be translated to î and &iuml; must be
translated to ï and &eth; must be translated to ð and &ntilde; must be
translated to ñ and &ograve; must be translated to ò and &oacute; must be
translated to ó and &ocirc; must be translated to ô and &otilde; must be
translated to õ and &ouml; must be translated to ö and &divide; must be
translated to ÷ and &oslash; must be translated to ø and &ugrave; must be
translated to ù and &uacute; must be translated to ú and &ucirc; must be
translated to û and &uuml; must be translated to ü and &yacute; must be
translated to ý and &thorn; must be translated to þ and &yuml; must be
translated to ÿ and &OElig; must be translated to Œ and &oelig; must be
translated to œ and &Scaron; must be translated to Š and &scaron; must be
translated to š and &Yuml; must be translated to Ÿ and &fnof; must be
translated to ƒ and &circ; must be translated to ˆ and &tilde; must be
translated to ˜ and &Alpha; must be translated to Α and &Beta; must be
translated to Β and &Gamma; must be translated to Γ and &Delta; must be
translated to Δ and &Epsilon; must be translated to Ε and &Zeta; must be
translated to Ζ and &Eta; must be translated to Η and &Theta; must be
translated to Θ and &Iota; must be translated to Ι and &Kappa; must be
translated to Κ and &Lambda; must be translated to Λ and &Mu; must be
translated to Μ and &Nu; must be translated to Ν and &Xi; must be
translated to Ξ and &Omicron; must be translated to Ο and &Pi; must be
translated to Π and &Rho; must be translated to Ρ and &Sigma; must be
translated to Σ and &Tau; must be translated to Τ and &Upsilon; must be
translated to Υ and &Phi; must be translated to Φ and &Chi; must be
translated to Χ and &Psi; must be translated to Ψ and &Omega; must be
translated to Ω and &alpha; must be translated to α and &beta; must be
translated to β and &gamma; must be translated to γ and &delta; must be
translated to δ and &epsilon; must be translated to ε and &zeta; must be
translated to ζ and &eta; must be translated to η and &theta; must be
translated to θ and &iota; must be translated to ι and &kappa; must be
translated to κ and &lambda; must be translated to λ and &mu; must be
translated to μ and &nu; must be translated to ν and &xi; must be
translated to ξ and &omicron; must be translated to ο and &pi; must be
translated to π and &rho; must be translated to ρ and &sigmaf; must be
translated to ς and &sigma; must be translated to σ and &tau; must be
translated to τ and &upsilon; must be translated to υ and &phi; must be
translated to φ and &chi; must be translated to χ and &psi; must be
translated to ψ and &omega; must be translated to ω and &thetasym; must be
translated to ϑ and &upsih; must be translated to ϒ and &piv; must be
translated to ϖ and &ensp; must be translated to an en space and &emsp; must be
translated to an em space and &thinsp; must be translated to a thin space and &zwnj; must be
translated to ‌ and &zwj; must be translated to ‍ and &lrm; must be
translated to ‎ and &rlm; must be translated to ‏ and &ndash; must be
translated to – and &mdash; must be translated to — and &lsquo; must be
translated to ‘ and &rsquo; must be translated to ’ and &sbquo; must be
translated to ‚ and &ldquo; must be translated to “ and &rdquo; must be
translated to ” and &bdquo; must be translated to „ and &dagger; must be
translated to † and &Dagger; must be translated to ‡ and &bull; must be
translated to • and &hellip; must be translated to … and &permil; must be
translated to ‰ and &prime; must be translated to ′ and &Prime; must be
translated to ″ and &lsaquo; must be translated to ‹ and &rsaquo; must be
translated to › and &oline; must be translated to ‾ and &frasl; must be
translated to ⁄ and &euro; must be translated to € and &image; must be
translated to ℑ and &weierp; must be translated to ℘ and &real; must be
translated to ℜ and &trade; must be translated to ™ and &alefsym; must be
translated to ℵ and &larr; must be translated to ← and &uarr; must be
translated to ↑ and &rarr; must be translated to → and &darr; must be
translated to ↓ and &harr; must be translated to ↔ and &crarr; must be
translated to ↵ and &lArr; must be translated to ⇐ and &uArr; must be
translated to ⇑ and &rArr; must be translated to ⇒ and &dArr; must be
translated to ⇓ and &hArr; must be translated to ⇔ and &forall; must be
translated to ∀ and &part; must be translated to ∂ and &exist; must be
translated to ∃ and &empty; must be translated to ∅ and &nabla; must be
translated to ∇ and &isin; must be translated to ∈ and &notin; must be
translated to ∉ and &ni; must be translated to ∋ and &prod; must be
translated to ∏ and &sum; must be translated to ∑ and &minus; must be
translated to − and &lowast; must be translated to ∗ and &radic; must be
translated to √ and &prop; must be translated to ∝ and &infin; must be
translated to ∞ and &ang; must be translated to ∠ and &and; must be
translated to ∧ and &or; must be translated to ∨ and &cap; must be
translated to ∩ and &cup; must be translated to ∪ and &int; must be
translated to ∫ and &there4; must be translated to ∴ and &sim; must be
translated to ∼ and &cong; must be translated to ≅ and &asymp; must be
translated to ≈ and &ne; must be translated to ≠ and &equiv; must be
translated to ≡ and &le; must be translated to ≤ and &ge; must be
translated to ≥ and &sub; must be translated to ⊂ and &sup; must be
translated to ⊃ and &nsub; must be translated to ⊄ and &sube; must be
translated to ⊆ and &supe; must be translated to ⊇ and &oplus; must be
translated to ⊕ and &otimes; must be translated to ⊗ and &perp; must be
translated to ⊥ and &sdot; must be translated to ⋅ and &lceil; must be
translated to ⌈ and &rceil; must be translated to ⌉ and &lfloor; must be
translated to ⌊ and &rfloor; must be translated to ⌋ and &lang; must be
translated to 〈 and &rang; must be translated to 〉 and &loz; must be
translated to ◊ and &spades; must be translated to ♠ and &clubs; must be
translated to ♣ and &hearts; must be translated to ♥ and &diams; must be
translated to ♦
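
In practice nobody should maintain that table by hand. As a minimal sketch, assuming Python is available (the thread itself uses Raku, so this is only an illustration), the standard library's html.unescape already knows the named and numeric character references. The helper name decode_entities and the sample strings are made up for this example.

```python
import html


def decode_entities(text: str) -> str:
    """Decode HTML character references (hypothetical wrapper around html.unescape)."""
    return html.unescape(text)


# Made-up sample inputs; any entity from the list above is handled the same way.
samples = ["AT&amp;T", "5 &gt; 3", "&quot;quoted&quot;", "caf&eacute;", "&euro;10", "&#039;"]

for sample in samples:
    print(f"{sample!r} -> {decode_entities(sample)!r}")
```

A single hand-written substitution for &amp; covers the AT&T case, but not the next page that happens to contain &eacute; or &#039;.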




On Tue, 12 Oct 2021 at 13:17, ToddAndMargo via curl-users <
curl-users_at_lists.haxx.se> wrote:

> On 10/12/21 03:04, Hans Henrik Bergan via curl-users wrote:
> > try digging this company name out of the HTML:
> > <span>AT&amp;T</span>
> >
> > the correct translation, as a proper HTML parser will get you: AT&T
> > what a regex extraction will get you: AT&amp;T
> > try digging the title out of this link:
> > <a href="foo" title="5>3"> Mathematical proof that 5 is greater than 3! </a>
> >
> > a regex extraction is very likely to fail here, and extract 3">
> > Mathematical(...)
> > while a proper HTML parser will have no problem, and correctly parse out
> > "Mathematical proof that 5 is greater than 3!"
> >
> > but it's only broken code, not life and death.
>
> I am basically looking for links and revisions.
>
> But if I had to deal with
>
> <body>
> AT&amp;T
> </body>
>
> I'd probably do a
>
> $ raku
> Welcome to 𝐑𝐚𝐤𝐮𝐝𝐨™ v2021.07.
> Implementing the 𝐑𝐚𝐤𝐮™ programming language v6.d.
> Built on MoarVM version 2021.07.
>
> To exit type 'exit' or '^D'
>
> > my $x = Q[AT&amp;T];
> AT&amp;T
>
> > $x~~s/ ('AT&amp;T') /AT&T/;
> 「AT&amp;T」
> 0 => 「AT&amp;T」
>
> > say $x
> AT&T
>
>
>
> Revisions and links never have odd characters in them.
>
> It is far easier for me to just go straight to the
> code itself than to try translating it to text. Keep
> in mind that I know the pattern I am looking for and
> the rest of the page is just noise to be discarded.
>
> My biggest difficulty is having to go into
> hexedit to find unprintable characters, but I
> have gotten pretty good at figuring out when
> that is happening and working around them. This
> usually happens when a web designer mixes UTF-8
> and UTF-16 together by accident. I am in UTF-8.
> --
> Unsubscribe: https://lists.haxx.se/listinfo/curl-users
> Etiquette: https://curl.haxx.se/mail/etiquette.html
>
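
To make the parser-versus-regex point from the quoted message concrete, here is a rough sketch using Python's standard html.parser module (again only an illustration; Raku is what the thread uses). The class name LinkText, the helper extract_links, and the naive regex are all invented for this example; the markup is the sample from the quoted message.

```python
import re
from html.parser import HTMLParser


class LinkText(HTMLParser):
    """Collect a (title, text) pair for every <a> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities in text are decoded.
        super().__init__()
        self.links = []
        self._title = None
        self._chunks = None  # None means "not inside an <a> element"

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._title = dict(attrs).get("title")
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._chunks is not None:
            self.links.append((self._title, "".join(self._chunks).strip()))
            self._chunks = None


def extract_links(markup):
    parser = LinkText()
    parser.feed(markup)
    parser.close()
    return parser.links


page = '<a href="foo" title="5>3"> Mathematical proof that 5 is greater than 3! </a>'

# The parser understands that the ">" inside the quoted attribute is not a tag end.
print(extract_links(page))

# A naive pattern stops at the first ">" and drags attribute debris into the text.
naive = re.search(r"<a[^>]*>(.*?)</a>", page, re.S).group(1)
print(repr(naive))
```

A more careful regex can be written for this one case, but each such fix is one more piece of an HTML parser reimplemented by hand.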

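
For the unprintable characters mentioned in the quoted message, a small script may save a trip into a hex editor. This is a sketch under assumptions: the input is treated as UTF-8, the function name find_suspicious is made up, and "suspicious" here means control characters, invisible format characters, and bytes that failed to decode. Stray UTF-16 text shows up as runs of U+0000 when read as UTF-8.

```python
import unicodedata


def find_suspicious(raw: bytes):
    """Return (character index, description) pairs for characters worth a look."""
    # Undecodable bytes become U+FFFD instead of raising an exception.
    text = raw.decode("utf-8", errors="replace")
    hits = []
    for index, ch in enumerate(text):
        if ch in "\n\r\t":
            continue  # ordinary whitespace controls are expected
        category = unicodedata.category(ch)
        if category in ("Cc", "Cf") or ch == "\ufffd":
            # Control characters have no Unicode name, so fall back to the category.
            label = unicodedata.name(ch, category)
            hits.append((index, f"U+{ord(ch):04X} {label}"))
    return hits


# Made-up sample: UTF-8 text with a UTF-16-LE fragment pasted into it.
sample = b"rev " + "1.2".encode("utf-16-le")
for index, description in find_suspicious(sample):
    print(index, description)
```

The indexes are positions in the decoded text, not byte offsets in the file.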

Received on 2021-10-12