curl-users

URL parsing enhancement

From: Vlad Krupin <vlad_at_echospace.com>
Date: Wed, 09 Apr 2003 16:23:46 -0700

Hi,

I have been fixing a Hotmail scraper script that uses curl, and noticed
that Hotmail now uses URLs like

http://example.com?param=blah...

Notice the missing slash after 'example.com' but before the question mark.

I do not know whether this is a valid URL, but IE, Mozilla, and even
Lynx all accept it, while curl does not: it treats the query string as
part of the hostname and tries to resolve the result.
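
The behaviour described above can be reproduced outside curl. The sketch below is only an illustration: split_url() is a hypothetical helper, not a curl function, and the buffer sizes and the added %1023 width on the path are assumptions chosen for the example. It runs the same kind of sscanf() pattern with and without '?' in the host scanset.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of curl): split a URL into scheme,
 * host and path using an sscanf() pattern modelled on the one in url.c.
 * If stop_at_query is non-zero, the host scanset also stops at '?'.
 * Assumed buffer sizes: proto >= 65, host >= 513, path >= 1024 bytes.
 * Returns the number of fields sscanf() assigned.
 */
static int split_url(const char *url, int stop_at_query,
                     char *proto, char *host, char *path)
{
    /* Host runs up to the first '/' (or end of line). */
    const char *old_fmt = "%64[^\n:]://%512[^\n/]%1023[^\n]";
    /* Host runs up to the first '/' or '?', whichever comes first. */
    const char *new_fmt = "%64[^\n:]://%512[^\n/?]%1023[^\n]";

    /* Defaults, in case the pattern does not fill every field. */
    proto[0] = '\0';
    host[0] = '\0';
    strcpy(path, "/");

    return sscanf(url, stop_at_query ? new_fmt : old_fmt,
                  proto, host, path);
}
```

Called on "http://example.com?param=blah", the first pattern leaves the whole of "example.com?param=blah" in host and the default "/" in path, while the second yields "example.com" and "?param=blah". The path still lacks its leading slash at that point, so it has to be added afterwards.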

I have created a small patch that allows URLs like that to be parsed. I
am not an expert in curl or C, so I did it to the best of my
understanding. It works for me. If there is a better way to do this,
please correct the patch; it shouldn't be too hard.

If it looks alright, can someone apply the patch? The patch was made
against version 7.10.4. Thanks!

P.S. I am not subscribed to the list. Please, cc: me when replying.

Vlad

-- 
Vlad Krupin
Software Engineer
echospace.com

--- url.c-old Wed Apr 9 16:04:33 2003
+++ url.c Wed Apr 9 16:14:19 2003
@@ -1921,9 +1921,13 @@
     /* Set default host and default path */
     strcpy(conn->gname, "curl.haxx.se");
     strcpy(conn->path, "/");
-
+    /* we need to search for '/' OR '?' - whichever comes first after host
+     * name but before the path. We need to change that to handle things
+     * like http://example.com?param= (notice the missing '/'). Later we'll
+     * insert that missing slash at the beginning of the path.
+     */
     if (2 > sscanf(data->change.url,
-                   "%64[^\n:]://%512[^\n/]%[^\n]",
+                   "%64[^\n:]://%512[^\n/?]%[^\n]",
                    conn->protostr, conn->gname, conn->path)) {
 
       /*
@@ -1974,6 +1978,14 @@
 
   buf = data->state.buffer; /* this is our buffer */
 
+  /* If URL is malformed (missing a '/' after hostname before path)
+   * we insert a slash here (memmove, since the two regions overlap)
+   */
+  if(conn->path[0] == '?'){
+    memmove(&conn->path[1], conn->path, strlen(conn->path)+1);
+    conn->path[0] = '/';
+  }
+
   /*
    * So if the URL was A://B/C,
    * conn->protostr is A
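
Shifting a string one byte to the right inside its own buffer means the source and destination overlap. C leaves strcpy() undefined in that case, while memmove() is specified to handle it. The sketch below shows one way the slash insertion could be written with a bounds check. ensure_leading_slash() and its size parameter are hypothetical names for illustration, not curl code.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of curl): if path starts with '?',
 * shift it right by one byte and put a '/' in front of it.
 * size is the total capacity of the path buffer in bytes.
 * Returns 0 on success (including when nothing had to be done),
 * or -1 if the buffer has no room for the extra byte.
 */
static int ensure_leading_slash(char *path, size_t size)
{
    size_t len;

    if (path[0] != '?')
        return 0;               /* already has a path, or is empty */

    len = strlen(path);
    if (len + 2 > size)         /* '/' + string + terminating NUL */
        return -1;

    /* Overlapping copy: memmove() is safe here, strcpy() is not. */
    memmove(path + 1, path, len + 1);
    path[0] = '/';
    return 0;
}
```

With a large enough buffer, "?param=blah" becomes "/?param=blah". A path that already starts with '/' is left untouched, and a full buffer is reported rather than overrun.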

Received on 2003-04-10