curl-users

Re: CURL and advertising

From: Alessandro Vesely <vesely_at_tana.it>
Date: Tue, 13 Nov 2007 08:44:17 +0100

Jean Marie COUPRIE wrote:
>
>>> I have written a few scripts using Rexx Regina + rexxcurl with the help
>>> of liveHTTPHeaders extension of FireFox.
>>> It is not so easy and there is the problem of some servers that pay
>>> their services by imposing to the user to look at some advertising
>>> screens. Apparently there are 2 URLs : the main to the application
>>> domain and the other to the people that manage advertising. Curl seems
>>> to log and show only the sources of the pages to and from the application
>>> and nothing of the messages (cookies) to or from the other URL.
>>
>> Curl only downloads what you explicitly ask it to download.
>> It does not download images in an html page. It doesn't even
>> parse HTML or interpret JavaScript...
>>
> How can I ask Curl to put in variables the cookies to or from the URL
> that manage advertising so that I can parse them with my program?

From the man page, http://curl.haxx.se/docs/manpage.html:
  --cookie-jar <file name> to store cookies
  --cookie <file name> to pass them along with the request.

Note that cookies live in the http protocol headers, not in
the actual page content.
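
To get those cookies into variables, one option is to read the file that
--cookie-jar writes. It uses the Netscape cookie file format: one cookie
per line, seven tab-separated fields (domain, subdomain flag, path,
secure flag, expiry, name, value). A minimal Python sketch, assuming
that layout; the function name parse_cookie_jar and the sample domains,
names and values are made up for illustration:

```python
def parse_cookie_jar(text):
    """Return {(domain, name): value} from Netscape-format cookie jar text."""
    cookies = {}
    for line in text.splitlines():
        # curl marks HttpOnly cookies with this prefix; it is not a comment.
        if line.startswith("#HttpOnly_"):
            line = line[len("#HttpOnly_"):]
        elif line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # not a cookie line
        domain, _subdomains, _path, _secure, _expires, name, value = fields
        cookies[(domain, name)] = value
    return cookies


# Hypothetical jar contents, in the layout curl writes.
SAMPLE_JAR = (
    "# Netscape HTTP Cookie File\n"
    "ads.example.com\tFALSE\t/\tFALSE\t0\ttext\t1234\n"
    "#HttpOnly_www.example.com\tFALSE\t/\tFALSE\t0\tsession\tabc\n"
)

jar = parse_cookie_jar(SAMPLE_JAR)
```

Keying on (domain, name) keeps the application's cookies apart from the
advertiser's, since the two servers may reuse the same cookie name.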

>>> At some
>>> stages I have to click on an image to choose the advertising,
>>
>> You "have" to?? Why?
>>
> This is the "click to proceed" sort of thing sent by the main application.

"Click to proceed" may yield more ads, but it is different from
clicking _on_ an ad (click-through). The URL corresponding to
"click to proceed" is written inside the HTML page, either
in the form

  <a href="the url to GET">click to proceed</a>

or

  <form action="the url to POST or GET">...
    <input type="submit" value="Click to proceed"></form>

You can copy and paste the url if it is always the same. If not,
you need to parse the content of the page to learn the url.
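
Learning the url from the page could look roughly like the following
Python sketch, which uses the standard html.parser module. The class
name ProceedFinder and the sample page are invented for the example; a
real page will need its own matching rule.

```python
from html.parser import HTMLParser


class ProceedFinder(HTMLParser):
    """Collect link targets by link text, and every form's method/action."""

    def __init__(self):
        super().__init__()
        self.links = {}     # lowercased link text -> href
        self.forms = []     # (method, action) for each <form>
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._href = attrs["href"]
            self._text = []
        elif tag == "form":
            # HTML forms default to GET when no method is given.
            method = (attrs.get("method") or "get").upper()
            self.forms.append((method, attrs.get("action")))

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links["".join(self._text).strip().lower()] = self._href
            self._href = None


# Made-up page content for illustration.
PAGE = """
<a href="/next?step=2">Click to proceed</a>
<form action="/submit" method="post">
  <input type="submit" value="Click to proceed"></form>
"""

finder = ProceedFinder()
finder.feed(PAGE)
proceed_url = finder.links.get("click to proceed")
```

The url found this way may be relative; urllib.parse.urljoin can resolve
it against the address of the page before it is handed to curl.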

>>> at this
>>> point 2 URL messages are sent one to the application saying "proceed",
>>> the consequence of the other being that I receive a screen for a company
>>> that try to sell me something (obviously I just close this screen in 99%
>>> of the cases...). This 2nd message often include variable informations
>>> (e.g. text=1234) that seems taken for the cookies that I have not seen
>>> with curl (only with liveHTTPHeaders), so I cannot simulate this
>>> message. For my purpose the first message is sufficient. BUT I assume
>>> that the application server says to the advertising server "I have
>>> generated n clicks to you, pay me x $" but the other answers "I have
>>> just received m (with n>m because mine are missing!)" They may discover
>>> that I have used an automate that does not look at their advertising
>>> and be very unhappy!
>>
>> How they reckon impressions and clicks shouldn't be your problem.
>>
> It is my problem because the application server gives to the
> advertising manager many information on me a number, my birth date, my
> town (not my name and address it will be illegal), if the advertiser
> think that I fool the normal process he can complain to the application
> that can revoke my userid...

Hm... if you have a userid, you had to log in. I'd be inclined to think
that the server gives you a cookie to check that you did log in and to
recover your session data (e.g. from a temporary file whose name can be
found in the cookie value), independently of how cluttered with ads each
page may be. Recall that HTTP, without cookies, is stateless.

>> Besides curl users, indexing bots and CUI browsers (e.g. lynx) avoid
>> downloading images. Browsers with no enabled JavaScript engine are also
>> quite common.
>>
> They will receive a message that they cannot work without cookies and
> Javascript enabled...

Curl can manage cookies, but not JavaScript. You need to work out what
the JavaScript is used for in order to automate the process: a
script can change form field values and also make HTTP requests.

>>> Is there a solution with cURL to send the message to the advertising
>>> server to make it happy ?
>>
>> Telling curl to send a request will let it do so. Happiness is beyond
>> its reach...
>>
> I have the URL to send a request but not the information to put after it
> (in the format xyz=1234 with xyz fixed but 1234 variable) that I have to
> find by parsing a cookie see upper. Without this I cannot simulate a
> genuine message...

The cookie will be in curl's cookie jar after the first request.
You probably (also) have to parse the page's content. Save it to
a file and use a regular expression (e.g. in perl or sed) to extract
field values from the template that the server used to generate the
content it served to you. That way you can prepare the next curl request.
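
As a rough sketch of that extraction step, here in Python rather than
perl or sed: the page fragment, the field names and the ads.example.com
address are all assumptions for illustration, and the patterns only cope
with double-quoted attributes where name comes before value.

```python
import re
from urllib.parse import urlencode

# Made-up page fragment; the field names are assumptions for illustration.
PAGE = '''
<form action="http://ads.example.com/click" method="get">
  <input type="hidden" name="text" value="1234">
  <input type="hidden" name="uid" value="a b">
</form>
'''

# Matches <input ... name="..." ... value="..."> with name before value.
FIELD_RE = re.compile(
    r'<input\b[^>]*\bname="([^"]*)"[^>]*\bvalue="([^"]*)"', re.IGNORECASE)
ACTION_RE = re.compile(r'<form\b[^>]*\baction="([^"]*)"', re.IGNORECASE)

fields = dict(FIELD_RE.findall(PAGE))
action = ACTION_RE.search(PAGE).group(1)

# URL for the next request, e.g. to pass to curl on its command line.
next_url = action + "?" + urlencode(fields)
```

urlencode takes care of escaping, so a value containing spaces or
ampersands still yields a well-formed query string.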

If that's overly complicated, it may be more suitable to use the curl
library directly from perl, or whatever language you are comfortable
parsing in, rather than invoking curl from a shell.
Received on 2007-11-13