curl-and-php
FOLLOWLOCATION as true doesn't seem to work
Date: Sun, 6 May 2012 23:24:18 -0700
PHP curl users,
I can't figure out why I can't get the html source code of:
http://corpusdelespanol.org/x2.asp
even though I can get the source code of:
http://corpusdelespanol.org/x.asp
http://corpusdelespanol.org/x1.asp
http://corpusdelespanol.org/x3.asp
http://corpusdelespanol.org/x4.asp
My PHP script starts at x.asp and gets a cookie and then passes
sequentially through the webpages. For some reason I can't get the
source code of x2.asp. Instead, I get the source of blank.asp. When I
use a browser (whether Firefox, Chrome, or Opera) I can see the source
code of x2.asp accurately, but curl doesn't give it to me. Instead it
gives me the source code of blank.asp, even though I have
FOLLOWLOCATION set to true. I'm at a loss. One idea: x2.asp has some style
tag definitions above the opening <html> tag while the other pages
don't; the others start with the opening <html> tag and then define
styles. Could that throw off curl? Are browsers able to correct this
while curl is not? Here's my PHP script:
start of script
_____________
<?php
if (empty($_GET['word'])) {
die("Need to specify the word as a GET parameter, i.e. ?word=<word>");
}
$ch = curl_init();
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,true);
curl_setopt($ch,CURLOPT_HEADER,true);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch,CURLOPT_COOKIEJAR,'cookie.txt');
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.202 Safari/535.1');
curl_setopt($ch,CURLOPT_URL,'http://corpusdelespanol.org/x.asp');
curl_exec($ch);
curl_setopt($ch,CURLOPT_URL,'http://corpusdelespanol.org/x1.asp');
curl_exec($ch);
curl_setopt($ch,CURLOPT_URL,'http://corpusdelespanol.org/x2.asp?chooser=seq&p='
. urlencode($_GET['word']) .
'&w2=&wl=4&wr=4&r1=&r2=&ipos1=-select-&B7=SEARCH&sec1=19&sec1=18&sec2=0&sortBy=freq&sortByDo2=alph&minfreq1=freq&freq1=4&freq2=4&numhits=100&kh=100&groupBy=words&whatshow=raw&saveList=no&corpus=cde&ownsearch=y&holder=&whatdo=seq&rand1=y&whatdo1=1&didRandom=n&minFreq=freq&s1=1&s2=2&s3=3&perc=mi');
$second_page = curl_exec($ch);
echo $second_page . "<br /><br /><br />";
curl_setopt($ch,CURLOPT_URL,'http://corpusdelespanol.org/x3.asp?r=23&w11='
. urlencode($_GET['word']));
$inpage = curl_exec($ch);
preg_match('|<span ID="w_section">SECTION</span>:
ORAL <b>\(([0-9,]*)\)</b></td>|Ui', $inpage, $matches);
echo str_replace(',', '', $matches[1]);
__________
end of script
Instead of giving me what I see as the source of x2.asp in a browser,
I get this header and then the source of blank.asp:
HTTP/1.1 302 Object moved
Date: Mon, 07 May 2012 05:49:52 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Location: blank.asp
Content-Length: 130
Content-Type: text/html
Cache-control: private

HTTP/1.1 200 OK
Date: Mon, 07 May 2012 05:49:52 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Content-Length: 755
Content-Type: text/html
Cache-control: private
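To make the redirect chain in that output easier to read, a small sketch like the following could summarize it. The function name summarize_header_blocks and the abbreviated $sample string are hypothetical and are not part of the script above; the sketch assumes it is given only the header portion of a response fetched with CURLOPT_HEADER set to true.

```php
<?php
// Hypothetical helper (not part of the original script): turn the header
// text of a CURLOPT_HEADER response into one entry per HTTP response.
function summarize_header_blocks(string $rawHeaders): array
{
    $hops = [];
    $current = null;
    foreach (preg_split('/\r\n|\n|\r/', trim($rawHeaders)) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // blank lines only separate header blocks
        }
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            // A status line starts a new response in the redirect chain.
            if ($current !== null) {
                $hops[] = $current;
            }
            $current = ['status' => (int) $m[1], 'location' => null];
        } elseif ($current !== null && stripos($line, 'Location:') === 0) {
            $current['location'] = trim(substr($line, strlen('Location:')));
        }
    }
    if ($current !== null) {
        $hops[] = $current;
    }
    return $hops;
}

// The header text quoted above, abbreviated to the relevant lines.
$sample = "HTTP/1.1 302 Object moved\r\nLocation: blank.asp\r\nContent-Length: 130\r\n\r\n"
        . "HTTP/1.1 200 OK\r\nContent-Length: 755\r\nContent-Type: text/html\r\n\r\n";

foreach (summarize_header_blocks($sample) as $i => $hop) {
    echo $i, ': ', $hop['status'], ' ', $hop['location'] ?? '(no Location header)', "\n";
}
```

On the headers quoted above this lists a 302 pointing at blank.asp followed by a 200, which suggests that the redirect itself is being followed and that it is the server that sends curl to blank.asp.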
I had assumed that setting FOLLOWLOCATION to true would fix this
problem, but it doesn't. Any ideas?
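One way to narrow this down might be to record, for each step, where curl actually ended up. The sketch below is only an illustration: describe_transfer is a hypothetical helper, and the hand-made $fakeInfo array stands in for what curl_getinfo($ch) returns (its url, http_code, redirect_count and header_size keys) so that the example runs without a network connection.

```php
<?php
// Hypothetical helper: describe one finished transfer from the array that
// curl_getinfo($ch) returns, and split headers from body using header_size
// (the total size of all header blocks when CURLOPT_HEADER is true).
function describe_transfer(string $response, array $info): array
{
    $headerSize = $info['header_size'] ?? 0;
    return [
        'final_url' => $info['url'] ?? '',
        'status'    => $info['http_code'] ?? 0,
        'redirects' => $info['redirect_count'] ?? 0,
        'headers'   => substr($response, 0, $headerSize),
        'body'      => substr($response, $headerSize),
    ];
}

// In the script this would be called right after each request:
//   $response = curl_exec($ch);
//   $step = describe_transfer($response, curl_getinfo($ch));
// Here made-up values stand in for a real transfer (assumption).
$fakeHeaders  = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$fakeResponse = $fakeHeaders . "<html></html>";
$fakeInfo = [
    'url'            => 'http://example.invalid/blank.asp',
    'http_code'      => 200,
    'redirect_count' => 1,
    'header_size'    => strlen($fakeHeaders),
];

$step = describe_transfer($fakeResponse, $fakeInfo);
echo $step['status'], ' ', $step['final_url'],
     ' after ', $step['redirects'], " redirect(s)\n";
```

Printing this after every curl_exec call would show at which step the final URL stops matching the URL that was requested.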
Thank you for your help. Best, Earl Brown
--
Earl K. Brown, PhD
Chair, Language ULR Committee
Assistant Professor of Spanish Linguistics
School of World Languages and Cultures
California State University, Monterey Bay
_______________________________________________
http://cool.haxx.se/cgi-bin/mailman/listinfo/curl-and-php
Received on 2012-05-07