Re: Help Using command line cURL with an ASP site

From: Mike Juvrud <mike_at_mudlabs.com>
Date: Sat, 12 Jun 2004 13:36:27 -0500

Previously I asked (a couple days ago) ...
[I am looking for help in extracting records from an online database.
The data is presented as a web page for each record; there are 5 pages
per record. I need to get each page for each of the 12000+ records....]
 
I was able to get a cURL script to work by storing and then sending
cookies with the command line options as suggested (and by using
HTTPLiveHeaders). Thanks for that tip! (See below for what I did.)
 
However, the server appears to have allowed me to download only the
first 120 pages of data (of ~60,000); for those first 120 pages I got
the real content, but after that it returned a custom error page for
each record. Now I cannot access the records that I did not already
download via the script, either with cURL or in a web browser. (Prior
to the successful curl run I could browse all 60,000 pages; now I can
only see or access the 120 I downloaded via curl.)
 
It appears that the database has locked me out somehow. There are 4
other similar databases on the same server (for different counties)
which I can still access just fine in my browser, but I have not dared
attempt another cURL script in case I get locked out of those as well.
 
Could there be some check on the server that blocks automation scripts
from accessing the data? Could the check be time-based (only x requests
allowed per minute from one IP address), or some other mechanism that
can be worked around somehow?
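
If the check is time-based, I suppose I could pace the requests myself
instead of letting curl fire them back-to-back from the config file.
A rough sketch of what I have in mind (pids.txt, the 5-second pause,
and the User-Agent string are just guesses on my part, not anything I
know the server to require):

    #!/bin/sh
    # pids.txt: one parcel id per line, e.g. 01-0020-000 (hypothetical)
    while read pid; do
      for tid in 0 1 2; do
        curl -b newcookies.txt \
             -A "Mozilla/5.0" \
             -o "${pid}-${tid}.html" \
             "http://morris.state.mn.us/tax/Parcel.asp?pid=%20${pid}&tid=${tid}"
        sleep 5   # pause between requests in case the limit is per minute
      done
    done < pids.txt
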
**********************
The site that hosts the data access is:
http://morris.state.mn.us/tax/tax.asp
**********************
Here is a summary of what I did:
1) [COMMAND LINE - store cookie]
    curl -c newcookies.txt "http://morris.state.mn.us/tax/tax.asp"
 
2) [MOZILLA - went to the webpage (above line) - clicked on the "Pope
County" Link -> Accepted the disclaimer agreement]
 
3) [Opened the cookie file "newcookies.txt" - modified the value to
what was listed in HTTPLiveHeaders to fake acceptance of the disclaimer
- see the cookie file sketch after this summary]
 
4) [Created a config file with a list of the pages to get (about 1000)
and the filenames to store them as - a generation sketch also follows
this summary]
(Sample config file contents)
url = "http://morris.state.mn.us/tax/Parcel.asp?pid=%2001-0020-000&tid=0"
-o 01-0020-000-0.html
url = "http://morris.state.mn.us/tax/Parcel.asp?pid=%2001-0020-000&tid=1"
-o 01-0020-000-1.html
url = "http://morris.state.mn.us/tax/Parcel.asp?pid=%2001-0020-000&tid=2"
-o 01-0020-000-2.html
 
5) [COMMAND LINE - download pages using config file]
    curl -b newcookies.txt -K curlconfigPope.txt
 
6) [RESULT - the first 120 pages were stored successfully; starting
with the 121st download, every page was an error page, and it appears
I am now locked out of every record except the 120 I successfully
accessed via cURL.]
**********************
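
For anyone curious about step 3: the cookie file curl writes is in
Netscape format, one cookie per tab-separated line (domain, subdomain
flag, path, secure flag, expiry, name, value). The line I edited looked
roughly like this - the name and value below are made up for
illustration; the real ones came from HTTPLiveHeaders:

    morris.state.mn.us	FALSE	/	FALSE	0	disclaimer	accepted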
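
And regarding step 4: with 12000+ records the config file is easier to
generate than to type. Something like this would build it from a plain
list of parcel ids (again just a sketch; pids.txt is the same
hypothetical one-id-per-line file as above):

    #!/bin/sh
    # write curl config entries for every parcel id in pids.txt
    while read pid; do
      for tid in 0 1 2; do
        echo "url = \"http://morris.state.mn.us/tax/Parcel.asp?pid=%20${pid}&tid=${tid}\""
        echo "-o ${pid}-${tid}.html"
      done
    done < pids.txt > curlconfigPope.txt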
 
I deleted all cookies and it had no effect; no matter which browser I
use (Mozilla, Firefox, IE), I cannot access any records beyond the 120
I fetched in the initial cURL run.
 
Any suggestions or comments about what is happening?
 
Should I contact the site administrator to get the details of any
automation policy? If I have a clue about what to ask for, I can have
them change something on the server to allow this access; it is all
public data, and my accessing it this way saves them time processing
our requests. (That's what the site was built for anyway.)
 
Hopefully, I can get cURL to work so that I don't have to go through
the incredible hassle of getting the data from the county clerks. Plus,
I was hoping this method would let us keep our DB up to date without
waiting 3 weeks for the data to appear in our inbox.
 
Thanks in advance!
 
*********************************
Mike Juvrud
Programmer
Mudlabs
320.634.4410
Glenwood, MN, USA
HYPERLINK "mailto:mike_at_mudlabs.com"mike_at_mudlabs.com
HYPERLINK "http://www.mudlabs.com/"www.mudlabs.com
*********************************
 

 