curl-and-python
Program dies on call of multi.select...
Date: Tue, 12 Aug 2014 15:09:47 -0700
I have a rather complicated crawler that seems to die often - but not
always at the same place.
What's exasperating is that no exceptions, stack traces, etc. are
printed. I was only able to find where it died by adding lots of print
statements and seeing which one was printed last.
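A death with no Python exception at all is often a crash inside a C extension, though whether that is the case here is only an assumption. If it is, the standard-library faulthandler module (Python 3.3 or later) can write a Python traceback when the process receives a fatal signal such as SIGSEGV. The sketch below enables it and adds a flushed checkpoint helper; the log path and the `trace` helper are hypothetical names, not part of the crawler.

```python
import faulthandler
import os
import sys
import tempfile

# Hypothetical log location; any writable path would do.
CRASH_LOG_PATH = os.path.join(tempfile.gettempdir(), "crawler_crash.log")

# Keep the file object open for the life of the process: faulthandler
# writes to its file descriptor when a fatal signal arrives.
crash_log = open(CRASH_LOG_PATH, "w")
faulthandler.enable(file=crash_log, all_threads=True)


def trace(label):
    """Print a checkpoint marker and flush it immediately, so the last
    marker is not left in a buffer if the process dies right afterwards."""
    print(label, file=sys.stderr, flush=True)


trace("ag2")
```

Flushing matters because output redirected to a file or pipe is usually buffered, so without it the last marker seen is not necessarily the last line that ran.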
Here's a somewhat simplified version of the code:
import datetime

import pycurl

multi = pycurl.CurlMulti()
print("ag2")
now = datetime.datetime.utcnow()
print("ag3")
for counter, website in enumerate(websites, 1):
    print("ag4")
    assert website.crawl_type in ('standard', 'refresh', 'new')
    print("ag5")
    website.grabber = WebSite.Resource(website.next_page.original_url,
                                       anonymous=Options.anonymous)
    print("ag6")
    website.next_page.crawled_ts = now
    print("ag7")
    multi.add_handle(website.grabber._curl)
    print("ag8")
print("ag9")
# Number of seconds to wait for a timeout to happen
if Options.test:
    # Set for longer cause blicker_pierce takes forever
    # on the additional start page with all the wines
    SELECT_TIMEOUT = 30.0
else:
    SELECT_TIMEOUT = 10.0
print("ag10")
# To do: implement it this way:
# http://www.josefassad.com/pycurl_curlmulti_mini_howto
# Stir the state machine into action
while 1:
    print("ag11")
    ret, num_handles = multi.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        break
print("ag12")
#CauseError
# Keep going until all the connections have terminated
while num_handles:
    # The select method uses fdset internally to determine which file
    # descriptors to check.
    # Todo: This code is looped a lot
    # Should there be a sleep here???? I got no idea
    print("ag12.5")
    print("calling multi.select with:", SELECT_TIMEOUT)
    print("Please don't die here!!!!")
    multi.select(SELECT_TIMEOUT)
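For reference, the usual shape of this perform/select loop can be sketched as below. It is written against any object with CurlMulti-style perform() and select() methods so that it runs without pycurl or a network. `drive_transfers` and `ScriptedMulti` are hypothetical names, and the treatment of a non-positive select() result as "nothing ready yet" is an assumption about the return value, not something taken from the crawler.

```python
def drive_transfers(multi, select_timeout, call_again):
    """Run perform()/select() until no handles remain active.

    `call_again` is the "call perform() again" code; real code would
    pass pycurl.E_CALL_MULTI_PERFORM. Returns the number of select()
    calls that reported nothing ready.
    """
    idle_waits = 0

    def perform_all():
        # Call perform() until it stops asking to be called again.
        while True:
            ret, active = multi.perform()
            if ret != call_again:
                return active

    num_handles = perform_all()
    while num_handles:
        # select() is expected to wait up to select_timeout for socket
        # activity, so this loop should not need its own sleep.
        ready = multi.select(select_timeout)
        if ready <= 0:  # assumption: non-positive means nothing ready
            idle_waits += 1
            continue
        num_handles = perform_all()
    return idle_waits


class ScriptedMulti:
    """Hypothetical stand-in for pycurl.CurlMulti, for a dry run only."""

    def __init__(self, perform_results, select_results):
        self._perform = iter(perform_results)
        self._select = iter(select_results)

    def perform(self):
        return next(self._perform)

    def select(self, timeout):
        return next(self._select)


CALL_AGAIN = -1  # placeholder; real code would use pycurl.E_CALL_MULTI_PERFORM
fake = ScriptedMulti(
    perform_results=[(CALL_AGAIN, 2), (0, 2), (0, 1), (0, 0)],
    select_results=[-1, 1, 1],
)
idle_waits = drive_transfers(fake, 10.0, CALL_AGAIN)
```

With a real CurlMulti, the same function would be called as drive_transfers(multi, SELECT_TIMEOUT, pycurl.E_CALL_MULTI_PERFORM) after the handles have been added.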
_______________________________________________
http://cool.haxx.se/cgi-bin/mailman/listinfo/curl-and-python
Received on 2014-08-13