python - How to set a timeout in a web crawler?
I am very new to Python and am trying to develop a very simple web crawler. The crawler works well, but it sometimes gets stuck on a link for a long time. How can I set a timeout?
How do I handle urllib2.HTTPError? Is my statement correct?

    def get_link(page):
        start = page.find('<a href=')
        if start == -1:
            return None, 0
        startp = page.find('"', start)
        endp = page.find('"', startp + 1)
        url = page[startp + 1:endp]
        return url, endp

    def get_all_link(page):
        allurl = []
        while True:
            url, endp = get_link(page)
            if url:
                page = page[endp:]
                allurl.append(url)
            else:
                return allurl
                break

    def get_page(page, tocrawl):
        import urllib2
        try:
            page_source = urllib2.urlopen(page)
            # [...]

    def validate(page):
        # [...]
        if valid == -1:
            return 0
        return 1

    def crawler(seed):
        tocrawl = [seed]
        crawled = []
        i = 0
        while tocrawl:
            page = tocrawl.pop()
            valid = validate(page)
            if valid:
                if page not in crawled:
                    tocrawl = set(tocrawl) | set(get_all_link(get_page(page, tocrawl)))
                    crawled.append(page)
                    i = i + 1
                    f = open("crawled.txt", "a")
                    f.write(repr(i) + ":" + [...] ("http://google.com")
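One possible approach, offered as a minimal sketch rather than a definitive fix: pass a timeout argument to urlopen and catch the fetch errors separately. The example uses Python 3's urllib.request and urllib.error so that it runs on current interpreters. In Python 2.6 and later, urllib2.urlopen accepts the same timeout argument, and the matching exception classes are urllib2.HTTPError and urllib2.URLError. The 10-second default, the single-argument get_page signature, and the choice to return None on failure are assumptions made for illustration, not part of the original code.

```python
import socket
import urllib.error
import urllib.request


def get_page(url, timeout=10):
    """Return the page body as bytes, or None if the fetch fails.

    The 10-second default is an arbitrary, illustrative value.
    """
    try:
        # timeout is in seconds and applies to each blocking socket
        # operation (connecting, reading); it does not cap the total
        # download time.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (404, 500, ...).
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error %d for %s" % (err.code, url))
    except urllib.error.URLError as err:
        # No usable answer: DNS failure, refused connection, or a
        # timeout while connecting.
        print("Failed to reach %s: %s" % (url, err.reason))
    except socket.timeout:
        # A timeout while reading the body is raised directly rather
        # than wrapped in URLError.
        print("Timed out reading %s" % url)
    return None
```

With a function like this, the crawler would need to skip a page when get_page returns None instead of passing the result on to get_all_link. In Python 3 the body comes back as bytes, so it would also have to be decoded before being searched with string patterns.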