html - Python: limiting a search to a specific hyperlink on a webpage
I found a way to collect the .pdf files that a webpage links to.
Here it is:
    import lxml.html, urllib2, urlparse

    base_url = 'http://www.renderx.com/demos/examples.html'
    res = urllib2.urlopen(base_url)
    tree = lxml.html.fromstring(res.read())

    # EXSLT namespace that enables re:test() inside the XPath expression
    ns = {'re': 'http://exslt.org/regular-expressions'}

    # Match every <a> whose href ends in .pdf (case-insensitive)
    for node in tree.xpath("//a[re:test(@href, '\.pdf$', 'i')]", namespaces=ns):
        # Resolve relative hrefs against the page URL
        print urlparse.urljoin(base_url, node.attrib['href'])
The question is: instead of listing all the PDFs on the webpage, how can I get only the PDF under a specific hyperlink?
I can imagine a check like:
    if 'CA-Personal.pdf' in node:
But what happens if the .pdf file is renamed? Or what if I just want to limit the search to the "application" hyperlink on the webpage? Thank you.
OK, this is not the best way, but it does no harm:
    from bs4 import BeautifulSoup
    import urllib2

    domain = 'http://www.renderx.com'
    url = 'http://www.renderx.com/demos/examples.html'

    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())

    # Keep only the <a> elements whose link text is exactly "application"
    app = soup.find_all('a', text="application")

    for aa in app:
        # Prefixing the domain assumes each href is a root-relative path
        print domain + aa['href']
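For what it is worth, the two ideas in this thread (keep only hrefs ending in .pdf, and keep only links with a given link text) can be combined. The following is only a rough sketch under several assumptions: it is written for Python 3 and uses the standard library's html.parser instead of lxml or BeautifulSoup, the names LinkCollector and pdf_links are made up for illustration, and SAMPLE_HTML with its file paths is invented rather than taken from the real page. It resolves hrefs with urljoin, as the question does, so relative links work too.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, link text) for each <a href=...>."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def pdf_links(html, base_url, link_text=None):
    """Return absolute URLs of .pdf links, optionally filtered by link text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    found = []
    for href, text in parser.links:
        if not href.lower().endswith(".pdf"):
            continue  # not a PDF link
        if link_text is not None and text.lower() != link_text.lower():
            continue  # link text does not match the requested label
        found.append(urljoin(base_url, href))
    return found


# Invented sample markup; the real page's structure and paths may differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/files/demos/Application.PDF">Application</a></li>
  <li><a href="chess.pdf">Chess</a></li>
  <li><a href="/demos/index.html">Application</a></li>
</ul>
"""
BASE_URL = "http://www.renderx.com/demos/examples.html"

for link in pdf_links(SAMPLE_HTML, BASE_URL, link_text="application"):
    print(link)
```

Because the filter looks at the link text and not at the file name, renaming the .pdf file does not break it, as long as the label of the hyperlink stays the same.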