python 2.7 - How to iterate links when the total page number is unknown? -
I want the application links from every page, but the problem is that the total number of pages is not the same in each category. I have this code:
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/mp3_audio/'
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)
for a in soup.select('div.coleft.cate.mbottom dd a[href]'):
    print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
    suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
    for page in range(1, 27 + 1):
        content = urllib.urlopen(suburl + '{}.html'.format(page))
        soup = BeautifulSoup(content)
        for a in soup.select('div.freetext dl a[href]'):
            print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
But this gets the application links from 27 pages in each category. What if another category has fewer than 27 pages, or more than 27?
You can extract the total number of programs and divide it by 20 (there are 20 programs per page). For example, if you open the url http://www.brothersoft.com/windows/photo_image/font_tools/2.html and then run:
import re
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/photo_image/font_tools/2.html'
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)
pages = soup.find("div", {"class": "freemenu coleft menubox"})
page = pages.text
print int(re.search(r'of (\d+) ', page).group(1)) / 20 + 1
the output will be:
18
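The page-count arithmetic can be checked without any network access. `count_pages` below is a hypothetical helper (it is not in the original answer) that applies the same regex and integer division to a pager string like the one Brothersoft renders:

```python
import re

def count_pages(pager_text, per_page=20):
    """Extract the total item count from text like '1-20 of 347 results'
    and convert it to a page count at 20 items per page.

    Note: like the original answer's formula, this is off by one when
    the total is an exact multiple of per_page."""
    total = int(re.search(r'of (\d+) ', pager_text).group(1))
    return total // per_page + 1  # floor division, same in Python 2 and 3
```

For example, `count_pages('1-20 of 347 results')` returns 18, matching the output shown above.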
For the url http://www.brothersoft.com/windows/photo_image/cad_software/6.html the output will be 108.
So you need to open a page where you can find how many pages there are, scrape that number, and then you can run your loop. It could look like this:
import re
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/photo_image/'
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)
for a in soup.select('div.coleft.cate.mbottom dd a[href]'):
    suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
    print suburl
    # open page 2 of the category just to read the "1-20 of N" pager text
    content = urllib.urlopen(suburl + '2.html')
    soup1 = BeautifulSoup(content)
    pages = soup1.find("div", {"class": "freemenu coleft menubox"})
    page = pages.text
    allpages = int(re.search(r'of (\d+) ', page).group(1)) / 20 + 1
    print allpages
    for page in range(1, allpages + 1):
        content = urllib.urlopen(suburl + '{}.html'.format(page))
        soup = BeautifulSoup(content)
        for a in soup.select('div.freetext dl a[href]'):
            print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
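As an alternative when the "of N" pager text cannot be scraped, the loop can simply stop at the first page that yields no results. This is a minimal sketch; `iter_app_links` and its `fetch` callback are hypothetical names, not part of the answer, and `fetch(url)` is assumed to return the list of links found on that page:

```python
def iter_app_links(fetch, suburl, max_pages=200):
    """Yield application links page by page, stopping at the first page
    that contains no results.  Avoids needing the total count up front.

    fetch     -- callable taking a url and returning a list of links
    suburl    -- category base url, e.g. 'http://site/cat/'
    max_pages -- safety cap so a broken fetch cannot loop forever
    """
    for page in range(1, max_pages + 1):
        links = fetch(suburl + '{}.html'.format(page))
        if not links:
            break  # an empty page means we ran past the last page
        for link in links:
            yield link
```

Injecting `fetch` as a parameter also makes the pagination logic testable with canned data, without hitting the site.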