python 2.7 - How to iterate links across an unknown total number of pages?


I want to get the application links inside every page, but the problem is that the total number of pages inside each category is not the same. I have this code:

import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/mp3_audio/'
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)

for a in soup.select('div.coleft.cate.mbottom dd a[href]'):
        print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
        suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
        for page in range(1, 27 + 1):
                content = urllib.urlopen(suburl + '{}.html'.format(page))
                soup = BeautifulSoup(content)
                for a in soup.select('div.freetext dl a[href]'):
                        print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')

But this only collects the application links from 27 pages in each category. What if another category has fewer, or more, than 27 pages?

You can extract the number of programs and divide it by 20. For example, if you open the URL http://www.brothersoft.com/windows/photo_image/font_tools/2.html, then:

import re
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/photo_image/font_tools/2.html'
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)
pages = soup.find("div", {"class": "freemenu coleft menubox"})
page = pages.text
print int(re.search(r'of (\d+) ', page).group(1)) / 20 + 1

The output will be:

18 

For the URL http://www.brothersoft.com/windows/photo_image/cad_software/6.html the output is 108.
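The regex step can be checked in isolation before scraping anything. A minimal sketch, assuming the pager div's text contains a fragment like "1-20 of 347 " (the sample string below is invented for illustration):

```python
import re

# Made-up sample of the pager text as it might be scraped from the page
pager_text = "Windows Font Tools, showing 1-20 of 347 "

total = int(re.search(r'of (\d+) ', pager_text).group(1))
pages = total // 20 + 1  # // is explicit integer division (what / does in Python 2)
print(total, pages)
```

With 347 programs and 20 per page, this yields 18 pages, matching the output above.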

So for each category you need to open one page, scrape the number of programs, and then you can run the loop. It could look like this:

import re
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/photo_image/'
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)

for a in soup.select('div.coleft.cate.mbottom dd a[href]'):
        suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
        print suburl

        content = urllib.urlopen(suburl + '2.html')
        soup1 = BeautifulSoup(content)
        pages = soup1.find("div", {"class": "freemenu coleft menubox"})
        page = pages.text
        allpages = int(re.search(r'of (\d+) ', page).group(1)) / 20 + 1
        print allpages
        for page in range(1, allpages + 1):
                content = urllib.urlopen(suburl + '{}.html'.format(page))
                soup = BeautifulSoup(content)
                for a in soup.select('div.freetext dl a[href]'):
                        print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
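One caveat with the `total / 20 + 1` arithmetic: when the program count is an exact multiple of 20 (say 360), it yields 19 even though only 18 pages exist, so the loop would fetch one empty page. A ceiling-division helper avoids that off-by-one; this is a sketch, and `page_count` is not part of the original answer:

```python
def page_count(total_programs, per_page=20):
    # Ceiling division: rounds up, but does not overcount
    # when total_programs is an exact multiple of per_page.
    return (total_programs + per_page - 1) // per_page

print(page_count(347))  # 18
print(page_count(360))  # 18, whereas 360 / 20 + 1 would give 19
```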
