Scrapy Web Crawler Tuning
I have an existing Scrapy script that crawls redfin.com to extract all listings.
It takes a URL and starts crawling: https://www.redfin.com/city/12914/CA/Napa/filter/property-type=house+condo+townhouse+multifamily,max-sqft=2.5k-sqft,max-days-on-market=1wk,include=forsale+mlsfsbo+construction,status=active,viewport=38.42491:38.1441:-122.03737:-122.54274,no-outline
Recently, Redfin added bot detection and started blocking the crawl after roughly 70 hits.
You'll need to know Scrapy well enough to work around that.
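As one starting point, slowing the crawl down usually helps stay under a rate-based block. The sketch below is a hypothetical settings fragment: `DOWNLOAD_DELAY`, `RANDOMIZE_DOWNLOAD_DELAY`, `CONCURRENT_REQUESTS_PER_DOMAIN`, and the `AUTOTHROTTLE_*` options are real Scrapy settings, but the specific values are only guesses to be tuned against the actual block threshold.

```python
# Hypothetical Scrapy settings fragment for throttling the Redfin crawl.
# The setting names are standard Scrapy; the values are assumptions to tune.
THROTTLE_SETTINGS = {
    "DOWNLOAD_DELAY": 5,                  # seconds between requests to one domain
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # vary delay 0.5x-1.5x, less bot-like
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # never hit the site in parallel
    "AUTOTHROTTLE_ENABLED": True,         # back off automatically when slow
    "AUTOTHROTTLE_START_DELAY": 5,
    "AUTOTHROTTLE_MAX_DELAY": 60,
    "USER_AGENT": "Mozilla/5.0 (placeholder; rotate real browser UA strings)",
}
```

These can go in `settings.py` or a spider's `custom_settings`; rotating user agents (and, if needed, proxies) is a separate middleware concern on top of this.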
You can break the geographical search area into smaller regions to reduce the number of hits per crawl.
I would also like to crawl the listings of every county in California.
You can break the counties down into cities to keep the hits below the threshold.
If some cities have too many listings (like Los Angeles), you'll need to split those cities into ZIP codes.
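The county → city → ZIP fallback described above could be sketched as follows. Everything here is an assumption for illustration: the `count_listings` callable (some way to estimate listings per area before crawling), the index dicts, and the threshold of 70 taken from the observed block point.

```python
MAX_HITS = 70  # assumed threshold, based on where the current crawl gets blocked

def areas_to_crawl(counties, city_index, zip_index, count_listings):
    """Yield the coarsest areas whose estimated listing counts stay under MAX_HITS.

    counties:       iterable of county names
    city_index:     dict mapping county -> list of its cities
    zip_index:      dict mapping city -> list of its ZIP codes
    count_listings: callable(area) -> estimated number of listings (hypothetical)
    """
    for county in counties:
        if count_listings(county) < MAX_HITS:
            yield county          # whole county fits in one crawl
            continue
        for city in city_index[county]:
            if count_listings(city) < MAX_HITS:
                yield city        # city-level granularity is enough
            else:
                # Cities like Los Angeles: fall back to ZIP-code granularity
                yield from zip_index[city]
```

Each yielded area would then be turned into a Redfin filter URL and fed to the spider as a start URL.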
I would like to be able to crawl all of the above areas at least twice a week.
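For the twice-weekly cadence, a plain cron entry would be one simple option; the spider name (`redfin`), project path, and run times below are placeholders, not part of the existing setup.

```shell
# Hypothetical crontab entry: run the crawl Mondays and Thursdays at 02:00.
# "redfin" and /path/to/project are placeholders for the real spider and path.
0 2 * * 1,4  cd /path/to/project && scrapy crawl redfin >> crawl.log 2>&1
```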
For similar work requirements, feel free to email us at firstname.lastname@example.org.