On Mac OS with Python 2.7, Scrapy has an issue: it cannot read the requests.seen file. So when you use the JOBDIR setting to pause and resume a crawl, the duplicate filter does not work.
You have to change the source code of Scrapy:
Path: /Library/Python/2.7/site-packages/scrapy/dupefilter.py
class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""
    def __init__(self, path=None, debug=False):
        self.file = None
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            # Added: rewind to the start so the saved fingerprints can be read.
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)
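The self.file.seek(0) line is the one that is added. A file opened in 'a+' mode can start out positioned at the end of the file, so without the seek the loop over self.file reads nothing and none of the saved fingerprints are loaded.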
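To see the effect outside of Scrapy, here is a minimal sketch that reproduces the same open / seek / read pattern on a throwaway file. The helper name load_fingerprints, the rewind flag, and the temporary directory are made up for this illustration and are not part of Scrapy.

```python
import os
import tempfile


def load_fingerprints(path, rewind=True):
    """Open requests.seen in 'a+' mode and read the fingerprints already stored.

    Hypothetical helper that mirrors the patched __init__ above.
    """
    f = open(os.path.join(path, 'requests.seen'), 'a+')
    if rewind:
        # Move to the start of the file; otherwise reading may begin at the end.
        f.seek(0)
    fingerprints = set(line.rstrip() for line in f)
    return f, fingerprints


# Simulate a job directory left behind by an earlier, paused crawl.
jobdir = tempfile.mkdtemp()
with open(os.path.join(jobdir, 'requests.seen'), 'w') as out:
    out.write('aaa\nbbb\n')

seen_file, seen = load_fingerprints(jobdir)
# Append mode still writes at the end of the file, even after seek(0).
seen_file.write('ccc\n')
seen_file.close()

with open(os.path.join(jobdir, 'requests.seen')) as check:
    contents = check.read()
```

With rewind=True the two stored fingerprints are loaded, and the new one is still appended after them. Calling the helper with rewind=False on an affected platform would be expected to return an empty set, which is the behaviour described above.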
Reference:
http://stackoverflow.com/questions/14639936/how-to-read-from-file-opened-in-a-mode