- Aug 23, 2020
- Jul 30, 2020
ale authored
-
- Feb 17, 2020
- Jan 20, 2019
ale authored
Introduce an interface to decouple the Enqueue functionality from the Crawler implementation.
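As an illustration only (not the project's actual code), the decoupling this commit describes can be sketched as a small interface that handlers depend on instead of a concrete crawler. The names `Enqueuer`, `InMemoryQueue`, and `handle_links`, and the `depth` parameter, are hypothetical.

```python
from collections import deque
from typing import Iterable, Protocol


class Enqueuer(Protocol):
    """Hypothetical interface: anything that can accept a URL for later fetching."""

    def enqueue(self, url: str, depth: int) -> None:
        ...


class InMemoryQueue:
    """Stand-in implementation used here in place of a real crawler."""

    def __init__(self) -> None:
        self.items = deque()

    def enqueue(self, url: str, depth: int) -> None:
        self.items.append((url, depth))


def handle_links(queue: Enqueuer, links: Iterable[str], depth: int) -> None:
    """A handler that only needs the Enqueuer interface, not a crawler class."""
    for link in links:
        queue.enqueue(link, depth + 1)


queue = InMemoryQueue()
handle_links(queue, ["https://example.com/a", "https://example.com/b"], depth=0)
```

Because `handle_links` only requires an `enqueue` method, it can be exercised with a trivial in-memory queue, which is the usual motivation for this kind of split.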
-
- Jan 02, 2019
ale authored
The output stage can now write to size-limited, rotating WARC files using a user-specified pattern, so that output files are always unique.
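A minimal sketch of size-limited rotation with a naming pattern, under stated assumptions: the `{seq}` placeholder, the `RotatingWriter` class, and the byte limit are all hypothetical, and the payload is placeholder bytes rather than real WARC records. The project's actual pattern syntax is not shown in this log.

```python
import os
import tempfile


class RotatingWriter:
    """Writes records to a series of files, starting a new one at a size limit."""

    def __init__(self, pattern, max_size):
        self.pattern = pattern      # e.g. "crawl-{seq:05d}.warc" (assumed syntax)
        self.max_size = max_size    # per-file limit in bytes
        self.seq = 0
        self.paths = []
        self._fh = None
        self._written = 0

    def _rotate(self):
        if self._fh is not None:
            self._fh.close()
        path = self.pattern.format(seq=self.seq)
        self.seq += 1
        # Mode "x" fails if the file already exists, so nothing is overwritten.
        self._fh = open(path, "xb")
        self.paths.append(path)
        self._written = 0

    def write_record(self, data):
        # Rotate before writing so that a record is never split across files.
        if self._fh is None or self._written + len(data) > self.max_size:
            self._rotate()
        self._fh.write(data)
        self._written += len(data)

    def close(self):
        if self._fh is not None:
            self._fh.close()
            self._fh = None


outdir = tempfile.mkdtemp()
writer = RotatingWriter(os.path.join(outdir, "crawl-{seq:05d}.warc"), max_size=10)
for _ in range(3):
    writer.write_record(b"123456")  # placeholder bytes, not a real WARC record
writer.close()
```

With a 10-byte limit and 6-byte records, each record lands in its own file. A record larger than the limit still gets written, alone in a fresh file.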
-
- Dec 06, 2018
ale authored
-
- Sep 02, 2018
ale authored
Allow users to add to the exclude regexp lists easily.
-
- Aug 31, 2018
ale authored
-
ale authored
Make it possible to retry requests that fail with temporary HTTP errors (429, 500, etc.).
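For illustration, retrying on temporary HTTP errors might look like the sketch below. It is not the project's implementation: the `fetch_with_retry` helper, the exponential backoff, and the exact set of status codes beyond 429 and 500 are assumptions. The `fetch` callable is injected so the example runs without network access.

```python
import time

# Assumed set of "temporary" statuses; the commit only names 429 and 500.
TEMPORARY_STATUSES = {429, 500, 502, 503, 504}


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body), retrying while the status is temporary."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in TEMPORARY_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    # Out of attempts: return the last (still temporary) response to the caller.
    return status, body


# Fake fetcher that fails twice, then succeeds.
responses = iter([(429, b""), (500, b""), (200, b"ok")])
delays = []
result = fetch_with_retry(
    lambda url: next(responses), "https://example.com/", sleep=delays.append
)
```

Here `delays` records the waits instead of sleeping, which keeps the example fast and makes the backoff schedule easy to inspect.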
-
ale authored
Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort.
-
ale authored
Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings.
-
- Dec 19, 2017
ale authored
Use defaults that are better suited to real-world site archiving.
-
ale authored
-
ale authored
-
ale authored
-
ale authored
This change allows more complex scope boundaries, including loosening the scope slightly at its edges to include resources related to HTML pages (which makes for more complete archives, if desired).
-
- Jul 03, 2015
ale authored
-
- Jun 29, 2015
ale authored
-
- Jun 28, 2015
ale authored
-
- Dec 20, 2014
- Dec 19, 2014
ale authored
-