- Aug 24, 2020
- Aug 23, 2020
- Aug 20, 2020
ale authored
-
- Jul 30, 2020
- Feb 17, 2020
- Dec 04, 2019
ale authored
-
- Nov 13, 2019
- Oct 07, 2019
- Sep 26, 2019
- Jan 20, 2019
ale authored
Introduce an interface to decouple the Enqueue functionality from the Crawler implementation.
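A minimal sketch of what such a decoupling could look like. The names `Queue`, `memQueue`, and `handleLinks` are hypothetical and not taken from the repository; the only thing assumed from the commit message is that callers need an `Enqueue` operation and nothing else from the crawler. The in-memory implementation also keeps the "already seen" check inside `Enqueue`, so callers never consult it directly.

```go
package main

import "fmt"

// Queue is a hypothetical interface that exposes only the enqueue
// operation, so link handlers need not know about the full crawler.
type Queue interface {
	Enqueue(url string, depth int) error
}

// memQueue is an illustrative in-memory implementation (an assumption,
// not the project's actual storage). It remembers which URLs it has seen.
type memQueue struct {
	seen    map[string]bool
	pending []string
}

func newMemQueue() *memQueue {
	return &memQueue{seen: make(map[string]bool)}
}

// Enqueue adds the URL to the pending list unless it was seen before.
// The presence check lives here and nowhere else.
func (q *memQueue) Enqueue(url string, depth int) error {
	if q.seen[url] {
		return nil
	}
	q.seen[url] = true
	q.pending = append(q.pending, url)
	return nil
}

// handleLinks depends only on the Queue interface, so it can be tested
// with any implementation.
func handleLinks(q Queue, links []string, depth int) error {
	for _, link := range links {
		if err := q.Enqueue(link, depth+1); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	q := newMemQueue()
	links := []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/a", // duplicate, silently skipped
	}
	if err := handleLinks(q, links, 0); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(q.pending), "URLs pending")
}
```

Because `handleLinks` accepts the interface rather than a concrete crawler, a test can pass in a stub queue without starting a crawl.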
-
- Jan 19, 2019
ale authored
The whole URLInfo structure, while neat, is unused except to check whether we have already seen a specific URL. That presence check is also now limited to Enqueue().
-
- Jan 02, 2019
ale authored
The output stage can now write to size-limited, rotating WARC files using a user-specified pattern, so that output files are always unique.
-
- Dec 28, 2018
ale authored
-
- Dec 27, 2018
- Dec 06, 2018
ale authored
-
- Sep 02, 2018
- Aug 31, 2018
ale authored
-
ale authored
-
ale authored
Makes it possible to retry requests for temporary HTTP errors (429, 500, etc.).
-
ale authored
Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort.
-
ale authored
-
ale authored
Detect write errors (both to the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings.
-
ale authored
-
- Aug 30, 2018