- 26 Sep, 2019 2 commits
- 20 Jan, 2019 1 commit
-
ale authored
Introduce an interface to decouple the Enqueue functionality from the Crawler implementation.
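As a rough illustration of this kind of decoupling (not the project's actual code), a handler can depend on a small interface rather than on the crawler itself. The names `Outlink`, `Queue`, `memoryQueue` and `handlePage` below are hypothetical and chosen only for this sketch.

```go
package main

// Outlink is a hypothetical discovered link; the real type may carry more fields.
type Outlink struct {
	URL   string
	Depth int
}

// Queue is a hypothetical interface: the only thing a handler needs
// from the crawler is the ability to enqueue new links.
type Queue interface {
	Enqueue(link Outlink) error
}

// memoryQueue is a trivial Queue implementation, useful in tests.
type memoryQueue struct {
	items []Outlink
}

// Enqueue appends the link to an in-memory slice.
func (q *memoryQueue) Enqueue(link Outlink) error {
	q.items = append(q.items, link)
	return nil
}

// handlePage depends only on Queue, so it can be exercised
// without constructing a full crawler.
func handlePage(q Queue, depth int, links []string) error {
	for _, u := range links {
		if err := q.Enqueue(Outlink{URL: u, Depth: depth + 1}); err != nil {
			return err
		}
	}
	return nil
}
```

With this shape, any type that provides `Enqueue` can stand in for the crawler when testing handlers.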
-
- 19 Jan, 2019 1 commit
-
ale authored
The whole URLInfo structure, while neat, is unused except to check whether we have already seen a specific URL. That presence check is now also limited to Enqueue().
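A minimal sketch of a presence check confined to the enqueue path might look like the following. It assumes an in-memory set purely for illustration, whereas the crawler itself appears to persist state in a database; `seenSet`, `markIfNew` and `dedupQueue` are hypothetical names.

```go
package main

import "sync"

// seenSet is a hypothetical in-memory stand-in for the crawler's URL store.
type seenSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenSet() *seenSet {
	return &seenSet{seen: make(map[string]struct{})}
}

// markIfNew records the URL and reports whether it was seen for the first time.
func (s *seenSet) markIfNew(url string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[url]; ok {
		return false
	}
	s.seen[url] = struct{}{}
	return true
}

// dedupQueue performs the presence check in exactly one place: Enqueue.
type dedupQueue struct {
	seen    *seenSet
	pending []string
}

// Enqueue adds the URL to the pending list unless it was already seen.
// It reports whether the URL was actually queued.
func (q *dedupQueue) Enqueue(url string) bool {
	if !q.seen.markIfNew(url) {
		return false
	}
	q.pending = append(q.pending, url)
	return true
}
```

Only a set of keys is needed for this check, which is why a richer per-URL record is unnecessary here.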
-
- 02 Jan, 2019 1 commit
-
ale authored
The output stage can now write to size-limited, rotating WARC files using a user-specified pattern, so that output files are always unique.
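The rotation idea can be sketched as a writer that opens a new file whenever the size limit would be exceeded. The pattern syntax here is an assumption: this sketch uses a `fmt` pattern with a single `%d` sequence number, which may differ from the pattern format the tool actually accepts. `rotatingWriter`, `recordingOpener` and `nopCloser` are hypothetical names.

```go
package main

import (
	"fmt"
	"io"
)

// rotatingWriter starts a new output whenever maxSize would be exceeded.
type rotatingWriter struct {
	pattern string // assumed: fmt pattern with one %d for the sequence number
	maxSize int64
	open    func(name string) (io.WriteCloser, error)

	cur  io.WriteCloser
	size int64
	seq  int
}

// rotate closes the current output (if any) and opens the next one.
func (w *rotatingWriter) rotate() error {
	if w.cur != nil {
		if err := w.cur.Close(); err != nil {
			return err
		}
	}
	f, err := w.open(fmt.Sprintf(w.pattern, w.seq))
	if err != nil {
		return err
	}
	w.cur, w.size = f, 0
	w.seq++
	return nil
}

// Write rotates first if this record would push a non-empty file past the limit.
func (w *rotatingWriter) Write(p []byte) (int, error) {
	if w.cur == nil || (w.size > 0 && w.size+int64(len(p)) > w.maxSize) {
		if err := w.rotate(); err != nil {
			return 0, err
		}
	}
	n, err := w.cur.Write(p)
	w.size += int64(n)
	return n, err
}

// nopCloser adapts an io.Writer for the in-memory opener below.
type nopCloser struct{ io.Writer }

func (nopCloser) Close() error { return nil }

// recordingOpener discards data but remembers every name it was asked to open.
type recordingOpener struct{ names []string }

func (r *recordingOpener) open(name string) (io.WriteCloser, error) {
	r.names = append(r.names, name)
	return nopCloser{io.Discard}, nil
}
```

A real opener would create files on disk, for example refusing to overwrite an existing name so that outputs stay unique across runs.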
-
- 28 Dec, 2018 1 commit
-
ale authored
-
- 27 Dec, 2018 2 commits
- 06 Dec, 2018 1 commit
-
ale authored
-
- 02 Sep, 2018 4 commits
- 31 Aug, 2018 7 commits
-
ale authored
-
ale authored
-
ale authored
Makes it possible to retry requests that fail with temporary HTTP errors (429, 500, etc.).
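A minimal sketch of the retry idea follows. Which status codes count as temporary is an assumption here (429 plus the whole 5xx range), and the sketch retries immediately with no backoff or re-queueing, which a real crawler would likely want. `isTemporaryStatus`, `fetchWithRetry` and `errTemporary` are hypothetical names.

```go
package main

import "errors"

// errTemporary is returned when every attempt ended in a temporary status.
var errTemporary = errors.New("temporary HTTP error")

// isTemporaryStatus reports whether a status code is worth retrying.
// Assumption: 429 (Too Many Requests) and all 5xx responses are temporary.
func isTemporaryStatus(code int) bool {
	return code == 429 || (code >= 500 && code <= 599)
}

// fetchWithRetry calls fetch up to maxAttempts times, stopping at the first
// non-temporary status. Transport errors from fetch are returned immediately.
func fetchWithRetry(fetch func() (int, error), maxAttempts int) (int, error) {
	var code int
	for attempt := 0; attempt < maxAttempts; attempt++ {
		var err error
		code, err = fetch()
		if err != nil {
			return 0, err
		}
		if !isTemporaryStatus(code) {
			return code, nil
		}
	}
	return code, errTemporary
}
```

Permanent failures such as 404 are returned to the caller on the first attempt, so only transient conditions consume extra requests.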
-
ale authored
Handler errors are fatal, so that an error writing the WARC output will cause the crawl to abort.
-
ale authored
-
ale authored
Detect write errors (both on the database and to the WARC output) and abort with an error message. Also fix a bunch of harmless lint warnings.
-
ale authored
-
- 30 Aug, 2018 3 commits
- 19 Dec, 2017 11 commits
-
ale authored
Defaults that are more suitable to real-world site archiving.
-
ale authored
-
ale authored
-
ale authored
-
ale authored
-
ale authored
-
ale authored
-
ale authored
-
ale authored
-
ale authored
-
ale authored
This change allows more complex scope boundaries, including loosening edges a bit to include related resources of HTML pages (which makes for more complete archives if desired).
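One way to picture composable scope boundaries is as small checks combined with a logical OR, so that resources embedded in a page can be admitted even when they fall outside the main boundary. This is only a sketch under that assumption; `Scope`, `prefixScope`, `relatedScope`, `anyOf` and the tag constants are hypothetical names, not the project's API.

```go
package main

import "strings"

// Link tags, assumed for this sketch.
const (
	TagPrimary = iota // links reached by navigation, e.g. <a href>
	TagRelated        // resources embedded in a page: images, CSS, scripts
)

// Scope decides whether a URL should be crawled.
type Scope interface {
	Check(url string, tag int) bool
}

// prefixScope admits URLs that start with a fixed prefix.
type prefixScope struct{ prefix string }

func (s prefixScope) Check(url string, _ int) bool {
	return strings.HasPrefix(url, s.prefix)
}

// relatedScope admits any embedded resource, regardless of its URL.
type relatedScope struct{}

func (relatedScope) Check(_ string, tag int) bool {
	return tag == TagRelated
}

// anyOf admits a URL if at least one member scope admits it.
type anyOf []Scope

func (a anyOf) Check(url string, tag int) bool {
	for _, s := range a {
		if s.Check(url, tag) {
			return true
		}
	}
	return false
}
```

Combining `prefixScope` with `relatedScope` through `anyOf` keeps navigation inside the site while still fetching off-site stylesheets and images that its pages embed.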
-
- 18 Dec, 2017 5 commits
- 03 Jul, 2015 1 commit
-
ale authored