diff --git a/README.md b/README.md
index 8399fc6aeee19f84e20025b73b6c884148601f7e..5a4e8f5cf5d3eb549f558e2ce94527c52cdc2c0e 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,8 @@
 its ability to *scale down* for small installations, using very few
 resources while maintaining a certain level of usefulness, offering an
 alternative to heavyweight stacks like ELK in this scenario.
 
+[[_TOC_]]
+
 ## Overview
 
 The system's functionality is split into two parts:
@@ -111,9 +113,9 @@
 The flattened records are then written to
 periodically (and when they reach a certain size). These files can be
 stored remotely, on S3-like backends.
 
-The ingestion API endpoint is at */ingest*, and it expects a POST
-request with a ND-JSON request body: newline-delimited JSON-encoded
-records, no additional headers or footers.
+The ingestion API endpoint is at `/ingest`, and it expects a POST
+request with an ND-JSON request body (newline-delimited JSON-encoded
+records, no additional headers or footers).
 
 ### Schema unification
@@ -151,7 +153,7 @@
 you won't see logs until the ingestion server decides it's time to
 finalize the current Parquet file. For this reason, it might be
 sensible to set the *--rotation-interval* option to a few minutes.
 
-The query API is at */query* and it takes a full SQL query as the *q*
+The query API is at `/query` and it takes a full SQL query as the *q*
 parameter. The response will be JSON-encoded. Since the table to query
 is created on-the-fly with every request, its name is not known in
 advance to the caller: the SQL query should contain the placeholder
@@ -214,10 +216,8 @@ the URI scheme:
 * *minio* - Generic S3-like API support. Use standard environment
   variables (MINIO_ACCESS_KEY etc) for credentials, URIs should have
   this form: `minio://hostname/bucket/path`
-* *s3* - AWS S3 (not ready yet). Supports URIs like `s3://bucket/path`
-* *gcs* - Google Cloud Storage (not ready yet).
-  Supports URIs of the form `gcs://project_id/bucket/path`
@@ -227,9 +227,9 @@ the URI scheme:
 The server offers some debugging endpoints which might be useful to
 understand what it is doing:
 
-* */schema* will return the current schema in JSON format
-* */debug/schema* will return a human-readable dump of the internal
-  state of the schema guesser
+* `/schema` will return the current schema in JSON format
+* `/debug/schema` will return a human-readable dump of the internal
+  state of the schema guesser, including a report of the errors encountered
 
 ### Performance and Scaling
@@ -261,3 +261,10 @@
 is certainly possible to run multiple instances of *pqlogd* in
 parallel, pointing them at the same storage: generated filenames are
 unique, so the query layer will maintain the aggregate view of all
 logs.
+
+Note that multiple instances of the indexer will each run their own,
+independent schema analysis, which can potentially result in different
+schemas depending on the input. This is not an issue, because what
+matters is that the schema is consistent within each individual
+Parquet file: the database engine can easily merge those together at
+query time.
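The ND-JSON ingestion contract described in the diff above (a POST to `/ingest` whose body is newline-delimited JSON records, with no extra framing) can be sketched as follows. This is a minimal illustration: the record field names and the host/port in the comment are assumptions, not part of the project.

```python
import json

def to_ndjson(records):
    """Serialize records as ND-JSON: one JSON object per line,
    with no additional headers or footers."""
    return "\n".join(json.dumps(r) for r in records) + "\n"

# Hypothetical log records; the schema guesser accepts arbitrary fields.
records = [
    {"host": "web1", "level": "info", "msg": "request served"},
    {"host": "web2", "level": "error", "msg": "upstream timeout"},
]

payload = to_ndjson(records)

# A POST of `payload` to the ingestion endpoint would then look
# roughly like this (sketch only; base URL is an assumption):
#
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:3000/ingest",
#       data=payload.encode("utf-8"),
#       method="POST",
#   )
#   urllib.request.urlopen(req)
```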
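Building a `/query` request can be sketched the same way. The actual table-name placeholder string is not shown in the hunk above, so `TABLE` below is a hypothetical stand-in for it, and the SQL itself is only an example.

```python
import urllib.parse

# Hypothetical stand-in for the project's table-name placeholder,
# which the real SQL query must contain (the server substitutes the
# on-the-fly table name at request time).
PLACEHOLDER = "TABLE"

sql = f"SELECT level, COUNT(*) FROM {PLACEHOLDER} GROUP BY level"

# The query API takes the full SQL statement as the `q` parameter;
# the response is JSON-encoded.
query_url = "/query?" + urllib.parse.urlencode({"q": sql})
```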