A Swiss-army knife library for scraping and processing data from the web. It provides a unified interface over multiple HTTP clients, plus convenience functionality for parsing and preprocessing data for your applications to use.
- Quickly build HTTP requests for a variety of data formats and APIs.
- Parse common data formats such as XML, HTML, and JSON.
- Push your aggregated data automatically to your preferred data store (such as Kafka, MySQL, or Postgres).
- Write your own collectors for non-standard data formats.
graph LR
EXT1[(External HTTP API)]
EXT2[(External HTTP API)]
EXT3[(External HTTP API)]
COL1[/Collector/]
COL2[/Collector/]
COL3[/Collector/]
DB[(Application Database)]
BE1[Backend Application]
BE2[Backend Application]
BE3[Backend Application]
subgraph AP[Aggregation Framework]
COL1
COL2
COL3
end
EXT1 --> COL1
EXT2 --> COL2
EXT3 --> COL3
COL1 & COL2 & COL3 --> DB --> BE1 & BE2 & BE3
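The flow in the diagram (external source, collector, application database) can be sketched in plain Scala. This is a minimal illustration under stated assumptions, not the library's actual API: the `Collector` trait, its `fetch`/`parse`/`run` methods, the `Reading` record, and the `name=value` line format are all hypothetical names invented for this example. The fetch step is stubbed so the sketch runs without network access, and an in-memory buffer stands in for the database.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical trait for illustration only; not the framework's real interface.
trait Collector[A] {
  /** Fetch the raw payload from the external source. */
  def fetch(): String

  /** Turn the raw payload into application records. */
  def parse(raw: String): Seq[A]

  /** Fetch, parse, and hand each record to the sink; returns the record count. */
  def run(sink: A => Unit): Int = {
    val records = parse(fetch())
    records.foreach(sink)
    records.size
  }
}

// Hypothetical record type for the made-up "name=value" format below.
final case class Reading(sensor: String, value: Double)

// A collector for a non-standard line format, with a stubbed fetch step.
object StubCollector extends Collector[Reading] {
  def fetch(): String = "temp=21.5\nhumidity=40\nmalformed"

  // Lines that do not look like "name=number" are skipped.
  def parse(raw: String): Seq[Reading] =
    raw.linesIterator.flatMap { line =>
      line.split("=", 2) match {
        case Array(k, v) => v.trim.toDoubleOption.map(Reading(k.trim, _))
        case _           => None
      }
    }.toSeq
}

object Demo {
  def main(args: Array[String]): Unit = {
    // In-memory stand-in for the application database.
    val store = ArrayBuffer.empty[Reading]
    val pushed = StubCollector.run(r => { store += r; () })
    println(s"pushed $pushed records: ${store.mkString(", ")}")
  }
}
```

Keeping fetching, parsing, and pushing as separate steps means a collector for a new data format only needs its own `parse`, while the sink can be swapped without touching the collector.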
Add Aggregation Framework and your preferred extensions to your project. For sbt:
// add Forge as a resolver
resolvers += "Gitea Package API" at "https://forge.cptlobster.dev/api/packages/cptlobster/maven"
libraryDependencies += "dev.cptlobster" %% "aggregation-framework-core" % "0.1.0-SNAPSHOT"
// for JSON parsing
libraryDependencies += "dev.cptlobster" %% "aggregation-framework-json" % "0.1.0-SNAPSHOT"

Note: Snapshot versions are available here at forge.cptlobster.dev. Release versions will be made available on Maven Central at a future date.
To create a consumer, follow the tutorial.
The project is split into several packages so that you only pull in the external dependencies you actually use.
The core package is located under /core in this repository, and the extension packages are located under their own
subdirectories in /ext. Each extension package has its own README that describes it in more detail.
graph BT
CORE[aggregation-framework-core]
JSON[aggregation-framework-json]
KAFKA[aggregation-framework-kafka]
SEL[aggregation-framework-selenium]
RUNNER[aggregation-framework-runner]
CORE --> JSON & KAFKA & SEL & RUNNER
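Because every extension depends on the core package, a build usually lists the core plus only the extensions it needs. The sbt fragment below is a sketch that assumes the extension artifacts are published under the same organization (`dev.cptlobster`) and version as the core package, and that the Forge resolver is already configured; `aggregationFrameworkVersion` is a name introduced here for illustration. Check each extension's README for its actual coordinates.

```scala
// Assumption: extensions share the core package's organization and version.
val aggregationFrameworkVersion = "0.1.0-SNAPSHOT"

libraryDependencies ++= Seq(
  "dev.cptlobster" %% "aggregation-framework-core"  % aggregationFrameworkVersion,
  // add only the extensions you use, for example Kafka output
  "dev.cptlobster" %% "aggregation-framework-kafka" % aggregationFrameworkVersion
)
```

Holding the version in a single value keeps the core and its extensions in step when you upgrade.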
This project uses sbt for project and dependency management. Install sbt via your preferred package manager; if you use IntelliJ, it can manage sbt for you.
To build the entire project:
sbt compile

This program is licensed under the GNU Lesser General Public License, version 3.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General
Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any
later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU Lesser General Public License (and the GNU General Public License) along
with this program. If not, see https://www.gnu.org/licenses/.