- `UrlStore.get_download_urls()`: `timelimit` removed, fix type hints (#119, 19c580e)
- `extract_links()`: deprecate `base_url` parameter (#121)
- setup: simplify workflow (#118)
- `UrlStore` compression: make bz2 and zlib optional, update pickle protocol (#113)
- `extract_links()`: review and document, add deprecation warning for `base_url` argument (#115)
- maintenance: add `__all__` to `__init__.py` and lint code (#116)
- parsing: validate netloc with port number by @naz-theori in #104
- cleaning: fix handling of apostrophes (#107)
- maintenance: deprecate Python 3.6 & 3.7, add `pyproject.toml` setup file (#59, #105)
- more compact UrlStore: use bytes instead of str for URL paths (#88)
- UrlStore maintenance: deprecate `timelimit` argument (#101)
- maintenance: simplify code (#103)
- support for Python 3.13
- replace `langcodes` by `babel` and use its information on locales (#89, #92)
- simplified and faster code: domain extraction, cleaning, filters and UrlStore (#90, #93, #94, #95)
- UrlStore: better URL batches, replace `timelimit` parameter by `time_limit` (#91)
- maintenance: update readme and convert it to markdown (#97)
- license change from GPLv3+ to Apache 2.0 (#81)
- UrlStore: `write()` method and `load_store()` function added (#83)
- add parameter `trailing_slash` to keep or discard slashes at the end of URLs (#52)
- maintenance: fix whitespace in `clean_url()` (#77), simplify code (#79)
- IRI to URI normalization: encode path, query and fragments (#58, #60)
- normalization: strip common trackers (#65)
- new function `is_valid_url()` (#63)
- hardening of domain filter (#64)
- new UrlStore functions: `add_from_html()` (#42), `discard()` (#44), `get_unvisited_domains`
- CLI: removed `--samplesize`, use `--sample` with an integer instead (#54)
- added plausibility filter for domains/hosts (#48)
- speedups and more efficient processing (#47, #49, #50)
- fixed handling of relative URLs with @feltcat in #46
- fixed bugs and ensured compatibility (#41, #43, #51, #56)
- official support for Python 3.12
- more efficient URL parsing (#33)
- refined link extraction and link filters (#30, #36)
- more efficient normalization (#32)
- more efficient sampling strategy (#31, #35)
- added meta function to clear LRU caches (#34)
- added parallel option in command-line interface (#37, #39)
- added `get_unvisited_domains()` method to `UrlStore` (#40)
- add blogspot archives to type filter
- maintenance: upgrade `urllib3` and review code
- network tests: larger throughput
- UrlStore: optional compression of rules (#21), added `reset()` (#22) and `get_all_counts()` methods
- UrlStore fixes: `signal` in #18, `total_url_number`
- updated Readme
- hardening of filters and URL parses (#14)
- normalize punycode to unicode
- methods added to `UrlStore`: `get_crawl_delay()`, `print_unvisited_urls()`
- `UrlStore` now triggers exit code 1 when interrupted
- argument added to `extract_links()`: `no_filter`
- code refactoring: simplifications
- fixed bug in domain name extraction
- uniform logging parameters
- full type hinting
- maintenance: code linted
- add type annotations and check with `mypy`
- `url_filter()` function moved from Trafilatura
- code style: use `black`
- performance optimizations
- fast track for domain extraction (`extract_domain(url, fast=True)`), now taking subdomains into account
- UrlStore: threading lock and convenience functions added
- bug in sampling fixed
- UrlStore: validation by default
- UrlStore class added: data store containing URLs with relevant information
- code cleaning and maintenance (bugs, simplification)
- reviewed code base: simplicity and execution speed
- dropped support for Python 3.5
- more complex language heuristics, use langcodes
- extended blacklists and whitelists
- more precise filters and more efficient code
- support for Python 3.10
- enhanced cleaning
- fixed language filter
- keep trailing slashes to avoid redirection
- fixes: normalization and crawlable URLs
- URL manipulation tools added: extract parts, fix relative URLs
- filters added: language, navigation and crawls
- more robust link handling and extraction
- removed support for Python 3.4
- improve filter precision
- reduced dependencies: replace requests with bare urllib3, and tldextract with tld for Python 3.6 upwards
- better path and fragment normalization
- Python 3.9 compatibility
- Simplified imports
- Bug fixes
- English and German language filters
- Function to detect external links
- Support for domain blacklisting
- Less aggressive strict filters
- CLI bug fixed
- Cleaner and more efficient filtering
- Helper functions to scrub, clean and normalize
- Removed two dependencies with more extensive usage of urllib.parse
- Cleaning and filtering targeting non-spam HTML pages with primarily text
- URL validation
- Sampling by domain name
- Command-line interface (CLI) and Python tool