Feat: process_start_urls in parallel #159
Open
aircloud wants to merge 2 commits into howie6879:master from
Conversation
Owner

This issue does exist: ruia first collects all the urls, then starts crawling.

But why not, in the outer layer, hand each url to your ruia Spider subclass as soon as you receive it? For example:

```python
class DemoSpider(Spider):
    pass

async for url in mq_urls:
    DemoSpider.start_urls = [url]
    await DemoSpider.start()
```

Or have I misunderstood what you mean?
Author
With the approach above, as I understand it, it amounts to: fetch one url, crawl it, then fetch the next. Doesn't that mean we lose ruia's concurrency entirely?
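The author's concern can be sketched with plain asyncio: awaiting each crawl inside the loop serializes the work, while scheduling a task per url as it arrives keeps the crawls concurrent. `mq_urls` and `crawl` below are illustrative stand-ins for the message-queue stream and a spider run, not ruia's actual API.

```python
import asyncio

async def mq_urls():
    # Stand-in for a message queue: urls trickle in over time.
    for i in range(3):
        await asyncio.sleep(0.1)
        yield f"https://example.com/{i}"

async def crawl(url):
    # Stand-in for one spider run / network fetch.
    await asyncio.sleep(0.3)
    return url

async def sequential():
    # "Fetch one, crawl one, fetch the next": each crawl blocks the loop.
    results = []
    async for url in mq_urls():
        results.append(await crawl(url))
    return results

async def concurrent():
    # Schedule each crawl as soon as its url arrives, then await them all;
    # gather() preserves the order in which tasks were scheduled.
    tasks = []
    async for url in mq_urls():
        tasks.append(asyncio.create_task(crawl(url)))
    return await asyncio.gather(*tasks)

print(asyncio.run(concurrent()))
```

The sequential variant takes roughly the sum of all crawl times; the concurrent one overlaps crawling with receiving the next url, which is the concurrency the comment above is worried about losing.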
process_start_urls currently blocks the workers below it, which forces users to have all of their urls up front.
Sometimes, however, urls arrive continuously from an API or a message queue and are not all available at the start.
My change therefore runs it in parallel; when you have time, please take a look at whether this approach has any problems.
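As I read the proposal, the idea resembles a producer/consumer setup in which the start-url producer runs alongside the workers instead of completing before they start. A minimal sketch with a shared `asyncio.Queue`; the names `process_start_urls` and `worker` mirror the PR's wording but this is not ruia's actual implementation:

```python
import asyncio

async def process_start_urls(queue):
    # Producer: in real code this could read from an API or message queue.
    for i in range(5):
        await asyncio.sleep(0.05)         # urls arrive over time
        await queue.put(f"https://example.com/{i}")
    await queue.put(None)                 # sentinel: no more urls

async def worker(queue, results):
    # Consumer: starts pulling urls immediately, before the producer is done.
    while True:
        url = await queue.get()
        if url is None:
            await queue.put(None)         # re-queue sentinel so siblings exit
            break
        results.append(url)               # real code would fetch and parse here

async def main():
    queue = asyncio.Queue()
    results = []
    await asyncio.gather(
        process_start_urls(queue),        # producer runs alongside the workers
        *(worker(queue, results) for _ in range(3)),
    )
    return results

print(sorted(asyncio.run(main())))
```

Because the producer and workers are gathered together, crawling begins as soon as the first url is enqueued, rather than after the whole start-url list has been materialized.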