Collect data | Machine Learning Phase I #162

Merged
PSNAppz merged 15 commits into dev from collect_data
Oct 4, 2019

Conversation

@KingAkeem
Member

@KingAkeem KingAkeem commented May 25, 2019

Issue #161

Changes Proposed

  • Gather data from thehiddenwiki.org; use the --gather flag to run the operation.
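
A minimal sketch of how a --gather switch could be wired up with argparse. The `build_parser` name, the description, and the help text are assumptions for illustration; the project's actual argument handling is not shown in this conversation.

```python
import argparse


def build_parser():
    """Build a hypothetical CLI parser exposing only the --gather switch."""
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument(
        "--gather",
        action="store_true",  # boolean switch: present means True
        help="collect site data for the machine learning dataset",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.gather:
        print("gathering data...")
```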

Explanation of Changes

Save entries to a CSV file with the columns ID | TITLE | META TAGS | CONTENT

  • ID - Unique ID that corresponds to a site
  • Title - Title of site
  • Meta tags - Metadata found on the site in <meta> elements
  • Content - Raw HTML from the site
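
The column layout above could be written with Python's csv module roughly as follows. This is only a sketch: the `write_entries` helper, the dictionary keys, the sequential integer IDs, and the "; " separator for meta tags are assumptions, not the code from this PR.

```python
import csv

# Column names taken from the description above.
FIELDNAMES = ["ID", "TITLE", "META TAGS", "CONTENT"]


def write_entries(entries, out):
    """Write one CSV row per site to the text stream `out`.

    `entries` is an iterable of dicts with the (hypothetical) keys
    'title', 'meta_tags' (a list of strings), and 'content'.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    # Assumption: the unique ID is a running counter starting at 1.
    for site_id, entry in enumerate(entries, start=1):
        writer.writerow({
            "ID": site_id,
            "TITLE": entry["title"],
            "META TAGS": "; ".join(entry["meta_tags"]),
            "CONTENT": entry["content"],
        })
```

When writing to a real file, opening it with `open(path, "w", newline="")` lets the csv module control line endings itself.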

Member

@PSNAppz PSNAppz left a comment

The code looks good, but two things need to change.

  1. We just need to change the content part. Most websites contain a <meta content="some description" name="description"> tag, and that is the only information we need. If it is empty, we could grab the contents of the <body> tag instead. This way most of the noise is removed.

  2. The title is not properly saved. Make sure you grab the title from the <title> tag.
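
The two requested changes could look roughly like the sketch below. It uses the standard library's html.parser instead of the soup object used in this PR, so that it runs without extra dependencies. The `PageParser` and `extract_fields` names are hypothetical, and skipping <script> and <style> text is an added assumption about what counts as noise.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect the <title> text, the description meta tag, and <body> text."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.description = ""
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title_parts.append(text)
        elif self._in_body and not self._skip_depth:
            self.body_parts.append(text)


def extract_fields(html):
    """Return (title, content) for a page.

    Content is the description meta tag when it is non-empty,
    otherwise the text found inside <body>.
    """
    parser = PageParser()
    parser.feed(html)
    parser.close()
    title = " ".join(parser.title_parts) or "No Title"
    content = parser.description or " ".join(parser.body_parts)
    return title, content
```

Preferring the description and falling back to the body text only when it is empty keeps the stored content short for pages that describe themselves.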

@KingAkeem
Member Author

The title is being saved properly; I'm grabbing the text.

title = soup.title.getText() if soup.title else 'No Title'

@DedSecInside DedSecInside deleted a comment May 29, 2019
@KingAkeem
Member Author

This is ready to be re-reviewed

@KingAkeem KingAkeem closed this May 31, 2019
@KingAkeem KingAkeem reopened this May 31, 2019
@PSNAppz
Member

PSNAppz commented Jun 23, 2019

  1. We just need to change the content part. Most websites contain a <meta content="some description" name="description"> tag, and that is the only information we need. If it is empty, we could grab the contents of the <body> tag instead. This way most of the noise is removed.

This is still not fixed?

@KingAkeem
Member Author

That was done in commit 6520073.

@PSNAppz
Member

PSNAppz commented Jul 3, 2019

Ready for review?

@KingAkeem
Member Author

Yep yep

Member

@PSNAppz PSNAppz left a comment

Looks good

@PSNAppz PSNAppz changed the title from "Collect data" to "Collect data | Machine Learning Phase I" Oct 4, 2019
@PSNAppz PSNAppz merged commit 6a3cea4 into dev Oct 4, 2019
@KingAkeem KingAkeem deleted the collect_data branch July 9, 2021 23:04