Skip to content

Add an initial content-tree-to-text implementation for indexing#42

Merged
chee merged 3 commits into
mainfrom
content-tree-to-text
Jan 4, 2024
Merged

Add an initial content-tree-to-text implementation for indexing#42
chee merged 3 commits into
mainfrom
content-tree-to-text

Conversation

@chee
Copy link
Copy Markdown
Contributor

@chee chee commented Dec 20, 2023

Based on the transformer from here:

https://github.com/Financial-Times/ccf-content-rw-elasticsearch/blob/b01e6aab54d53167ff5e8b59c25171a023978c60/pkg/html/transformer.go

There are a few test case input->expected output in the root tests/ folder that could be used by implementations in other languages, so we know we're all doing it the same way. We can add more tests as we like. I've just done an old methode article and a Spark kitchen sink so far.

folders have been added for test cases have also been added for a future content-tree->external-content bodyxml converter (will be used for external consumers in 2024 and beyond), and a bodyxml->content-tree (for a future migration tool 👀 for our old articles).

This is just a first pass, can probably be improved, mostly here to register the idea and move towards the future

@chee chee requested review from a team as code owners December 20, 2023 15:17
@chee chee requested review from adgad and katiefenn and removed request for a team December 20, 2023 15:17
Based on the transformer from here:

https://github.com/Financial-Times/ccf-content-rw-elasticsearch/blob/b01e6aab54d53167ff5e8b59c25171a023978c60/pkg/html/transformer.go

There are a few test case input->expected output in the root tests/ folder that could be used by implementations in other languages, so we know we're all doing it the same way. We can add more tests as we like. I've just done an old methode article and a Spark kitchen sink so far.

folders have been added for test cases have also been added for a future content-tree->external-content bodyxml converter (will be used for external consumers in 2024 and beyond), and a bodyxml->content-tree (for a future migration tool 👀 for our old articles).

This is just a first pass, can probably be improved, mostly here to register the idea and move towards the future
@chee chee force-pushed the content-tree-to-text branch from cba76be to 0f0aa1c Compare December 20, 2023 15:19
@chee chee removed the request for review from katiefenn January 2, 2024 09:38
Copy link
Copy Markdown
Collaborator

@adgad adgad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, love the simplicity/usefulness of the test inputs/outputs.

Coupla small comments, but generally looks good from my POV. But better for C&M to review the actual usability.

I guess eventually it could also be used by CP to generate the bodyText used in elasticsearch

Comment thread package.json
Comment thread libraries/content-tree-to-string/test.js Outdated
Comment thread libraries/content-tree-to-string/test.js Outdated
Copy link
Copy Markdown
Contributor

@epavlova epavlova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compared to what we currently have in the C&M platform, this approach is so concise and meaningful. I hope soon we can contribute with a Go implementation.

I'm investigating "transient" and "opaque" elements in the body(whether their text should be present in the "stringified" version of the content) but this is detail that we can figure it out later.

And it's not only ft.com, we are currently indexing 3 different stores with "tag-free" version of the body of the content. Content analytics team also index the content for the Professional search and there is also our Search&Recommendations team. Just saying that when we are confident with the implementations a lot of teams will benefit from it.

@chee chee merged commit d48aa94 into main Jan 4, 2024
@chee chee deleted the content-tree-to-text branch January 4, 2024 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants