Add an initial content-tree-to-text implementation for indexing #42
Based on the transformer from here: https://github.com/Financial-Times/ccf-content-rw-elasticsearch/blob/b01e6aab54d53167ff5e8b59c25171a023978c60/pkg/html/transformer.go
There are a few test cases (input -> expected output) in the root tests/ folder that could be used by implementations in other languages, so we know we're all doing it the same way. We can add more tests as we like. I've just done an old Methode article and a Spark kitchen sink so far.
Folders for test cases have also been added for a future content-tree->external-content bodyxml converter (will be used for external consumers in 2024 and beyond), and a bodyxml->content-tree converter (for a future migration tool 👀 for our old articles).
This is just a first pass and can probably be improved; it's mostly here to register the idea and move towards the future.
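For anyone porting this to another language, here is a rough sketch of the general idea in Python. It is not the code in this PR. It assumes text nodes have `type: "text"` with their content in `value` and that every other node keeps its descendants in `children`. The `stringify` name and the sample tree are made up for illustration.

```python
from typing import Any, Dict

Node = Dict[str, Any]


def stringify(node: Node) -> str:
    """Concatenate the text of a node and all its descendants, depth-first."""
    # Assumption: leaf text nodes carry their content in "value".
    if node.get("type") == "text":
        return node.get("value", "")
    # Any other node contributes only the text found in its children.
    return "".join(stringify(child) for child in node.get("children", []))


# Hypothetical sample tree, not one of the fixtures in tests/.
sample = {
    "type": "body",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {"type": "text", "value": "Hello "},
                {"type": "strong", "children": [{"type": "text", "value": "world"}]},
            ],
        }
    ],
}

print(stringify(sample))  # prints "Hello world"
```

A port along these lines could then be run against the input/expected-output pairs in tests/ to check it agrees with the other implementations. Separators between block-level nodes are deliberately left out of this sketch.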
cba76be to 0f0aa1c
adgad
left a comment
nice, love the simplicity/usefulness of the test inputs/outputs.
Coupla small comments, but generally looks good from my POV. But better for C&M to review the actual usability.
I guess eventually it could also be used by CP to generate the bodyText used in Elasticsearch.
Compared to what we currently have in the C&M platform, this approach is so concise and meaningful. I hope we can contribute a Go implementation soon.
I'm investigating "transient" and "opaque" elements in the body (whether their text should be present in the "stringified" version of the content), but this is a detail that we can figure out later.
And it's not only ft.com: we are currently indexing 3 different stores with a "tag-free" version of the content body. The Content Analytics team also indexes the content for the Professional search, and there is also our Search & Recommendations team. Just saying that once we are confident in the implementations, a lot of teams will benefit from them.