There are content pieces which contain unnecessary whitespace between XHTML tags in their body. This issue is closely related to PR #61.
Currently, the content tree’s body-block definition does not allow text nodes as children, so such content pieces cannot be represented in the content tree format without losing this whitespace. As a result, the reverse transformation (from content tree back to bodyXML) will not match the original bodyXML stored in C&M. This discrepancy will complicate the automated testing of the content transformations we are currently implementing.
Impact & Examples
Based on the regex >\s+<, there are 329 such content pieces, including Specialist content. This number may be inaccurate, as I'm not sure the regex covers all relevant scenarios.
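As a rough illustration of how the regex above flags a body, here is a minimal Python sketch. The function name `find_inter_tag_whitespace` is hypothetical, and the sketch assumes the body has already been decoded from the JSON-escaped form shown in the examples (so `\n` is a real newline rather than a backslash followed by `n`).

```python
import re
from typing import List, Tuple

# The regex quoted in the issue: '>' followed by one or more
# whitespace characters, followed by '<'.
BETWEEN_TAGS = re.compile(r">\s+<")


def find_inter_tag_whitespace(body_xml: str) -> List[Tuple[int, str]]:
    """Return (offset, whitespace) for every whitespace run between tags.

    Hypothetical helper for illustration only; the offset points at the
    first whitespace character, not at the preceding '>'.
    """
    return [
        (match.start() + 1, match.group()[1:-1])
        for match in BETWEEN_TAGS.finditer(body_xml)
    ]


if __name__ == "__main__":
    sample = "<p>corporate behaviour. </p>\n<p>But if the business community</p>"
    for offset, run in find_inter_tag_whitespace(sample):
        print(offset, repr(run))
```

A body counts as affected when the returned list is non-empty. Whitespace that sits next to text rather than between two tags is not matched, which may be one of the scenarios the regex does not cover.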
... Last year, it secured<a href=\"https://www.idaireland.com/latest-news/press-release/record-r-d,-strong-capital-investment,-and-high-number-of-new-investors\"> </a>a record €1.9bn in research,...
<body><ft-content type=\"http://www.ft.com/ontology/content/ImageSet\" url=\"http://api.ft.com/content/93d94e4e-9883-5332-a576-0e928bfda3f1\" data-embedded=\"true\"></ft-content> \n \n\n\n\n\n\n<p>We’ve deliberated, cogitated and digested but, unlike Lloyd Grossman (presenter of the 1990s iteration of TV’s <em>MasterChef</em>), not concluded: analysing companies is more of a movable feast. Having dived into the drivers of <a href=\"https://markets.investorschronicle.co.uk/data/equities/tearsheet/summary?s=RR.:LSE\" title=\"\"><strong>Rolls-Royce</strong>’<strong>s</strong> <strong>(RR.)</strong>
There are examples from The Banker articles as well.
- The old Pink FT content seems to follow a pattern of putting new lines between <p> tags; many such examples were found.
Searching for clarity on human rights, Jan 2005:
...corporate behaviour. </p>\n<p>But if the business community felt it had won a victory in the apartheid litigation, it suffered something of a set-back a few weeks later. </p>\n<p>In mid-December, Unocal, the US oil company, announced it had reached a decision in principle to settle Alien Tort litigation over alleged complicity in human rights abuses in Burma. </p>\n<p>The Unocal case...
Proposed approach
- Replace \n with <br> where applicable.
- Remove other unnecessary whitespace between tags. Here we make the important assumption that the whitespace between tags is unnecessary.
- Republish affected articles without notifying the B2B consumers.
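The whitespace-removal step in the list above could look roughly like the following Python sketch. The function name `strip_inter_tag_whitespace` is hypothetical. The `\n` to `<br>` replacement is deliberately left out, because the rule for "where applicable" is not specified here.

```python
import re

# Same pattern as the detection regex: whitespace between '>' and '<'.
BETWEEN_TAGS = re.compile(r">\s+<")


def strip_inter_tag_whitespace(body_xml: str) -> str:
    """Remove every whitespace run that sits directly between two tags.

    Hypothetical helper: it relies on the stated assumption that all
    whitespace between tags is unnecessary. Whitespace inside text is
    left untouched.
    """
    return BETWEEN_TAGS.sub("><", body_xml)


if __name__ == "__main__":
    before = "<p>corporate behaviour. </p>\n<p>But if the business community</p>"
    print(strip_inter_tag_whitespace(before))
```

The function is idempotent, so running it over content that is already clean is harmless. Note that the same pattern also matches the single space between inline elements, such as `</strong> <strong>` in the Rolls-Royce example, where removing it would join the adjacent words; the assumption may need checking for inline tags before republishing.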
The idea behind republishing is that the body stored in the C&M platform and the body transformed back from the content tree should then be equal, and we can use that equality to verify the accuracy of the transformers.
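The equality check described above might be sketched as follows. The real transformers are not shown in this issue, so they are passed in as callables; `to_content_tree`, `to_body_xml`, `first_mismatch` and `round_trip_mismatch` are all hypothetical names used for illustration.

```python
from typing import Any, Callable, Optional


def first_mismatch(expected: str, actual: str) -> Optional[int]:
    """Return the index of the first differing character, or None if equal."""
    for index, (left, right) in enumerate(zip(expected, actual)):
        if left != right:
            return index
    if len(expected) != len(actual):
        # One string is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None


def round_trip_mismatch(
    stored_body_xml: str,
    to_content_tree: Callable[[str], Any],  # assumed: bodyXML -> content tree
    to_body_xml: Callable[[Any], str],      # assumed: content tree -> bodyXML
) -> Optional[int]:
    """Round-trip the stored body and report where it first diverges."""
    regenerated = to_body_xml(to_content_tree(stored_body_xml))
    return first_mismatch(stored_body_xml, regenerated)
```

A result of `None` means the regenerated body equals the stored one. Any integer is the offset of the first difference, which for the affected content would typically point at the whitespace that the content tree cannot represent.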
One challenge is that the validation currently allows such whitespace, meaning more affected content may continue to accumulate before bodyXML is deprecated.