Content with whitespace between XHTML tags #66

@epavlova

Some content pieces contain unnecessary whitespace between XHTML tags in their body. This issue is closely related to PR #61.

Currently, the content tree’s body-block definition does not allow text nodes as children, so such content pieces cannot be represented in the content tree format without losing this whitespace. As a result, the reverse transformation (from content tree back to bodyXML) will not match the original bodyXML stored in C&M. This discrepancy will complicate the automated testing of the content transformations we are currently implementing.
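
To illustrate where this whitespace ends up once bodyXML is parsed, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It is not the content tree transformer; the helper name `whitespace_text_nodes` and the sample body are made up for illustration. In ElementTree, text that follows a child element is exposed as that child's `tail`, which is the whitespace a body that allows only block children has no slot for.

```python
import xml.etree.ElementTree as ET


def whitespace_text_nodes(body_xml: str) -> list[str]:
    """Return whitespace-only text found directly inside the root element.

    Hypothetical helper for illustration: it only looks at the top level
    (before the first child and after each child), not at nested elements.
    """
    root = ET.fromstring(body_xml)
    found = []
    # Text between the opening root tag and the first child element.
    if root.text and not root.text.strip():
        found.append(root.text)
    # Text between one child element and the next sibling (or the closing tag).
    for child in root:
        if child.tail and not child.tail.strip():
            found.append(child.tail)
    return found


sample = "<body><p>One. </p>\n<p>Two.</p></body>"
print(whitespace_text_nodes(sample))
```

Each string returned here would have to become a text child of the body for a lossless round trip.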


Impact & Examples

A search with the regex >\s+< finds 329 such content pieces, including Specialist content. This number may be inaccurate, as I'm not sure the regex covers all required scenarios.
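
For reference, a rough Python sketch of how such a count could be reproduced over a set of bodies. Only the pattern `>\s+<` comes from this issue; the function names and the sample bodies are assumptions, and how the bodies are fetched from C&M is not shown.

```python
import re

# The pattern quoted above: a '>' followed by whitespace only, then a '<'.
INTER_TAG_WS = re.compile(r">\s+<")


def has_inter_tag_whitespace(body_xml: str) -> bool:
    """True if the body contains at least one whitespace-only gap between tags."""
    return INTER_TAG_WS.search(body_xml) is not None


def count_affected(bodies) -> int:
    """Count bodies with at least one match (hypothetical helper)."""
    return sum(1 for body in bodies if has_inter_tag_whitespace(body))


# Made-up sample bodies, shaped like the examples in this issue.
samples = [
    "<body><p>One. </p>\n<p>Two.</p></body>",
    '<body><p>it secured<a href="https://example.com"> </a>a record</p></body>',
    "<body><p>No gaps</p><p>here</p></body>",
]
print(count_affected(samples))
```

Because this is a plain text match rather than a parse, it would also fire on `> <` sequences inside comments, CDATA sections or attribute values, which is one way the count could be off.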

... Last year, it secured<a href=\"https://www.idaireland.com/latest-news/press-release/record-r-d,-strong-capital-investment,-and-high-number-of-new-investors\"> </a>a record €1.9bn in research,...
<body><ft-content type=\"http://www.ft.com/ontology/content/ImageSet\" url=\"http://api.ft.com/content/93d94e4e-9883-5332-a576-0e928bfda3f1\" data-embedded=\"true\"></ft-content> \n    \n\n\n\n\n\n<p>We’ve deliberated, cogitated and digested but, unlike Lloyd Grossman (presenter of the 1990s iteration of TV’s <em>MasterChef</em>), not concluded: analysing companies is more of a movable feast. Having dived into the drivers of <a href=\"https://markets.investorschronicle.co.uk/data/equities/tearsheet/summary?s=RR.:LSE\" title=\"\"><strong>Rolls-Royce</strong>’<strong>s</strong> <strong>(RR.)</strong>

There are examples from The Banker articles as well.

  • The old Pink FT content appears to follow a pattern of putting new lines between <p> tags; many such examples were found.

Searching for clarity on human rights, Jan 2005:

...corporate behaviour. </p>\n<p>But if the business community felt it had won a victory in the apartheid litigation, it suffered something of a set-back a few weeks later. </p>\n<p>In mid-December, Unocal, the US oil company, announced it had reached a decision in principle to settle Alien Tort litigation over alleged complicity in human rights abuses in Burma. </p>\n<p>The Unocal case...

Proposed approach

  • Replace \n with <br> where applicable.
  • Remove other unnecessary whitespace between tags. Here we make the important assumption that the whitespace between tags is unnecessary.
  • Republish affected articles without notifying the B2B consumers.
    The idea behind republishing is that the version of the body stored in the C&M platform and the version transformed back from the content tree should then be equal, and we can use that fact to verify the accuracy of the transformers.
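
The whitespace-removal step could look roughly like the Python sketch below. It assumes the same `>\s+<` pattern used to find the affected content; the function name and sample body are hypothetical. It does not cover the `\n` to `<br>` replacement, since "where applicable" is not yet defined.

```python
import re

# Assumption: reuse the pattern that was used to find the affected content.
INTER_TAG_WS = re.compile(r">\s+<")


def strip_inter_tag_whitespace(body_xml: str) -> str:
    """Drop whitespace-only runs that sit between a '>' and the next '<'."""
    return INTER_TAG_WS.sub("><", body_xml)


# Made-up body in the shape of the old Pink FT examples.
stored = "<body><p>First paragraph. </p>\n<p>Second paragraph. </p></body>"
cleaned = strip_inter_tag_whitespace(stored)
print(cleaned)

# Running the clean-up a second time changes nothing, so a republished body
# can be compared directly with the body regenerated from the content tree.
assert strip_inter_tag_whitespace(cleaned) == cleaned
```

Note that this applies the assumption stated above literally. In the Rolls-Royce example the space in `</strong> <strong>` separates two words, so whitespace between inline elements may need to be kept or reviewed before republishing rather than stripped blindly.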

One challenge is that the validation currently allows such whitespace, meaning more affected content may continue to accumulate before bodyXML is deprecated.
