Page MenuHomePhabricator

Whitespace around comment removed
Closed, ResolvedPublic

Description

e.g. https://en.wikipedia.org/?diff=1010309304&diffmode=source

This is a known issue when there is leading / trailing whitespace in list items interspersed with comments and other rendering-transparent nodes (categories, etc.). Parsoid stores a single offset value for leading / trailing offsets which is then used to recover whitespace from source on edited items.

Consider this example:

[subbu@earth:~/work/wmf/parsoid] echo -e '* x \n* y<!--boo--> \n* z <!--boo--> ' | php bin/parse.php
<ul data-parsoid='{"dsr":[0,35,0,0]}'><li data-parsoid='{"dsr":[0,4,1,0,1,1]}'>x</li>
<li data-parsoid='{"dsr":[5,19,1,0,1,1]}'>y<!--boo--></li>
<li data-parsoid='{"dsr":[20,35,1,0,1,-1]}'>z<!--boo--></li></ul>

Here the $dsr[5] is -1 for the last list item reflecting that interspersing and so when that list item is edited (say by adding a nested list item as happens in replies and that enwiki diff above), that whitespace is lost and causes dirtying.

There are couple possible approaches here:
(a) Change that dsr offset property from a single element to an array (something @cscott had suggested as a solution if it becomes necessary). This just makes the code more gnarly.
(b) Improve selser to better detect insertions at the end of a DOM element and use dsr-difference between the original version of the element and its sibling / parent to recover all non-rendering content (whitespace, comment, etc.).

I prefer (b) over (a) since it is an additional refinement of the selser algorithm without relying on collecting ever more fine-grained source offset information for questionable value.

Both involve a bit of work, so if / how soon we pick this up depends on how badly we need this fix.

Event Timeline

ssastry updated the task description. (Show Details)
ssastry updated the task description. (Show Details)
ssastry added a subscriber: cscott.

Change 668801 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: Quick hacky proof-of-concept

https://gerrit.wikimedia.org/r/668801

Esanders renamed this task from Whitespace around comment remove to Whitespace around comment removed.Mar 5 2021, 10:17 PM

That patch fixes the bug but it needs more work and cleanup to do it right. It goes with strategy (b) from the description.

[subbu@earth:~/work/wmf/parsoid] selser.sh
3c3,4
< * z <!--boo--> 
---
> * z<!--boo-->
> ** BOO
[subbu@earth:~/work/wmf/parsoid] git checkout T276620
Switched to branch 'T276620'
[subbu@earth:~/work/wmf/parsoid] selser.sh
3a4
> ** BOO

Change 668801 merged by jenkins-bot:
[mediawiki/services/parsoid@master] html2wt: Improve heuristics enabling reuse of separators from source

https://gerrit.wikimedia.org/r/668801

Change 675310 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 had a related patch set uploaded (by C. Scott Ananian; author: Subramanya Sastry):
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

Change 675310 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 merged by jenkins-bot:
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

For the sake of documentation, this was probably the most common dirty diffs in DiscussionTools edits recently, particularly on English Wikipedia. Examples:

I haven't seen any instances of it since April 1, so the patch seems to have resolved the problem!

ssastry claimed this task.

Yes, I noticed. I expect it is some edge case where the conditions for whitespace reuse were obviously not satisfied. I'll ignore this for now unless we see more of them showing up.