But here's the way I'd like to see it done:

2. Take the data characters in the content of a block
structuring element (or heading) and its descendants, and break
it into words, delimited by spaces.

Many languages do not delimit words with spaces, e.g., Japanese, Chinese,
and Thai. In fact, the written forms of these languages often make
it impossible to determine word boundaries without understanding the
semantics of the text.
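
To make the problem concrete, here is a minimal sketch in Python. The
function name and the sample strings are my own, chosen purely for
illustration; they are not part of the proposal. It applies the
"delimited by spaces" rule to an English phrase and to a Japanese
sentence that contains no spaces.

```python
def split_on_spaces(text):
    """Naive word breaking: split on runs of whitespace only."""
    return text.split()


# Hypothetical sample inputs, for illustration only.
english = "break it into words"
japanese = "これは日本語の文です"  # roughly: "This is a Japanese sentence"

english_words = split_on_spaces(english)
japanese_words = split_on_spaces(japanese)

print(len(english_words), english_words)
print(len(japanese_words), japanese_words)
```

The English phrase comes back as four words, but the whole Japanese
sentence comes back as a single "word", because there is no whitespace
for the rule to act on.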