Skip to content

Conversation

@iansan5653
Copy link

@iansan5653 iansan5653 commented Aug 27, 2025

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and discussions and couldn’t find anything or linked relevant results below
  • I made sure the docs are up to date
  • I included tests (or that’s not needed)

Description of changes

TLDR: Adds support for extensions' enter and exit handlers to return false to fall back to earlier extensions or the default logic.

Context

I am working on a project where I'd like to attach additional information to nodes based on the information already in the tokenization output. In this particular case, I'm trying to record data about the specific syntax that formed a node in the tree (essentially building a sort of concrete syntax tree on top of the AST).

While building an extension that provides handlers for the relevant syntax tokens and attaches them to nodes in the stack, I noticed that I was breaking the built-in parsing functionality. Upon digging I realized that this was because my enter handlers were overriding and disabling the default handlers, even though I only wanted to encounter the tokens without doing the parsing and tree building myself.

I realized that a very useful feature of extension handlers would be the ability to 'fall through' to the default behavior if desired. To me the most intuitive approach for this was to support returning false to indicate that a token was not handled by the handler. This is a flexible API that allows for handlers to fall through conditionally, allowing them to choose to only take over the handling of a token in certain circumstances.

Since the code change here is fairly straightforward, I chose to go ahead and open a PR rather than start with an issue. I am very open to feedback here and happy to change the approach or reject the change entirely if you don't think it's useful.

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Aug 27, 2025
@wooorm
Copy link
Member

wooorm commented Aug 28, 2025

Hey!

Can you expand on your actual use case? Sounds like a lot of work to overwrite all handlers.
And, what if there is no handler before?
And, what for extensions that add more handlers?
How would you actually use this?

Why not reach for a beforeEnter/afterEnter/beforeExit/afterExit, something like that?

@iansan5653
Copy link
Author

iansan5653 commented Aug 28, 2025

Sure!

Can you expand on your actual use case? Sounds like a lot of work to overwrite all handlers.

My specific use case is that I am working on a Markdown editor and so I need to know the locations in code of many token types. I could just use micromark tokens directly, but the output from mdast-util-from-markdown is also useful as I also want to integrate some information about the document structure. Parsing out the boundaries of a listItem from the tokens, for example, is pretty difficult and I'd rather not duplicate that logic.

I could handle the mdast tree and the micromark tokens completely separately, but that would require inefficiently tokenizing the document twice and working with two data structures in parallel that would be hard to link together. I'd rather combine all the information I need into one tree by hooking into the existing functionality.

And, what if there is no handler before?

The combineHandlers logic I've defined would handle this case by simply returning the right handler.

And, what for extensions that add more handlers?

If extensions defined after this one define handlers for the same tokens, they would fall through in the same way:

  • If they return undefined, previously-defined handlers would not be called
  • If they return false, previously-defined handlers would still be called

How would you actually use this?

Here's a very simplified example (in practice I will handle many more token types and have a more generic handler):

function handleQuoteMarker(token) {
  this.stack.findLast(n => n.type === 'quote')?.children.push({
    type: "quoteMarker",
    position: {
      end: token.end,
      start: token.start,
    },
  });

  return false // allow the parser to still handle this token, if necessary, to build quote nodes
};

const tree = fromMarkdown(markdown, "utf8", {
  mdastExtensions: [{
    exit: {
      blockQuoteMarker: handleQuoteMarker
    }
  }],
});

Why not reach for a beforeEnter/afterEnter/beforeExit/afterExit, something like that?

That would also work by introducing new events that are less likely to conflict, although I would still be concerned about the possibility that the package internals or other extensions might start to depend on these in the future. I still wouldn't want my extension to override any default functionality, just augment it.

Perhaps the design for these new events could specify that all registered handlers for a token are always called in a consistent order, regardless of conflicts across extensions? Would it be confusing though if these work in a different way from enter and exit? But maybe that's fine since they would act more like "events" than the enter/exit handlers do.

@iansan5653
Copy link
Author

Hello @wooorm, I'm still interested in shipping this feature or some form of it if possible. We have been working around it by patching the installed NPM dependency but of course that's definitely not ideal.

Any chance we might be able to get this across the line? Happy to iterate on it and change the approach if you prefer something else.

@wooorm
Copy link
Member

wooorm commented Dec 4, 2025

Ah, very cool that GH is building exciting things on this :)

Thanks for your patience!

From your OG goal above, maybe the tokens are indeed what you should use.
Alternatively, when you are going with an AST, why not walk that AST? Instead of trying to intercept everything?

To get around: “that would require inefficiently tokenizing the document twice”, perhaps an alternative is to get access to the original tokens from here?
For “and working with two data structures in parallel that would be hard to link together”, well, you are dealing with 2 now, and yes, that comes with difficulties, but I don’t yet see how this makes that less difficult?
For “I'd rather combine all the information I need into one tree by hooking into the existing functionality.”, what kind of new different tree are you making?

Here's a very simplified example (in practice I will handle many more token types and have a more generic handler):

That’s interesting! What is this quote node? A custom construct? If that’s totally custom, then why are you worrying about things being overwritten? 🤔

That would also work by introducing new events that are less likely to conflict, although I would still be concerned about the possibility that the package internals or other extensions might start to depend on these in the future. I still wouldn't want my extension to override any default functionality, just augment it.

Perhaps the design for these new events could specify that all registered handlers for a token are always called in a consistent order, regardless of conflicts across extensions? Would it be confusing though if these work in a different way from enter and exit? But maybe that's fine since they would act more like "events" than the enter/exit handlers do.

Handlers (in the existing sense) are supposed to create something, mutate the result, and can be overwritten to do things differently. They assume control.
My idea for these more “listeners”, is to not really mutate the nodes. Maybe they add other fields (such as expose some extra stuff in data).

From your quote example, you are pushing children into some parent node. What if the other extensions you want to “fall through” expect those not to be there?

But, your example seems to be about a new node, so I don’t quite see how a fall through is needed, as no extension will handle it?

Sorry for the wait and all the questions!

@iansan5653
Copy link
Author

iansan5653 commented Dec 11, 2025

Alternatively, when you are going with an AST, why not walk that AST? Instead of trying to intercept everything?

To get around: “that would require inefficiently tokenizing the document twice”, perhaps an alternative is to get access to the original tokens from here?

How would we get access to the original tokens though? There's no token information on the resulting AST, so when we walk the AST, how could we locate the tokens that 'belong' to each node?

In other words, say we build an mdast tree from the source document, and now we want to augment that tree with the syntax token data. Since there's no tokens in the tree itself, we'd need to retokenize the entire document. This would be rather expensive compared to simply augmenting the tree as we build it, when the parser is already iterating through the tokens.

I guess with your "listeners" concept we could record each token as it is encountered as a side effect, so that we'd have the tokens saved for reuse when walking the AST. That would still be a rather complicated / expensive solution even without the retokenization step:

  • Register a listener that would store every token event in an array
  • Build the mdast AST
  • Walk the tree from the bottom up. For each node:
    • Search through the array for tokens that start and end in the node based on their offsets
    • For each token:
      • Build a corresponding AST node
      • Insert it into the parent node's children array in the right location
      • Remove it from the token array so it doesn't get added again while traversing up the tree

That’s interesting! What is this quote node? A custom construct? If that’s totally custom, then why are you worrying about things being overwritten? 🤔

That's a typo in my example - it should be blockquote rather than quote. The example was supposed to highlight that the target is not a custom node, but my typo made that more confusing, sorry!

Handlers (in the existing sense) are supposed to create something, mutate the result, and can be overwritten to do things differently. They assume control.

I think this summarizes well what the goal is with this change - in addition to supporting handlers that create something, it would add support for handlers that mutate something that was already created. The only way to do that at present is to re-walk the AST after creating it, which has the problem of lack of access to the original tokens as mentioned earlier.

From your quote example, you are pushing children into some parent node. What if the other extensions you want to “fall through” expect those not to be there?

In my opinion this is an acceptable risk. It's not unlike the risk of a handler that introduces a fully custom new node type -- the parent still might not expect that node to be there but we accept that risk when creating such low-level extensions.

But, your example seems to be about a new node, so I don’t quite see how a fall through is needed, as no extension will handle it?

Sorry again for the confusion here! Definitely meant for this to be the built-in node.

@iansan5653
Copy link
Author

iansan5653 commented Dec 11, 2025

If you are interested, here's the actual full source code of the micromark extension we have written to insert delimiter nodes into the AST. I didn't share it before because it was messy / overly complex for a simple proof of concept, but I've since cleaned it up a bit:

/** All Micromark token types that we convert to delimiter nodes. */
const delimiterTokenTypes = [
  'blockQuotePrefix',
  'strongSequence',
  'emphasisSequence',
  'codeTextSequence',
  'codeTextPadding',
  'strikethroughSequence',
  'thematicBreakSequence',
  'codeFencedFence',
  'atxHeadingSequence',
  'whitespace',
  'setextHeadingLine',
  'labelMarker',
  'labelImageMarker',
  'resource',
  'resourceDestination',
  'linePrefix',
  'listItemIndent',
  'listItemPrefix',
  'taskListCheck',
  'tableCellDivider',
  'tableDelimiterRow',
  'tableDelimiter',
  'autolinkMarker',
  'hardBreakEscape',
  'escapeMarker',
  'characterReference',
]

/** Util that returns the lowest open node on the stack - the one that we should push children to. */
function getParent(context) {
  // First look for any open delimiters in the delimiter stack
  if (context.data.delimiterStack?.length) return context.data.delimiterStack.at(-1)!

  const parent = context.stack.at(-1)

  // micromark uses Fragment nodes as temporary buckets to push things to. For most nodes we can just push to the fragment
  // and mdast will keep our data when it moves those nodes to the correct parent. But for Literal nodes (nodes with
  // a 'value' property), it will discard nodes we push to the fragment as it builds the plain string value. So in
  // that case we push to the fragment's parent (the Literal node) instead.
  if (parent?.type === 'fragment') {
    const grandparent = context.stack.at(-2)
    if (grandparent && 'value' in grandparent) return grandparent
  }

  return parent
}

/** Create a delimiter node and push it to the currently open node. */
const handleDelimiterEnter = function (this, token) {
  const node = {
    type: 'delimiter',
    tokenType: token.type,
    raw: this.sliceSerialize(token),
    position: {
      start: token.start,
      end: token.end,
    },
  }

  // Normally, we could just enter the delimiter using the built-in helper: `this.enter(delimiter, token)`
  // However, this would push the delimiter onto the main stack, which breaks micromark's expectations.
  // To work around this, we maintain a separate `delimiterStack` just for delimiters.
  const parent = getParent(this)
  if (parent) {
    parent.children ??= []
    parent.children.push(node)

    this.data.delimiterStack ??= []
    this.data.delimiterStack.push(node)
  }

  // Return false to fall through to the default logic to allow mdast to still create regular nodes from this token
  // (only works with our custom patch!)
  return false
}

/** Remove the delimiter from the stack to close it out. */
const handleDelimiterExit = function (this, token) {
  const node = this.data.delimiterStack?.pop()

  // This should be impossible, because we should never exit a node we didn't just enter
  if (node?.tokenType !== token.type) throw new Error(`Expected open '${token.type}' delimiter on stack`)

  return false
}

export function mdastDelimitersExtension() {
  return {
    enter: Object.fromEntries(delimiterTokenTypes.map(type => [type, handleDelimiterEnter])),
    exit: Object.fromEntries(delimiterTokenTypes.map(type => [type, handleDelimiterExit])),
  }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🤞 phase/open Post is being triaged manually

Development

Successfully merging this pull request may close these issues.

2 participants