Shortcode Problems: WordPress 4.4

miqrogroove
2015-08-31T11:07:48+00:00

I will briefly summarize Shortcode API changes since WordPress 4.0 and then kick off some ideas for a roadmap.

The first major accomplishment was the expansion of the API documentation, including a new large section I wrote about the formal syntax for shortcode input.

I also put forward a robust parser concept for the function wptexturize() that promised to re-introduce the ability to use unrestricted HTML code inside of shortcodes and shortcode attributes.  That concept went through many, many changes before being introduced in v4.2.3.  After consulting with the WordPress security team, and after extensive testing of the shortcode parsing functions, we determined that the shortcodes-first parsing strategy was fundamentally flawed and could not be included with any version beyond v4.2.2.  This is why I added an HTML parser to the Shortcode API and ultimately curtailed the use of shortcodes inside HTML rather than expand the use of HTML inside shortcodes.

Questions for the Roadmap

Should We Allow Shortcodes in HTML Attributes?

There are strong arguments for and against any use of shortcodes inside the HTML angle braces.  As of v4.2.3, all users are effectively restricted to the author-level HTML filter whenever a shortcode is used.  We could try to improve upon this in several ways:

  1. Implement the unfiltered_html capability in the Shortcode API so that different users will experience different levels of HTML filtering around shortcodes.  This would restore all shortcodes-in-HTML functionality, but only for Administrators and Editors.  Other users would remain filtered, possibly adding to the confusion.
  2. Add a simpler parameter to the Shortcode API that indicates whether or not all of the input needs to be filtered.  Same idea here as above, but less flexible.
  3. Further restrict shortcodes in unnamed HTML elements.  There were many strong opinions about whether we should do this already in v4.3.
  4. Ignore all shortcodes in HTML attributes.  It would be fast and secure, but it would break some plugins.
  5. Do nothing.  The advantage of not changing the API is increased plugin stability.
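
As a rough illustration of option 4, here is a minimal Python sketch (not WordPress code) that expands shortcodes only in the text between HTML tags and leaves anything inside the angle braces untouched.  The names TAG_RE, SHORTCODE_RE, expand_outside_tags and the [year] shortcode are all hypothetical, and the tag pattern is deliberately simplified.

```python
import re

# Simplified tag splitter (an assumption for this sketch): it does not cope
# with every edge case, such as a ">" inside a quoted attribute value.
TAG_RE = re.compile(r"(<!--.*?-->|<[^>]*>)", re.DOTALL)

# Old-style self-closing shortcode such as [gallery ids="1,2"].
SHORTCODE_RE = re.compile(r"\[(\w+)([^\]]*)\]")


def expand_outside_tags(content, handlers):
    """Run shortcode handlers on text pieces only; leave tags as typed."""

    def run(match):
        name, raw_attrs = match.group(1), match.group(2).strip()
        handler = handlers.get(name)
        # Unregistered names are left exactly as the user typed them.
        return handler(raw_attrs) if handler else match.group(0)

    pieces = TAG_RE.split(content)
    # With one capturing group, re.split puts the tags at odd indexes,
    # so the even indexes hold the text between tags.
    for i in range(0, len(pieces), 2):
        pieces[i] = SHORTCODE_RE.sub(run, pieces[i])
    return "".join(pieces)


# Hypothetical shortcode used only for this example.
handlers = {"year": lambda attrs: "2015"}
html = '<a title="[year]">Copyright [year]</a>'
print(expand_outside_tags(html, handlers))
```

The shortcode in the title attribute survives unexpanded, which is exactly the behavior that would break plugins relying on shortcodes in attributes.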

Should We Allow HTML in Shortcode Attributes?

The problem with HTML inside of shortcodes is that it greatly increases the complexity of the other content filters, such as wptexturize() and wpautop().  That complexity results in many bugs that are difficult if not impossible to fix.  Here are some ways to deal with these problems:

  1. Ignore all shortcodes that contain angle braces.  This would make our filters a bit simpler, but would break some existing content.  The content filters would still have to include a lot of extra code to avoid treating the shortcodes as regular content.
  2. Change the default filter order so that shortcodes are processed before running the other content filters (but not before KSES when needed).  This would greatly simplify the problems with the other content filters, but would break some plugins due to the change of expected output.
  3. Do nothing.  This is good for plugin stability, but very bad for core quality.
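
To make the filter-order problem in option 2 concrete, here is a toy Python sketch.  The functions toy_texturize and toy_shortcodes and the [say] shortcode are hypothetical stand-ins, far simpler than wptexturize() or the real shortcode pass, and KSES is not modeled at all.  The point is only that a naive typography filter corrupts shortcode attributes when it runs first, and stays naive without harm when shortcodes run first.

```python
import re


def toy_texturize(text):
    """Naive stand-in for a typography filter: curl paired straight quotes."""
    return re.sub(r'"([^"]*)"', "\u201c\\1\u201d", text)


def toy_shortcodes(text):
    """Stand-in shortcode pass: [say text="..."] becomes its text value."""
    return re.sub(r'\[say text="([^"]*)"\]', r"\1", text)


content = 'She said "hi" then [say text="bye"]'

# Typography first: the attribute quotes get curled, so the shortcode
# pattern no longer matches and the raw shortcode leaks into the output.
texturize_first = toy_shortcodes(toy_texturize(content))

# Shortcodes first: the typography filter never sees shortcode syntax.
shortcodes_first = toy_texturize(toy_shortcodes(content))

print(texturize_first)
print(shortcodes_first)
```

The trade-off named in option 2 still applies: with shortcodes first, every later filter also runs over the shortcode output, which is the change of expected output that could break plugins.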

Should We Re-Parse the HTML with Each Filter?

The more bugs that we fix in the Formatting component of WordPress, the more places we will have to test individual HTML elements to get correct output.  So it starts to look very repetitive to parse and loop over here, then parse and loop over there, and so on.  However, due to the nature of shortcodes and “autop” adding more HTML elements within the content, I am starting to feel comfortable with the idea of repeatedly parsing the HTML.  Even if I try a more structured approach, it will still be necessary to search and replace large portions of that structure.  With the addition of wp_html_split() in v4.2.3 I hope to at least move away from having many different HTML parsers in the core code.  If there are any concerns about this parsing strategy, they should be addressed now rather than later.
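
The re-parsing argument can be sketched in a few lines of Python.  This is an illustration under simplifying assumptions, not the behavior of wp_html_split() or wpautop(): html_split, toy_autop and map_text are hypothetical names, and the tag pattern ignores comments and quoted ">" characters.  It shows that a split taken before a paragraph filter runs no longer describes the content afterwards, so each filter has to split again, ideally with one shared splitter.

```python
import re

# Simplified tag pattern; an assumption for this sketch only.
TAG_RE = re.compile(r"(<[^>]*>)")


def html_split(content):
    """Return alternating text and tag pieces; tags sit at odd indexes."""
    return TAG_RE.split(content)


def toy_autop(content):
    """Stand-in paragraph filter: wrap blank-line separated blocks in <p>."""
    blocks = [b for b in content.split("\n\n") if b.strip()]
    return "\n".join(f"<p>{b}</p>" for b in blocks)


def map_text(content, fn):
    """Re-split the content and apply fn to the text pieces only."""
    pieces = html_split(content)
    pieces[0::2] = [fn(p) for p in pieces[0::2]]
    return "".join(pieces)


raw = "one <b>two</b>\n\nthree"
before = html_split(raw)            # split of the raw content
after = html_split(toy_autop(raw))  # split again once <p> tags exist

print(len(before), len(after))
print(map_text(toy_autop(raw), str.upper))
```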

I Can Has_Shortcode?

One of the new tickets points out that the function has_shortcode() is neglected and doesn’t work the same way as the others. Do we need a more modular design for the API or does this just need a quick update?  Did this function ever work correctly?  Does anyone still use it?

If We Could Go Back and Do This All Over Again…

One of the central lessons of these API changes is that the square brace syntax for shortcodes was a poor choice.  It doesn’t mix well with HTML, it doesn’t parse well with regex, and it prevents users from typing square braces inside shortcode attribute values.  If we could re-invent shortcodes today, we would certainly not use the same syntax.

On that note, I would like to propose adding a second shortcode syntax.  I have not yet developed any details, but I would like to accomplish the following objectives:

  1. Opening and closing delimiters composed of two or more characters that would be far more distinctive than a single square brace.
  2. Backward compatibility with most existing shortcodes.  Minimal code changes should be needed in plugins that want to use this new syntax.  This means, first and foremost, the function add_shortcode() will automatically register both the new and old syntax by default.
  3. New options for shortcode registration will be available only when using the new syntax for input.  Naturally, the first new option will allow each shortcode to drop support for the old syntax.  Other options might allow each shortcode to specify a user capability, parent/child nesting, notexturize, and no autop to resolve other issues.  The media embedding system may need options for registering multiple callbacks while we still support the old syntax.
  4. Strictly forbid shortcodes-in-HTML and HTML-in-shortcodes for the new syntax.  I want the new delimiters to be slightly magical, so the post editor and/or the pre-save filters would actually encode or strip any angle braces found inside the new shortcode syntax before running KSES.  This filtering would apply to all users and administrators.  The new syntax is not magical in HTML attributes and will be ignored there.
  5. Because of the added restrictions, the new syntax should not be a superset of the old syntax.  A good choice might be [{{caption}}] as a replacement for [caption].
  6. Allow unregistered shortcodes to become magical, giving no output, possibly an error message, or being easily escaped for display purposes.  This also greatly improves speed and scalability by not requiring a horrendously long regexp to match registered names.
  7. The opening and closing delimiters should be optimized for regex searches.  An escaped shortcode should not match the primary regex and might look like [!{{caption}}].  An enclosing shortcode might look like [{{caption}$]Hello[{{/caption}}]
  8. Add a separate default filter that runs the new syntax before all other filters.  Keep it simple.
  9. Adjust all of the core shortcodes to provide the newer syntax for new posts.
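
Since no details of the new syntax exist yet, the following Python sketch is only one possible reading of objectives 1, 4, 6 and 7 for the self-closing form.  NEW_SYNTAX_RE, run_new_syntax and the [{{year}}] shortcode are hypothetical, and the enclosing form, the pre-save encoding of angle braces, and the display of escaped shortcodes are not modeled.

```python
import re

# Hypothetical pattern for the proposed [{{name attrs}}] form.  No registered
# names appear in the pattern, and angle, square or curly braces inside the
# delimiters prevent a match (an assumption standing in for the pre-save
# filtering described in objective 4).
NEW_SYNTAX_RE = re.compile(
    r"\[\{\{([a-z][a-z0-9_-]*)((?:\s[^\[\]<>{}]*)?)\}\}\]"
)


def run_new_syntax(content, handlers):
    """Replace each new-syntax shortcode with its handler output."""

    def run(match):
        name, raw_attrs = match.group(1), match.group(2).strip()
        handler = handlers.get(name)
        # Unregistered shortcodes are "magical": here they give no output.
        return handler(raw_attrs) if handler else ""

    return NEW_SYNTAX_RE.sub(run, content)


# Hypothetical shortcode used only for this example.
handlers = {"year": lambda attrs: "2015"}

print(run_new_syntax("(c) [{{year}}]", handlers))        # registered
print(run_new_syntax("[{{unknown a=1}}]", handlers))     # unregistered
print(run_new_syntax("[!{{year}}]", handlers))           # escaped form
print(run_new_syntax("[year]", handlers))                # old syntax
```

Because the pattern requires "{{" immediately after the square brace, the escaped form [!{{year}}] and the old [year] syntax fall through untouched, and the pattern length stays constant no matter how many shortcodes are registered.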

Any other feature requests or objections to having more syntax options available?

22 Aug 2015

Category:
Systems Engineering

3 Comments

  • chris says:

    I actually use has_shortcode a lot, and for basic implementations of it, it works incredibly well.

    I would consider deprecating the old style afterwards, and then could consider removing it after several major versions.

    One thing that would really help with this process is a routine that swaps “old syntax” to “new syntax” so if a person updates their shortcodes to the new system they don’t have to rewrite all of their content manually.

  • miqrogroove says:

    Recap of the meeting in the #feature-shortcode channel today:

    We could alleviate some of the concerns about the ban of HTML-in-shortcodes by introducing a new feature: Multiple enclosures. This would allow insertion of an arbitrary, possibly unlimited number of HTML enclosures for each enclosing shortcode pattern, rather than being limited to a single content variable. For consistency, we would need an extra delimiter for separating enclosures, and we have to ensure it is also ignored inside of any HTML elements.

    There were no objections to the ban on shortcodes-in-HTML.

    Better core support of nested shortcodes is needed.

    The proposed syntax changes look confusing at first, but no alternatives were suggested.

    Attribute value encoding for special characters remains a point of confusion.

  • cfc says:

    Given the direction HTML 5 & Web Components are heading in, is the concept of a unique shortcode syntax even still necessary? Is there a good reason not to just use custom elements for this? AngularJS has already had good success with this approach.

    It would basically mean moving from:

    [gallery ids="729,732,731,720"]

    To something like:

    <wp-gallery ids="729,732,731,720">

    Since angle brackets already have special meaning in HTML, and already have to be escaped as a result, this completely avoids adding a new special character that would require escaping.

    Regex isn’t awesome for parsing HTML, but I think that problem is going to carry over to any markup-style syntax (i.e., any shortcodes that are allowed to contain attributes and surround other content). Markup languages by nature are just too complex to parse easily & reliably using regex. But there are good XML parsing features built into PHP that could handle this.

    It also allows unrecognized shortcodes to be completely ignored by the parser, since they’ll also be completely ignored by the browser.
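
The custom-element idea from the last comment can be tried out with a stock HTML parser.  The sketch below uses Python's html.parser as a stand-in for the PHP parsing features the comment mentions.  The "wp-" prefix convention and the ShortcodeElementFinder class are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class ShortcodeElementFinder(HTMLParser):
    """Collect custom elements whose tag name starts with "wp-"."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Any element without the (assumed) "wp-" prefix is ignored,
        # just as a browser ignores unknown elements.
        if tag.startswith("wp-"):
            self.found.append((tag, dict(attrs)))


finder = ShortcodeElementFinder()
finder.feed('<p>Photos:</p><wp-gallery ids="729,732,731,720"></wp-gallery>')
finder.close()

# Attribute values arrive already unquoted, so no shortcode regex is needed.
ids = finder.found[0][1]["ids"].split(",")
print(finder.found[0][0], ids)
```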
