Markdig



There are the folks who write YAML, and then there are the folks required to parse it. Let me assure you that writing it comes far more naturally than parsing it. YAML, which stands for YAML Ain't Markup Language, is a tricky format that promotes itself as a human-friendly data serialization standard for all programming languages.

What makes YAML tricky? The first audience for YAML is humans, not computers. The syntax aims to support constructs that can be hell to program for, with my favorite example being values that can either be a single value or an array across documents. The ambiguity of the syntax has spurred on multiple “YAML sucks” factions all over the world.

While the syntax can cause headaches, I still like YAML. YAML serves its purpose well in simple scenarios. This blog contains all post metadata in a terse front matter block.

Markdig, according to its description, "is a fast, powerful, CommonMark compliant, extensible Markdown processor for .NET".

In this post, you'll see how we can use YamlDotNet and Markdig to parse a blog post's front matter.

Markdig uses a configuration pipeline of supported features that you can add on top of the base parser. The .UseAdvancedExtensions method adds a number of common extensions (GitHub-flavored pipe tables, list extras, and so on), but you can also add each component individually and customize exactly how you want Markdown to be parsed.
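
For example, a minimal sketch of that configuration (the extension list in the comment is illustrative):

```csharp
using Markdig;

// Build a pipeline once and reuse it; UseAdvancedExtensions opts in to the
// common extension set (pipe tables, auto identifiers, emphasis extras, ...).
var pipeline = new MarkdownPipelineBuilder()
    .UseAdvancedExtensions()
    .Build();

var html = Markdown.ToHtml("# Hello *Markdig*", pipeline);
```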

Getting Started

We'll first need to install two NuGet packages: YamlDotNet and Markdig.

We will use Markdig to parse our blog posts, although we could also parse our files based on the standard convention of --- separators and forgo Markdig altogether. For this post, we’ll use it to handle slight discrepancies that might creep in between files.

The Front Matter

This blog runs on Jekyll, a blog engine written in Ruby. It has a flexible front matter format that can adapt to the needs of the author and the theme applied to the blog. This blog post's front matter looks something like this.
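
Something along these lines (hypothetical values, in Jekyll's front matter style):

```yaml
layout: post
title: "Parsing Markdown Front Matter With C#"
categories: dotnet
date: "2021-04-15"
tags: "csharp markdig yaml"
```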

We represent all the values as string values, including the tags (a particular quirk of a paging plugin).

The C# Code

The first step we need to take is to create a class that represents our blog's front matter. Note that many blogs will have different keys, so your class may vary from this example.
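
A sketch of such a class, with properties mirroring the hypothetical front matter above:

```csharp
// Properties map to the front matter keys; everything is a string,
// including the tags (see the note about the paging plugin above).
public class FrontMatter
{
    public string Layout { get; set; }
    public string Title { get; set; }
    public string Categories { get; set; }
    public string Date { get; set; }
    public string Tags { get; set; }
}
```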

Once we have our front matter class, we can create an extension method for Markdig's MarkdownDocument class. The document is where we pull our YAML front matter from, in the form of a YamlFrontMatterBlock.
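
A sketch of that extension method, assuming the YamlDotNet package for deserialization and Markdig's YAML front matter extension (the class and method names are my own):

```csharp
using System.Linq;
using Markdig.Extensions.Yaml;
using Markdig.Syntax;
using YamlDotNet.Serialization;
using YamlDotNet.Serialization.NamingConventions;

public static class MarkdownExtensions
{
    private static readonly IDeserializer YamlDeserializer =
        new DeserializerBuilder()
            .WithNamingConvention(CamelCaseNamingConvention.Instance)
            .IgnoreUnmatchedProperties()
            .Build();

    public static T GetFrontMatter<T>(this MarkdownDocument document)
        where T : class
    {
        // Find the YAML block that the YAML front matter extension produced.
        var block = document
            .Descendants<YamlFrontMatterBlock>()
            .FirstOrDefault();

        if (block is null)
            return null;

        // Rebuild the YAML text, defensively dropping any "---" separator
        // lines so YamlDotNet does not throw on them.
        var yaml = string.Join(
            "\n",
            block.Lines.Lines
                .Take(block.Lines.Count)
                .Select(line => line.ToString())
                .Where(line => line != "---"));

        return YamlDeserializer.Deserialize<T>(yaml);
    }
}
```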


We need to remove the --- separators from our YAML block, or YamlDotNet will throw an exception. Another critical element of our code is the inclusion of Markdig's YAML front matter extension.

Without the extension, we won’t see the YAML block in our parsed document.

Finally, we can call our extension method after reading our files from disk.
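
For example (the file path is hypothetical):

```csharp
using System;
using System.IO;
using Markdig;

// Without UseYamlFrontMatter, the YAML block never shows up in the document.
var pipeline = new MarkdownPipelineBuilder()
    .UseYamlFrontMatter()
    .Build();

var markdown = File.ReadAllText("_posts/2021-04-15-parsing-front-matter.md");
var document = Markdown.Parse(markdown, pipeline);
var frontMatter = document.GetFrontMatter<FrontMatter>();

Console.WriteLine($"{frontMatter?.Title} ({frontMatter?.Tags})");
```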

Running our code, we see the results of our front matter parser.

I hope this helped! Enjoy!

Update 16 June 2016: Added a Benchmarks section

Earlier this year, over two intensive months, I spent all my spare time implementing Markdig, a new Markdown processor for .NET. In this post, I will report back on the quirks and pitfalls I found while coding it, as well as the performance and design considerations I had to balance.

The post is organized like this:

  • A Why? introduction, where I explain why a new .NET Markdown parser is needed
  • The challenges of Markdown and the CommonMark specs
  • The challenges of developing an efficient extension system
  • Performance is in the details, where I go over the notable C# tricks and code I had to use to achieve good performance

First, you may wonder: why another Markdown processor for .NET? The simplicity of Markdown is delightful (I don't see any competitor as simple and readable), and since 2011 I have seen it growing steadily, starting from GitHub and recently getting even more traction (Microsoft documentation, language workbooks - for R, for C# by Xamarin - or Rust/Swift code comments, etc.); a format becoming as important as HTML, here to stay for a long, long time. And finally, it is a format for which I have plenty of ideas in some upcoming projects... (hence my direct personal interest! ;)

So for such an important format, we need a rock solid library for .NET!

And there are actually not that many packages for .NET (compared to Ruby, PHP, or JavaScript). But I was also specifically looking for an implementation that was:

  • Super fast (no regex) and GC friendly
  • CommonMark compliant
  • Providing an extensible plugin architecture, including to be able to change the behaviour of the core parser
  • Built in with many extensions (like pipe tables, etc.)
  • Supporting .NET Core (though this one is easy to add to any existing library)

I evaluated the following existing libraries:

  • MarkdownSharp and a fork on github: Used (or at least developed) by StackOverflow, port mainly based on the original Perl implementation, regex based, not many extensions.
  • Marked.NET, port of Marked from JavaScript, Regex based, few extensions.
  • MarkdownDeep: Full parser, code repo is a bit rough though, some extensions.
  • CommonMark.NET: Full parser, port of cmark, no extensions (on master)

Among the existing .NET libraries, two stood out from the crowd: MarkdownDeep and CommonMark.NET, but they were still lacking what I was looking for:

| Features | Markdig (New!) | CommonMark.NET | MarkdownSharp | MarkdownDeep | Marked.NET |
| --- | --- | --- | --- | --- | --- |
| Based | new | port of cmark (master)¹ | Perl | new | JS |
| Tech | full parser | full parser | regex | full parser | regex |
| CPU friendly | | | | | |
| GC friendly | | | | | |
| Extensions | 20+ | no (master)¹ | no | some | few |
| Plugin architecture | yes | no (master)¹ | no | no | no |
| Remove/change built-in features | yes | no | no | no | no |

¹ For CommonMark.NET, there is a full rewrite of the parser in a pipe_tables branch, not yet official and perhaps not finished. I also took the time to evaluate this one a bit more, but I was not really convinced by some design decisions (e.g. using only composition instead of a simple inheritance scheme for the syntax tree), and its plugin architecture was not versatile enough for what I was looking for... Still, overall, CommonMark.NET is a pretty solid library, perf- and code-quality-wise. To bootstrap the development, I reused some of its decoding primitives in Markdig (e.g. HTML entity decoding).

So I set out to see whether I could write a fully CommonMark-compliant parser with all the features and dreams I had in mind...

Luckily, the CommonMark specs written by John MacFarlane are amazing and frankly, without them, I wouldn't even have started this colossal work, because if the code was laborious and sometimes tricky, writing the specs is far more involved...

So what exactly is difficult about parsing Markdown? The first thing you will notice is that Markdown doesn't give you any syntax errors. Everything you type is somewhat valid (maybe not what you wanted, but still valid). If you are used to writing parsers (or you remember your compiler class!), you won't find a formal grammar for Markdown, and the reason is that some constructs are ambiguous. Emphasis handling in Markdown is one example. But even with the CommonMark specs, while developing it, I came across some cases that were still not clear:

Typically, a link followed by a setext heading:

You can have a look at the differences for this example on babelmark3 (a service I released recently, check it out!). Obviously, it is not well handled. It is not surprising, then, that even a CommonMark-compliant processor gives a different result...

While coding this specific case, I had, for example, to plug the handling of link definitions and setext headings into the ParagraphBlock parser:

  • Link definitions when closing the parsing of a paragraph block (See here)
  • setext headings when continuing the parsing of a paragraph block (See here)

The good thing is that the CommonMark specs come with an appendix, 'A parsing strategy', which helped a lot (even though I didn't exactly follow the recommended approach for handling emphasis, which is the trickiest part).

Basically the parsing is done in two steps:

  • block parsing time: It identifies the paragraph, block quotes, headings, fenced code blocks...etc.
  • inline parsing time: It identifies syntax inside a block like emphasis, links...etc.

Things like spaces, tabs, blank lines have their importance in Markdown. You cannot ignore them and put them aside.

There are also some lazy continuation rules. For example, a paragraph can continue on a following line and will be considered as part of the previous line (so in our case, as part of the block quote that is still pending)

These kinds of rules make it nearly impossible to express Markdown as a formal grammar, even with an infinite look-ahead parser.

So, needless to say, it was laborious to go through all of the CommonMark spec examples (more than 600) and get them all working properly. On my way to 100% completion, I was of course hit by regressions all around, finishing one feature only to realize that I had broken something I did the previous day! But wow, this is where I realized, more than in any previous project, that tests are pure gold, more important than the code itself.

In the end, it took me around two weeks (working early in the morning and in the evening) to build the core parsing part of CommonMark. But implementing the core specs is not enough: you obviously want all the useful extensions (pipe tables, automatic identifiers, etc.). And this part brought its own challenges... I spent the remaining month and a half coding all the extensions and polishing the code and the performance, and that was not an easy part either...

There are currently no real specifications for extensions in CommonMark, as the project is waiting for the core part to stabilize before proceeding further. So I had to lean on many useful discussions on the CommonMark forum.

The first one I tried was pipe tables: for this, I relied on the behaviour of some of the best implementations out there (PHP Markdown Extra by Michel Fortin or Pandoc by John MacFarlane) to integrate the behaviour into Markdig. But it turned out that I had to relax the parsing strategy above (two steps, blocks first, then inlines) in order to handle them correctly.

Typically the following should be a table (example on babelmark3):

But this should not (example on babelmark3):

While performance has been on the radar quite early in developing Markdig, implementing the 20+ extensions put a lot of pressure on the design of Markdig, and the two were sometimes fighting against each other. Because I wanted Markdig to be fully customizable (to some degree, of course), I had to insert many pluggable entry points into the API to make it possible. Doing so, I had to face many optimization and/or design challenges.

Typically, on the design side, when I added support for grid tables (example on babelmark3), the parser needed to support nested block parsing: a cell in a table is considered a sub-'Document' (with its own lines), so the parser had to be able to spawn itself in a mode where it could handle such cases.

On the performance side, Markdown is challenging because you need to handle character sequences efficiently, and unlike a conventional Markdown parser that uses a standard switch/case to handle them, you cannot do that with a pluggable architecture. Every 'token' character can be pluggable, and it may affect multiple parts of the pipeline.

So instead of having something like this (as seen in cmark):
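
Something in this spirit (hypothetical names, not cmark's actual code): every recognized opening character is hard-coded into the parser, so adding a new construct means editing the switch itself.

```csharp
// Fixed dispatch: the set of constructs is baked into the parser.
static string DescribeBlockStart(char c) => c switch
{
    '>' => "block quote",
    '#' => "ATX heading",
    '`' => "fenced code block",
    _   => "paragraph",
};
```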

you need to handle each incoming character like this:
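
Roughly like this (again hypothetical types, not Markdig's actual API): parsers declare their opening characters up front, and the engine looks up the candidates for each incoming character.

```csharp
using System.Collections.Generic;

interface IBlockParser
{
    char[] OpeningCharacters { get; }
    bool TryOpen(string line, int position);
}

sealed class BlockDispatcher
{
    // Built once from every registered parser, core and extensions alike.
    private readonly Dictionary<char, List<IBlockParser>> map = new();

    public void Register(IBlockParser parser)
    {
        foreach (var c in parser.OpeningCharacters)
        {
            if (!map.TryGetValue(c, out var list))
                map[c] = list = new List<IBlockParser>();
            list.Add(parser);
        }
    }

    public bool TryOpen(string line, int position)
    {
        if (!map.TryGetValue(line[position], out var candidates))
            return false;

        foreach (var parser in candidates)
            if (parser.TryOpen(line, position))
                return true;

        return false;
    }
}
```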

Each block and inline parser in Markdig defines a set of OpeningCharacters. For a BlockQuote, for example, it is simply defined as OpeningCharacters = new[] {'>'};, but to handle regular emphasis (like **...**) and all the additional emphasis extensions (strikeout, del, insert, etc.), you need to dynamically adapt the list of opening characters based on what is registered.

Things get worse when handling inlines inside a paragraph block. For instance, if you look at the beginning of this sentence, there are no characters that could be interpreted as inline elements (e.g. a pair of * for an emphasis), so when you want to handle this line efficiently, you want to eat characters with a bulk method. You don't want to go through Parsers.GetParsersForOpeningCharacter(c); and the subsequent loop for every single character. So in that case, the InlineParser, which is responsible for processing a sequence of literal text (text that has no other meaning for inlines), has to be optimized.

Example of InlineParser.cs with bulk parsing:
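
Roughly, in spirit (a simplified, hypothetical helper, not the actual InlineParser.cs source):

```csharp
// Scan forward to the next registered opening character and emit everything
// before it as a single literal run, instead of dispatching per character.
static (string Literal, int NextPosition) ReadLiteral(
    string text, int position, bool[] isOpeningCharacter)
{
    var i = position;
    while (i < text.Length && !isOpeningCharacter[text[i]])
    {
        i++; // plain literal characters: nothing to do, just advance
    }

    return (text.Substring(position, i - position), i);
}
```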

Here, we try to find, among the upcoming characters, any character that could be an opening character of another inline construct. Behind the scenes, this method is optimized to iterate quickly over the string, doing nothing other than checking whether each character is an opening character.

Because any new extension may come with a new character that may be interpreted differently, you need pluggability at the character level. This is one reason why you cannot really express such an efficient, pluggable Markdown parser with just a set of regexes or a fixed switch-matching pipeline.

An example of such a constraint: when I added support for additional ordered lists starting with letters (supporting a) b) c) and i) ii) iii)), you need to quickly identify whether you are in front of a list starting with a) or a paragraph starting with the letter b, or c, or any of a-z! Without an efficient lookup method, this single extension would have killed the performance of Markdig (and it still has a non-negligible impact even with proper optimizations).

Markdig Alternatives - .NET Markdown Processors | LibHunt

Also, I wanted to be able to change the behaviour of the core CommonMark parsing (or disable some parts, like HTML parsing). You may need, for instance, to test a different type of heading starting with a @ instead of a #... This requirement forced me to treat the core CommonMark parsers as plugins as well.

At the beginning, without the extensions, but with a still relatively pluggable architecture, Markdig was as fast as CommonMark.NET (master), which is only 15% slower than the C implementation. But once I finished building the 20+ extensions (while often checking carefully for performance regressions), I had to admit that I couldn't achieve the same level of performance with a more pluggable architecture.

That was the price to pay. Still, there are three remarks I would like to emphasize (even if they are obvious, it doesn't hurt to recall them, starting with myself!):

  • The first is that pluggability usually has a negative impact on performance, no matter how hard you try to mitigate it
  • The second is: don't expect a plugin system to be universal, there is no such thing! The unknown cannot be known in advance... Someone may come along one day with a new plugin requirement that shakes the foundation of your library (no less!)
  • Lastly, make your plugin system simple! Otherwise nobody will want (or know how) to develop a plugin. If it is laborious to develop or too convoluted, buried under tons of interfaces, abstractions, layered compositions, and indirections, you may have a beautiful plugin system but nobody on earth will tell you so! Though, I will still have to put this to the test for Markdig once I have written the documentation for it... ;)

The good thing is that I was able to make Markdig among the most efficient in terms of GC pressure, and among the best for CPU, but this required many little tricks and discipline in the code to achieve such a goal...

Very early in the development, I measured the performance of the parser regularly, both in terms of GC and CPU time. I was quite pleased at some point to switch to the amazing BenchmarkDotNet, which helped to quickly set up many small tests for the various performance cases I wanted to improve. In the meantime, they have added support for GC events, which is a must for measuring (and better explaining) the performance advantage of one method over another (though there is a pending issue to make it fully reliable in BenchmarkDotNet).

Since then, I have been using BenchmarkDotNet systematically for various benchmarks, and it is a great help: being able to set up a benchmark in seconds helps you not be too lazy about performance!

First, there are some general hints that you are most likely already aware of:

  • Don't use LINQ in any critical part of your code paths, as it allocates memory (for both the lambda instance and the cast to IEnumerable)
  • Prefer List<T> or Dictionary<TKey, TValue> over interfaces like IList<T>, for the same IEnumerable reason: enumerating through the interface forces boxing of the enumerator, which is usually implemented as a struct and exposed via duck typing on the concrete type.

Most of the techniques used here are to lower GC pressure, improve locality and help to reduce CPU footprint.

1) StringSlice, a lightweight string part

In the case of Markdown, a large part of a document is composed of small strings that end up written directly to the output when converting to HTML. In this case (and this is also often true for text templating systems), you don't want to perform any substring operations on the original, full Markdown document. You just want to work with slices/views of it without the burden of allocating additional objects.

Hence, in Markdig, there is a StringSlice type for handling such a case.

It has basically 3 fields:

  • the original string (the full Markdown text)
  • a start position within a string
  • an end position within a string (or the length)

Note that I also declared this type as a struct, as it is critical to avoid allocating slices on the heap; otherwise it would basically be the same as performing substrings.
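
A minimal sketch of the shape of such a type (the real StringSlice carries many more helpers):

```csharp
// A struct view over part of a string: no heap allocation, no substring copy.
public struct StringSlice
{
    public string Text;  // the original, full Markdown text
    public int Start;    // inclusive start position within Text
    public int End;      // inclusive end position within Text

    public int Length => End - Start + 1;

    public char this[int index] => Text[index];

    public override string ToString() => Text.Substring(Start, Length);
}
```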

So, for instance, if you load the full Markdown document to process into a single string and work with StringSlice over this string, it will generate around 30% fewer allocations than reading the source document with a TextReader line by line (which allocates a string for each line).

For this case, I wanted a dedicated, optimized slice for strings (with useful attached methods that cannot be written as extension methods, as we don't yet have implicit ref for them!). Otherwise, I can also suggest Slice.net by Joe Duffy for a generic version of it.

2) Use custom struct List<T>

Any time I had to use a List<T> as internal object state to store a variable array of items, I replaced it with a version that handles the array directly.

For instance, StringSlice values can be organized into a StringSliceGroup, and this type is again a struct. We save the cost of the List<T> object, which is just a wrapper around a T[].
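
A hedged sketch of what such a struct-based list can look like (simplified and hypothetical, not Markdig's actual type):

```csharp
// The wrapper itself is a struct living inline in its owner; only the backing
// array ever hits the heap, unlike List<T>, which adds its own object.
public struct ValueList<T>
{
    private T[] items;
    public int Count;

    public void Add(T item)
    {
        if (items == null)
            items = new T[4];
        else if (Count == items.Length)
            System.Array.Resize(ref items, Count * 2);

        items[Count++] = item;
    }

    public ref T this[int index] => ref items[index];
}
```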

3) Object pooling

Pooling is a common technique for reducing GC pressure, especially with StringBuilder, which allocates chunks of memory behind the scenes!

In Markdig there is a generic ObjectCache<T>, along with a specialized StringBuilderCache.

A notable addition to the StringBuilderCache is a thread-static local (TLS) field for accessing a local StringBuilder from almost anywhere inside a method. Instead of having to pass an ObjectCache around in every method signature, I could just use StringBuilderCache.Local() whenever I needed it, without having to release it either. Note that with such a class you need to make sure there is no nested usage of it within a nested method call stack.
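
A hedged sketch of the thread-static flavour (not Markdig's exact code):

```csharp
using System;
using System.Text;

internal static class StringBuilderCache
{
    [ThreadStatic]
    private static StringBuilder cached;

    // One builder per thread, reset on each use; no explicit release needed,
    // but nested calls on the same thread must be avoided.
    public static StringBuilder Local()
    {
        var builder = cached ??= new StringBuilder();
        builder.Length = 0;
        return builder;
    }
}
```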

4) Gentle StringBuilder allocations on changes only

It is quite common to need to substitute some elements in a string (requiring more than a simple Replace) and return a new string. Most of the time, a naive implementation creates a new StringBuilder and starts processing the original string. In the end, if nothing actually changed, we have still allocated two objects for nothing (the StringBuilder and the result of its StringBuilder.ToString()), and if the unchanged case is the most common one, that hurts performance!
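
A hedged sketch of the "allocate only on change" pattern, using a hypothetical character-stripping helper:

```csharp
using System.Text;

static string RemoveCharacter(string text, char toRemove)
{
    StringBuilder builder = null;

    for (var i = 0; i < text.Length; i++)
    {
        if (text[i] == toRemove)
        {
            // First change found: copy the untouched prefix, and nothing before that.
            builder ??= new StringBuilder(text, 0, i, text.Length);
        }
        else
        {
            builder?.Append(text[i]);
        }
    }

    // No change at all: no StringBuilder, no new string, just the original instance.
    return builder?.ToString() ?? text;
}
```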

5) string.IndexOf() and string.IndexOfAny()

If there is a need to scan for a particular character in a string and process things afterwards from that position, you may think that a custom string.IndexOf would be better, but it is actually implemented in the CoreCLR almost as an intrinsic, much faster than whatever you could come up with as a managed or unsafe equivalent, so trust IndexOf, it is fast!

On the other hand, if you need to check for many characters, the string.IndexOfAny is not that fast, and for matching opening characters in Markdig, I had to write a dedicated version, assuming that you know in advance the characters to match and can build pre-computation tables for them.

The class CharacterMap<T> is used extensively in Markdig to pre-compute maps between a char and an associated data instance. It was first used as a pluggable replacement for switch/case, but also for IndexOfAny: the following method, IndexOfOpeningCharacter, was used as a faster replacement:
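
In spirit, it looks roughly like this (a simplified sketch; the real type is smarter about non-ASCII characters):

```csharp
// A flat table indexed by char: one lookup replaces both switch/case dispatch
// and IndexOfAny over the set of registered opening characters.
public sealed class CharacterMap<T> where T : class
{
    private readonly T[] entries = new T[char.MaxValue + 1];

    public void Add(char c, T value) => entries[c] = value;

    public T this[char c] => entries[c];

    public int IndexOfOpeningCharacter(string text, int start, int end)
    {
        for (var i = start; i <= end; i++)
        {
            if (entries[text[i]] != null)
                return i;
        }
        return -1;
    }
}
```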

6) TextReader.ReadLine() fast but...

I started the implementation by using TextReader.ReadLine() to read the input Markdown document. The great thing about the implementations of this method is that they are usually fast enough.

The StreamReader.ReadLine, for example, is well optimized, using an internal pool of StringBuilders.

The StringReader.ReadLine also performs only a fast scan for the newline character and a substring on the original string.

My only gripe with these methods is that you lose the critical information of the newline character (was it \r, \r\n, or \n?), and hence lose the offset in the stream if you want to provide an accurate position within the original source.

Markdig no longer uses these methods; instead it loads the full Markdown document into a string and works with StringSlice, as I realized it was more efficient to load the document once into a string than to load each line into a new string. This approach also provides much better handling of precise source positions (e.g. for syntax highlighting).

Also, one thing that was a bit frustrating when trying to squeeze more performance out of string handling was the internal accessibility of FastAllocateString, which allows allocating a string on the heap without zeroing it; very useful, used in many places in the .NET Framework, but unfortunately inaccessible to us... :(

7) List<T> or T[] instead of Dictionary<string, T>

This is also nothing new, but for a small Dictionary (last time I checked, I believe it was for fewer than 10 elements, maybe around 7-8), an array actually gives faster access and is also much cheaper in terms of memory. This is used in Markdig to store the HtmlAttributes attached to each syntax tree node, and it saved a few megabytes when there are hundreds of thousands of instances around while parsing large documents!

On average, even if you just replace a Dictionary<TKey, TValue> internally with a List<T> (assuming that T provides access to the key), you will see 4x to 5x better performance (for the allocating/feeding part alone; access performance is also better when the number of elements is < 10). See this gist for more details.
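
A hedged sketch of the pattern (a hypothetical type, not Markdig's actual storage for HtmlAttributes):

```csharp
using System.Collections.Generic;

// For a handful of entries, a linear scan over key/value pairs is cheaper to
// allocate and often faster to query than a Dictionary<string, T>.
public sealed class SmallMap<T>
{
    private readonly List<KeyValuePair<string, T>> items = new();

    public void Add(string key, T value) =>
        items.Add(new KeyValuePair<string, T>(key, value));

    public bool TryGetValue(string key, out T value)
    {
        foreach (var item in items)
        {
            if (item.Key == key)
            {
                value = item.Value;
                return true;
            }
        }

        value = default;
        return false;
    }
}
```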

This is an early preview of the benchmarking against various implementations:

C implementations:

  • cmark (version: 0.25.0): Reference C implementation of CommonMark, no support for extensions
  • Moonshine: popular C Markdown processor

.NET implementations:

  • Markdig (version: 0.5.x): itself
  • CommonMark.NET(master) (version: 0.11.0): CommonMark implementation for .NET, no support for extensions, port of cmark
    • CommonMark.NET(pipe_tables): An evolution of CommonMark.NET, supports extensions, not released yet
  • MarkdownDeep (version: 1.5.0): another .NET implementation
  • MarkdownSharp (version: 1.13.0): Open source C# implementation of Markdown processor, as featured on Stack Overflow, regexp based.
  • Marked.NET (version: 1.0.5) port of original marked.js project
  • Microsoft.DocAsCode.MarkdownLite (version: 2.0.1) used by the docfx project

JavaScript/V8 implementations:

  • Strike.V8 (version: 1.5.0) marked.js running in Google V8 (not .NET based)

Analysis of the results:

  • Markdig is roughly 100x faster than MarkdownSharp and 30x faster than docfx
  • Among the best in CPU time: extremely competitive with, and often faster than, the other implementations (which are not feature-wise equivalent)
  • 15% to 30% fewer allocations and less GC pressure

Because Marked.NET, MarkdownSharp and DocAsCode.MarkdownLite are way too slow, they are not included in the following charts:


Performance for x86:


Performance for x64:

Though it has been quite laborious to develop, I have enjoyed developing Markdig quite a lot. It was so exciting to see the number of passing tests increase day after day of hard labour.


And with all the effort and love I have put into this library, Markdig is most likely the best Markdown engine in town for .NET, and probably one of the best around regardless of language, both feature- and performance-wise (doh, no less?!), so spread the word, folks! :D

I hope it will open opportunities in your projects as I hope it will do for my own... Happy coding!