Microsoft’s lack of sticking to standards rears its ugly head again

I was going to write a short article on how easy it is to extract information from, and generally mess around with, Windows Live Spaces blog posts in your code using LINQ. However, due to Microsoft’s lack of sticking to standards, the part I intended to write about has turned out to be impossible.

You can still do some things with Windows Live Spaces blog posts and LINQ, just unfortunately not the really interesting stuff. Let me explain.

You can access Windows Live Spaces (WLS) blog posts through the RSS feed that WLS exposes, for example "http://msnwindowslive.spaces.live.com/feed.rss". This is the main feed for the site. You can also select posts by category, for example "http://msnwindowslive.spaces.live.com/category/Programming/feed.rss". Simply insert "category/[category name]" after the main URL, followed of course by "feed.rss".

Since an RSS feed is really only an XML document, it can easily be loaded into an XElement or XDocument using LINQ to XML :-
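As a rough sketch of that URL pattern, the two kinds of feed address could be built with a small helper. The FeedUrls class and its Build method are made up for illustration, and escaping the category name with Uri.EscapeDataString is an assumption about how WLS expects category names containing spaces or punctuation to be written.

```csharp
using System;

public static class FeedUrls
{
    // Hypothetical helper: returns the main feed URL for a space, or the
    // category feed URL when a category name is supplied.
    public static string Build(string spaceUrl, string category = null)
    {
        string baseUrl = spaceUrl.TrimEnd('/');
        if (string.IsNullOrEmpty(category))
        {
            return baseUrl + "/feed.rss";
        }

        // Assumption: category names are URL-escaped in the path.
        return baseUrl + "/category/" + Uri.EscapeDataString(category) + "/feed.rss";
    }

    public static void Main()
    {
        Console.WriteLine(Build("http://msnwindowslive.spaces.live.com"));
        Console.WriteLine(Build("http://msnwindowslive.spaces.live.com", "Programming"));
    }
}
```

For the example space, the second call produces the same category URL quoted above.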

using System;
using System.Xml.Linq;

namespace LINQTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the whole RSS feed; the root <rss> element becomes xel.
            XElement xel = XElement.Load("http://msnwindowslive.spaces.live.com/feed.rss");
        }
    }
}

Here we are simply loading the RSS feed into an XElement object. Once you have the RSS feed you can set about querying it. As a quick example, here is some code that displays all the blog post titles from a WLS RSS feed :-

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

namespace LINQTest
{
    class Program
    {
        static void Main(string[] args)
        {
            XElement xel = XElement.Load("http://msnwindowslive.spaces.live.com/feed.rss");

            // Every <title> nested inside an <item> is a blog post title.
            IEnumerable<XElement> titles = xel.Descendants("item").Descendants("title");
            foreach (XElement title in titles)
            {
                Console.WriteLine(title.Value);
            }

            // Keep the console window open until Enter is pressed.
            Console.ReadLine();
        }
    }
}

As you can see, it is extremely easy. First we load the RSS into an XElement object as described above. Since there may be more than one blog post title in the feed, we can’t simply map it to a single XElement object, so we create a collection of XElement objects using IEnumerable<XElement>. We call this collection titles and assign to it all the "title" elements that are sub-elements of an "item". For those who are unfamiliar with RSS feeds, each blog post is an "item". Within each "item" there are numerous elements, including "title", which holds the title of the blog post. Note that XML element names are case-sensitive, so the lowercase names must be used in the query.

[Image: BlogPostTitles]

If you want the article itself, simply swap "title" for "description", as the actual blog post is contained within the description element of each item.
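As a minimal sketch, the title and the description of each item could also be read together in a single query. The feed here is a made-up inline sample rather than the live WLS feed, and the FeedItems class and its Read method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FeedItems
{
    // Made-up sample feed standing in for the live WLS feed.
    public const string SampleRss =
        "<rss version=\"2.0\"><channel><title>Sample space</title>" +
        "<item><title>First post</title><description>&lt;p&gt;Hello&lt;/p&gt;</description></item>" +
        "<item><title>Second post</title><description>&lt;p&gt;World&lt;/p&gt;</description></item>" +
        "</channel></rss>";

    // Returns one (title, description) pair per <item>, in feed order.
    public static List<KeyValuePair<string, string>> Read(XElement rss)
    {
        return rss.Descendants("item")
                  .Select(item => new KeyValuePair<string, string>(
                      // Casting an XElement to string yields its text value
                      // (or null if the element is missing).
                      (string)item.Element("title"),
                      (string)item.Element("description")))
                  .ToList();
    }

    public static void Main()
    {
        XElement rss = XElement.Parse(SampleRss);
        foreach (KeyValuePair<string, string> post in Read(rss))
        {
            Console.WriteLine(post.Key + ": " + post.Value);
        }
    }
}
```

Using item.Element rather than Descendants keeps the channel's own title out of the results, since only direct children of each item are read.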

Now this is where things were going to get interesting, and where the meat of the article would have come from. A blog post in WLS is really only an XHTML snippet. XHTML, as you can probably tell from the "X", is really just XML. If it’s just XML then we should be able to load the actual blog post into an XElement and start playing about with it, for example to extract all the images contained within the blog post, amend the XHTML and so on.

Unfortunately, because the XHTML is non-standard, we cannot do this. When you try to parse the XHTML into an XElement object, you will receive a runtime error. On first inspection this is because all the angle brackets have been escaped (HTML encoded). I can definitely see why Microsoft did this, and it is very easy to fix: simply replace the escaped character sequences with the actual angle brackets and then parse the resulting string into an XElement object :-
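Assuming the post body really is well-formed XHTML, the idea could be sketched roughly like this. The PostImages class, its ImageSources method and the sample snippet are all made up for illustration; the snippet is not taken from a real WLS post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class PostImages
{
    // Made-up, well-formed XHTML snippet standing in for a blog post body.
    public const string SampleXhtml =
        "<div><p>Intro</p><img src=\"one.png\" alt=\"\" />" +
        "<p><img src=\"two.png\" alt=\"\" /></p></div>";

    // Returns the src attribute of every <img> in the snippet, in document order.
    public static List<string> ImageSources(string xhtml)
    {
        XElement body = XElement.Parse(xhtml);
        return body.Descendants("img")
                   .Select(img => (string)img.Attribute("src"))
                   .Where(src => src != null) // skip images without a src attribute
                   .ToList();
    }

    public static void Main()
    {
        foreach (string src in ImageSources(SampleXhtml))
        {
            Console.WriteLine(src);
        }
    }
}
```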

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

namespace LINQTest
{
    class Program
    {
        static void Main(string[] args)
        {
            XElement xel = XElement.Load("http://msnwindowslive.spaces.live.com/feed.rss");
            IEnumerable<XElement> items = xel.Descendants("item");
            foreach (XElement el in items.Descendants("description"))
            {
                // ToString() returns the element's markup with the post body still escaped.
                string desc = el.ToString();

                // Turn the escaped angle brackets back into real ones.
                desc = desc.Replace("&lt;", "<");
                desc = desc.Replace("&gt;", ">");

                // Throws an XmlException if the post body is not well-formed XML.
                XElement els = XElement.Parse(desc);
                IEnumerable<XElement> imgs = els.Descendants("img");
                foreach (XElement img in imgs)
                {
                    string imag = img.ToString();
                    Console.WriteLine(imag);
                }
            }
            Console.ReadLine();
        }
    }
}

Here we are simply looping through all the blog posts, changing the escaped angle brackets back to what they should be and then parsing the result into an XElement. Unfortunately, when you run this code you get an error.

[Image: RuntimeError]

In this example the error is caused by the following :-

<a href="http://technorati.com/tags/AJAX" rel=tag>AJAX</a>

Can you spot why this line would throw up an error? It happens because the XHTML that Microsoft uses in WLS is non-standard: there is a mixture of quoted and unquoted attributes. According to the XHTML specification, all attribute values must be quoted, and the "rel=tag" attribute is not. Upon noticing this I went through the XHTML that WLS puts out, and there are numerous examples of Microsoft not playing by the rules. When this markup is parsed, the XML parser behind XElement.Parse expects every attribute value to be quoted; since this one is not, it cannot make sense of it and throws the error.

All the example blog posts that I was taking RSS feeds from were created using Windows Live Writer. Therefore I think that it’s actually Windows Live Writer that spits out non-standard XHTML, and not the fault of WLS, which really is just the display container, although I would be surprised if the editor within WLS itself actually produces standards-compliant XHTML.

So there you have it. What could have been a fairly interesting article showing you how to play around with WLS blog posts in your own code using LINQ has been cut short by Microsoft’s non-standards compliance.
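The difference can be reproduced in isolation with a small sketch. The AttributeQuoting class and its Parses method are made-up names; the first string is the offending line from above and the second is the same line with the rel value quoted.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class AttributeQuoting
{
    // Returns true when XElement.Parse accepts the markup,
    // false when it rejects it with an XmlException.
    public static bool Parses(string markup)
    {
        try
        {
            XElement.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string unquoted = "<a href=\"http://technorati.com/tags/AJAX\" rel=tag>AJAX</a>";
        string quoted = "<a href=\"http://technorati.com/tags/AJAX\" rel=\"tag\">AJAX</a>";

        Console.WriteLine(Parses(unquoted)); // False
        Console.WriteLine(Parses(quoted));   // True
    }
}
```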