
LiveSide - Developer Blog

Microsoft's lack of sticking to standards rears it's ugly head again

I was going to write a short article on how easy it is to extract information from Windows Live Spaces blog posts, and generally mess around with them, in your own code using LINQ. However, due to Microsoft's lack of sticking to standards, the part I was intending to write about has turned out to be impossible.

You can still do some things with Windows Live Spaces blog posts and LINQ, but unfortunately not the really interesting stuff. Let me explain.

You can access Windows Live Spaces blog posts through the RSS feed that Windows Live Spaces (WLS) exposes, for example "http://msnwindowslive.spaces.live.com/feed.rss". This is the main feed for the site. You can also select posts by category, for example "http://msnwindowslive.spaces.live.com/category/Programming/feed.rss". Simply insert "category/[category name]" after the main URL, followed of course by "feed.rss".
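The URL pattern described above can be wrapped in a small helper. This is only a sketch: the FeedUrls class and its method names are made up for illustration, and escaping the category name with Uri.EscapeDataString is my assumption, since I have not checked how WLS expects spaces or punctuation in category names to be encoded.

```csharp
using System;

namespace LINQTest
{
    // Hypothetical helper, not part of any WLS API.
    static class FeedUrls
    {
        // Main feed: [space URL]/feed.rss
        public static string MainFeed(string spaceUrl)
        {
            return spaceUrl.TrimEnd('/') + "/feed.rss";
        }

        // Category feed: [space URL]/category/[category name]/feed.rss
        // Assumption: the category name is percent-encoded like any other URL segment.
        public static string CategoryFeed(string spaceUrl, string category)
        {
            return spaceUrl.TrimEnd('/') + "/category/" + Uri.EscapeDataString(category) + "/feed.rss";
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            string space = "http://msnwindowslive.spaces.live.com";
            Console.WriteLine(FeedUrls.MainFeed(space));
            Console.WriteLine(FeedUrls.CategoryFeed(space, "Programming"));
        }
    }
}
```

For the space used in this article, the two calls produce the same two URLs quoted above.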

Since an RSS feed is really only an XML document, it can easily be loaded into an XElement or XDocument using LINQ to XML :-

using System;
using System.Xml.Linq;

namespace LINQTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // XElement.Load accepts a URI and downloads and parses the feed in one step.
            XElement xel = XElement.Load("http://msnwindowslive.spaces.live.com/feed.rss");
        }
    }
}

Here we are simply loading the RSS feed into an XElement object. Once you have the RSS feed you can set about querying it. As a quick example, here is some code that displays all the blog post titles from a WLS RSS feed :-

using System;
using System.Collections.Generic;
using System.Xml.Linq;

namespace LINQTest
{
    class Program
    {
        static void Main(string[] args)
        {
            XElement xel = XElement.Load("http://msnwindowslive.spaces.live.com/feed.rss");

            // Every <title> that sits inside an <item>, i.e. one per blog post.
            IEnumerable<XElement> titles = xel.Descendants("item").Descendants("title");
            foreach (XElement title in titles)
            {
                // Value is already a string, so no ToString() call is needed.
                Console.WriteLine(title.Value);
            }

            Console.ReadLine();
        }
    }
}

As you can see, this is extremely easy. First we load the RSS into an XElement object as described above. Since there may be more than one blog post title in the feed, we can't map it to a single XElement object, so we create a collection of XElement objects using IEnumerable<XElement>. We call this collection titles and assign to it all the "title" elements that are sub-elements of an "item". For those who are unfamiliar with RSS feeds, each blog post is an "item". Within each "item" there are numerous elements, including "title", which is the title of the blog post. Note that element names are case-sensitive, so they must be written in lower case exactly as they appear in the feed.

[Image: BlogPostTitles]

If you want the actual article itself you would simply swap "title" for "description" as the actual blog post is contained within the description element of each item.
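To make the title and description swap concrete without hitting the live feed, here is a sketch that runs against a tiny made-up feed held in a string. The SampleRss content and the FeedReader class are invented for illustration and are far simpler than a real WLS feed. It also shows that reading the element's value hands back the post body with its escaped angle brackets already decoded.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

namespace LINQTest
{
    // Hypothetical helper for illustration only.
    static class FeedReader
    {
        // Made-up two-item feed standing in for a real WLS feed.
        public const string SampleRss =
            @"<rss version=""2.0""><channel><title>Sample space</title>
                <item><title>First post</title><description>&lt;p&gt;Hello&lt;/p&gt;</description></item>
                <item><title>Second post</title><description>&lt;p&gt;World&lt;/p&gt;</description></item>
              </channel></rss>";

        // Returns "title: description" for each <item> in the feed.
        public static string[] Summaries(XElement rss)
        {
            return rss.Descendants("item")
                      // Element() only looks at direct children, so the channel's own <title> is skipped.
                      .Select(item => (string)item.Element("title") + ": " + (string)item.Element("description"))
                      .ToArray();
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            XElement rss = XElement.Parse(FeedReader.SampleRss);
            foreach (string line in FeedReader.Summaries(rss))
            {
                Console.WriteLine(line);
            }
        }
    }
}
```

Casting an XElement to string returns its text content, which for the first sample item is the decoded snippet "<p>Hello</p>".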

Now this is where things were going to get interesting and where the meat of the article would come from. A blog post in WLS is really only an XHTML snippet. XHTML, as you can probably tell from the "X", is really just XML. If it's just XML then we should be able to load the actual blog post into an XElement and start playing about with it, for example to extract all the images contained within the blog post, amend the XHTML etc.

Unfortunately, because the XHTML is non-standard, we cannot do this. When you try to parse the XHTML into an XElement object, you will receive a runtime error. On first inspection this is because all the angled brackets have been escaped (HTML encoded). I can definitely see why Microsoft did this, and it is very easy to fix: simply replace each escaped character sequence with the actual angled bracket and then parse the resulting string into an XElement object :-

using System;
using System.Collections.Generic;
using System.Xml.Linq;

namespace LINQTest
{
    class Program
    {
        static void Main(string[] args)
        {
            XElement xel = XElement.Load("http://msnwindowslive.spaces.live.com/feed.rss");
            IEnumerable<XElement> x = xel.Descendants("item");
            foreach (XElement el in x.Descendants("description"))
            {
                // ToString() serialises the whole <description> element, so the
                // unescaped markup below still has a single root element.
                string desc = el.ToString();
                desc = desc.Replace("&lt;", "<");
                desc = desc.Replace("&gt;", ">");

                // Throws an XmlException if the post is not well-formed XML.
                XElement els = XElement.Parse(desc);
                IEnumerable<XElement> imgs = els.Descendants("img");
                foreach (XElement img in imgs)
                {
                    string imag = img.ToString();
                    Console.WriteLine(imag);
                }
            }
            Console.ReadLine();
        }
    }
}

Here we are simply looping through all the blog posts, changing the escaped angled brackets back to what they should be and then parsing that into an XElement. Unfortunately when you run this code you get an error.

[Image: RuntimeError]

In this example the error is caused by the following :-

<a href="http://technorati.com/tags/AJAX" rel=tag>AJAX</a>

Can you spot why this line would throw up an error? It happens because the XHTML that Microsoft uses in WLS is non-standard. There is a mixture of quoted and unquoted attributes. According to the official specs, all attribute values must be quoted, and the value in "rel=tag" is not. Upon noticing this I went through the XHTML that WLS puts out, and there are numerous examples of Microsoft not playing by the rules. When this is parsed with LINQ to XML, the underlying XML parser expects every attribute value to be quoted; since this one is not, it doesn't know what to make of it and throws the error.
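The failure can be reproduced without the feed at all by parsing that one offending line. The sketch below does so and then, purely as an illustration, shows a naive regular expression that quotes bare attribute values so the line parses. The AttributeQuoting class and its pattern are my own hypothetical stop-gap and not a general fix: the pattern would also rewrite anything shaped like name=value in ordinary text or inside an already quoted attribute value.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

namespace LINQTest
{
    // Hypothetical helper for illustration only.
    static class AttributeQuoting
    {
        // Naive pattern: whitespace, an attribute name, '=', then a value
        // containing no quotes, whitespace or angle brackets.
        static readonly Regex Unquoted = new Regex(@"(\s[\w:-]+)=([^\s""'<>]+)");

        // Returns true and the parsed element if the markup is well-formed XML.
        public static bool TryParse(string markup, out XElement element)
        {
            try
            {
                element = XElement.Parse(markup);
                return true;
            }
            catch (XmlException)
            {
                element = null;
                return false;
            }
        }

        // Wraps every bare attribute value the pattern finds in double quotes.
        public static string QuoteAttributes(string markup)
        {
            return Unquoted.Replace(markup, "$1=\"$2\"");
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            string bad = "<a href=\"http://technorati.com/tags/AJAX\" rel=tag>AJAX</a>";
            XElement parsed;

            Console.WriteLine("As published: " + AttributeQuoting.TryParse(bad, out parsed));

            string repaired = AttributeQuoting.QuoteAttributes(bad);
            Console.WriteLine("After quoting: " + AttributeQuoting.TryParse(repaired, out parsed));
            Console.WriteLine(repaired);
        }
    }
}
```

The published line fails to parse, while the repaired line parses and exposes rel as an ordinary attribute. Patching one problem at a time like this quickly becomes a game of whack-a-mole on a real post.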

All the example blog posts that I was taking RSS feeds from were created using Windows Live Writer. Therefore I think that it's actually Windows Live Writer that spits out non-standard XHTML, and that this is not the fault of WLS, which really is just the display container, although I would be surprised if the editor within WLS itself actually gives standards compliant XHTML.

So there you have it. What could have been a fairly interesting article showing you how to play around with WLS blog posts in your own code using LINQ, cut short by Microsoft's non-standards compliance.

Comments

 

MisinformedDNA said:

The article was not cut short by Microsoft's non-standards compliance, it was cut short by what looks like a plugin for Technorati for WLW. Blogging sites often offer non-compliant sites because you can't expect bloggers to even know what the standards are. MS could autofix the HTML, but it is likely that that would cause an uproar from MS editing your post.

Now if you really want to make an interesting article. You can do 2 things. 1) Post about something that isn't obvious. Scott Guthrie already covered this topic pretty well here: weblogs.asp.net/.../using-linq-to-xml-and-how-to-build-a-custom-rss-feed-reader-with-it.aspx and 2) Look at www.codeplex.com/htmlagilitypack. It will fix malformed HTML for you.

Just do me one favor, don't use LiveSide as a megaphone for your personal griping. Don't report without researching ala "although I would be surprised if the editor within WLS itself actually gives standards compliant XHTML". If you want to report, do the research first.

 Hackersoft:
Thank you MisinformedDNA for the information you provided. However if you'd actually studied Scott's blog post that you linked to, you would see that he is actually using an RSS feed that has correctly formed HTML. Therefore a completely different scenario.
Also if you studied the RSS feed I pointed to you would notice that it is in fact non-standard HTML and therefore LINQ to XML would still have failed. What I didn't publish was that I started creating code to fix the problems as I found them, and got well into the body of the RSS feed before giving up as it was just not worth the hassle.
I agree that bloggers probably don't know compliant HTML and there should be no need for them to know. The tools should handle all of that for them, which Microsoft's are not.
And 2). It is impractical to run every single tag and possible attribute combination through the htmlagility pack. That defeats the purpose and doesn't fix the underlying problem of sloppy coding on Microsoft's part. Why should I or anyone need to do this?

So yes, I have done the research, which you obviously have not fully.

February 27, 2008 6:48 AM
 

kenbw2 said:

I'm not as picky as my friend MisinformedDNA here. But I do have one small gripe.

rears ITS ugly head!

If you're going to be a journalist, learn the English language!

February 28, 2008 4:48 AM
