Mark Berry June 2, 2010
I’ve been hosting this blog on my own Windows server since 2007 using BlogEngine.NET. Although I’ve been fairly happy with BlogEngine, I wanted to move my web sites to a hosted solution. The new provider hosts on Linux, so BlogEngine isn’t an option. It didn’t take me long to discover WordPress, and I decided to give it a whirl.
I set up WordPress 2.9.2 and began testing the export/import of my posts. Wow, what a lot of work that turned out to be!
Get the Latest BlogML Import Plugin
BlogEngine exports in BlogML format. WordPress does not support BlogML by default. Aaron Lerch blazed the trail by writing the initial plugin for importing data in BlogML format. There have been a couple of updates; the most recent one I could find, by Wayne John, added the (important) ability to import tags. Read Wayne’s article and get the plugin here. Be sure to see his helpful article on importing images here.
Editing the BlogML.xml File
In addition to updating the image paths as described in Wayne’s article, I found that I needed to make a few more edits to BlogML.xml before importing it into WordPress.
Make these edits using Notepad++:
- Change image paths as described in Wayne John’s article.
- Change “[more]” to “<!--more-->”.
- Correct the paths to other files by searching for “.axd” and manually changing them to the path they will have on the WordPress site.
- Change “&nbsp;” to “ ” (a regular space). This works around a nasty behavior of the import, which truncates a post whenever it encounters a &nbsp; entity. Apparently, in order to get TinyMCE to leave these entities in place, you’ll have to hack WordPress as I describe in this forum post.
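For anyone who would rather script the simple replacements than do them by hand in Notepad++, here is a minimal Python sketch. It is not what I actually ran. It assumes the markers appear literally in the file as “[more]” and “&nbsp;” (if your export escapes them differently, adjust the strings), and the function names are made up for illustration. The .axd paths still need to be fixed by hand, so the second helper only lists them.

```python
import re

def apply_simple_edits(text: str) -> str:
    """Apply the two mechanical search-and-replace edits to BlogML text."""
    # BlogEngine's "[more]" marker becomes the WordPress "more" comment tag.
    text = text.replace("[more]", "<!--more-->")
    # Assumption: the non-breaking space entity is spelled "&nbsp;" in the file.
    # Replacing it with a plain space avoids the truncation on import.
    text = text.replace("&nbsp;", " ")
    return text

def find_axd_references(text: str) -> list[str]:
    """List every URL-like token containing '.axd' so it can be fixed manually."""
    return re.findall(r"[^\s\"'<>]*\.axd[^\s\"'<>]*", text)
```

A typical use would be to read BlogML.xml as UTF-8, pass the text through apply_simple_edits, print the result of find_axd_references for review, and write the text back out as UTF-8.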
WordPress (and the way it implements the TinyMCE editor) has a strange relationship with the <br> tag. Basically, the <br> tags will be stripped out and the line breaks lost. However, it does respect a hard paragraph break (and I mean carriage-return-line-feed, not a <p> tag). Since my posts are full of <br> tags, I used Microsoft Word’s advanced Find and Replace to replace the line breaks:
- In Word 2007, go into Word Options. Under Advanced > General, check Confirm file format conversion on open.
- Open BlogML.xml and choose to Convert file from Encoded Text, then select Unicode (UTF-8) as the encoding.
- Replace “<br />^p” with “<br>”. (“^p” finds paragraph marks, so this removes the paragraph mark following each <br /> tag.)
- Replace “<br>^p” with “<br>”. (As above, but finds <br> tags without the slash.) Now no <br> tag is followed by a paragraph mark.
- Replace “<br>” with “^p”. Now all <br> tags are converted to hard paragraph breaks.
- Save the file in Word and close Word. The file is now back in ANSI. Open the file in Notepad++, change the Encoding back to UTF-8, and re-save it.
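The Word steps above can also be approximated in a few lines of Python, which avoids the round trip through ANSI. This is only a sketch of the same idea, not the procedure I used: it assumes the tags appear literally as <br> or <br /> in the file, and the function and file names are hypothetical.

```python
import re

# A <br> or <br /> tag, plus the line break that immediately follows it, if any.
BR_TAG = re.compile(r"<br\s*/?>(?:\r\n|\r|\n)?", re.IGNORECASE)

def br_to_hard_breaks(text: str) -> str:
    """Turn every <br> tag into exactly one carriage-return-line-feed."""
    return BR_TAG.sub("\r\n", text)

def convert_file(src_path: str, dst_path: str) -> None:
    """Read a UTF-8 file, convert the <br> tags, and write UTF-8 back out."""
    # newline="" stops Python from translating line endings on read or write.
    with open(src_path, encoding="utf-8", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(br_to_hard_breaks(text))
```

Calling convert_file("BlogML.xml", "BlogML-fixed.xml") would leave the original untouched, which makes it easy to compare the two files before importing.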
Manual Edits after the Import
After importing BlogML.xml into WordPress, I found that I had a lot of manual editing to do on the posts.
- Search for posts containing <blockquote>. In instances where this was used to create an indent and not to quote someone, use the WordPress/TinyMCE Visual Editor to change the style from block quote to indented. (WordPress uses the tag <p style="padding-left: 30px;"> to create an indent.)
- As I was working through the posts replacing <blockquote>s, I noticed that most posts that contained a less-than sign (<) were missing data. This symbol is represented as &lt; in the BlogML.xml file, so I laboriously searched through BlogML.xml in Notepad, opened each corresponding post in WordPress and BlogEngine, and copied over the text that got cut out.
- Even more alarming was stumbling upon a post that had been truncated, this time on a ± character, represented in BlogML.xml as &plusmn;. I copied the remainder of the text over to WordPress.
- The same thing as in the previous item happened with the last character in the word “voilà”: the &agrave; entity caused the post to truncate and I had to copy the remainder from the old blog to the new.
- Finally, I should note that many of my older posts, created before I started using Windows Live Writer, contain hard paragraph breaks. In HTML terms, those paragraph breaks are ignored and the paragraph flows normally. WordPress, however, interprets them as line breaks, inserting <br> tags before displaying them. I corrected this if I was making other edits, but I’m sure several posts still have odd line endings (example).
I’m afraid that other posts may have been truncated by special characters. I did use Notepad++ to do a regular expression search for HTML entities with 5, 6, or 7 characters. Interestingly, the import did not truncate after the entities for “, ’, or ™. And considering that there are 671 instances of the entity for the double quote ("), those must be working as well. Hopefully that covers most potential truncations. If not, I’ll keep BlogML.xml (or the old blog itself) and, if more text turns out to be missing, copy it over as it is discovered.
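If you want a more systematic check than my regular expression search, a short Python sketch like the one below tallies every HTML entity in the file so you can see which ones to test before importing. It is an illustration only: the pattern and the function name are my own assumptions, and it matches entities of any length, both named and numeric, rather than just the 5-to-7-character ones I searched for.

```python
import re
from collections import Counter

# Named entities (&name;), decimal references (&#123;), and hex references (&#x7B;).
ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")

def count_entities(text: str) -> Counter:
    """Return a tally of every entity string found in the text."""
    return Counter(ENTITY.findall(text))

def report(text: str) -> None:
    """Print the entities, most frequent first, one per line."""
    for entity, count in count_entities(text).most_common():
        print(f"{count:6d}  {entity}")
```

Running report on the contents of BlogML.xml gives a short list of distinct entities. Any entity on that list that you have not yet seen survive an import is a candidate for a truncated post.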