Proposal for WordPress Import/Export, GSoC 2010

Overview

The objective of the project is to import contents from a remote WordPress blog without using an XML file. This can be done through XML-RPC, which WordPress already supports. I will first compare three possible techniques and explain why XML-RPC is the best choice. Then I will discuss the details of the import and list what I have done so far. The last part covers the challenges and risks that may arise, together with my own thoughts on them.

Choice of Method

There are basically three techniques that could meet this objective.

  1. Copying the whole database
  2. Recognizing HTML tags
  3. XML-RPC

The first method sounds quick and simple. However, it requires database credentials, and not all users have access to the database. Besides, simply copying the database leads to wrong options on the new blog, because the old options depend on the old domain name and are copied over unchanged. This is the very issue that Moving WordPress is expected to solve. So I don’t think method 1 is suitable for this project.

The second method is to recognize the HTML tags when reading the web page. This was my initial idea when I first saw this project. For instance, the structure of a post may look like this:

<div class="title">    title of the post     </div>
<div class="author">   author of the post    </div>
<div class="date">     date of the post      </div>
<div class="entry">    content of the post   </div>
<div class="comments"> comments of the post  </div>

The specific elements can be extracted with regular expressions. However, I later found that the HTML tags vary as the themes differ. For example, the content of the post may lie between <div class="entry"> and </div> as above, but in another theme it may lie between <div class="content"> and </div>. The markup depends heavily on the theme design, so the HTML tags are highly unpredictable. I therefore don’t think it is easy to write a script that can cope with all the different HTML tags and reliably extract what we need.
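
The fragility described above can be shown with a small sketch. It is written in Python only for illustration, and both theme snippets are hypothetical. A pattern written against one theme finds the post body there and finds nothing in the other theme.

```python
import re

# Two hypothetical themes that mark up the same post differently.
THEME_A = '<div class="title">Hello</div><div class="entry">First post body</div>'
THEME_B = '<h2 class="post-title">Hello</h2><div class="content">First post body</div>'

# Pattern written against theme A only.
ENTRY_PATTERN = re.compile(r'<div class="entry">(.*?)</div>', re.DOTALL)


def extract_entry(html):
    """Return the post body if the theme-A markup is present, else None."""
    match = ENTRY_PATTERN.search(html)
    return match.group(1).strip() if match else None


body_a = extract_entry(THEME_A)  # the markup matches, so the body is found
body_b = extract_entry(THEME_B)  # different class name, so nothing is found
```

Supporting a second theme means adding a second pattern, and so on for every theme, which is why this method does not scale.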

Finally, we come to the third method, XML-RPC (remote procedure calls using XML as the encoding). By sending an XML-RPC request, we get contents encoded according to the XML standard. The unpredictability of HTML tags is eliminated, and the import is easy to carry out because the contents are well-formed.

WordPress already includes XML-RPC support, and it is convenient to use. There is no need for the wasted work of method 2, such as transferring whole pages and recognizing HTML tags with regular expressions. We just send a request with the chosen API, and the response is concise and well-formed. I think this is the best method both for exporting contents from the old blog and for importing them into the new one.

For the reasons above, I choose XML-RPC to achieve the objective. The details are discussed in the following section.

Details of Import

First, an XML-RPC client needs to be constructed. I plan to use the Incutio XML-RPC Library to build the client. The library can be found at wp-includes/class-IXR.php. The general steps to make a client, send a request and receive a response can be found here.

After the client is constructed, it is ready to send a request to the XML-RPC server with the chosen API and the credentials for the old blog. WordPress now supports four different classes of API, which together contain nearly all the functions we need. See this page for details. For different contents, I plan to use different API methods, as listed below.

  • Posts: metaWeblog.getRecentPosts
  • Comments: wp.getComments
  • Tags: wp.getTags
  • Categories: wp.getCategories
  • Pages: wp.getPages
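
To make the request step concrete, here is a minimal sketch of how such a call is encoded. It uses Python's standard xmlrpc.client in place of the Incutio library, purely for illustration. The blog id, user name and password are placeholders, and the parameter order (blog id, user name, password, then method-specific arguments) is my understanding of these APIs rather than something verified against every method.

```python
import xmlrpc.client


def build_request(method, blog_id, username, password, *extra):
    """Encode an XML-RPC call as the XML document that would be POSTed."""
    params = (blog_id, username, password) + extra
    return xmlrpc.client.dumps(params, methodname=method)


# Ask for the 10 most recent posts (placeholder credentials).
request_xml = build_request("metaWeblog.getRecentPosts", 1, "admin", "secret", 10)

# Decoding the request again shows exactly what the server would receive.
decoded_params, decoded_method = xmlrpc.client.loads(request_xml)
```

Against a live blog the same call could be made with xmlrpc.client.ServerProxy pointed at the blog's xmlrpc.php endpoint; the sketch stays offline so that it runs without a server.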

For links I am not planning to use XML-RPC. WordPress already has an import tool for links using OPML and it works well.

The response contains all, or a requested number, of the objects arranged in an array. For instance, metaWeblog.getRecentPosts returns the requested number of posts, each containing the title, content, creation time, tags and other information. If the number exceeds the total, all the posts are returned. The other APIs likewise return an array of the objects they are responsible for.
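
The sketch below shows how such an array response can be decoded, again with Python's xmlrpc.client as a stand-in for the PHP client. The sample response is generated locally, not captured from a real blog, and the field names (postid, title, description, mt_keywords) are assumptions based on the MetaWeblog conventions.

```python
import xmlrpc.client

# Hypothetical payload shaped like a metaWeblog.getRecentPosts result.
SAMPLE_POSTS = [
    {
        "postid": "7",
        "title": "Hello world",
        "description": "First post body",
        "mt_keywords": "news, intro",
    }
]
# Encode it as a methodResponse document, as a server would send it.
SAMPLE_RESPONSE = xmlrpc.client.dumps((SAMPLE_POSTS,), methodresponse=True)


def parse_posts(response_xml):
    """Decode a response and keep only the fields the importer needs."""
    (posts,), _method = xmlrpc.client.loads(response_xml)
    parsed = []
    for post in posts:
        keywords = post.get("mt_keywords", "")
        parsed.append({
            "old_id": int(post["postid"]),
            "title": post["title"],
            "content": post["description"],
            "tags": [t.strip() for t in keywords.split(",") if t.strip()],
        })
    return parsed


posts = parse_posts(SAMPLE_RESPONSE)
```

Because every post arrives as a struct in an array, extracting a field is a plain dictionary lookup and no markup needs to be recognized.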

When it comes to the import, it is easy to extract each element we need from the response, since it is structured as an array. So far, I have finished separate importers for posts and their comments, tags, categories and pages. The source files are listed below.

WordPress already handles the links import. But for my personal needs, I wrote an import tool that doesn’t need OPML. The tool is called Simple Links Importer.

XML-RPC does not seem to support transferring media files. I will discuss the import of media files in the section Challenges and Risks.

What remains is to integrate these separate tools into a whole that can properly import all the objects. Because the objects are related to one another, the import process must be planned carefully. Generally speaking, a comment is related to a specific post. A post may be related to several tags and categories. A link may be related to a specific link category. The relations are stored in different tables, which makes the import process a little more complicated.

In my opinion, the tags and categories should be imported first, followed by the posts and pages. While importing each post or page, its tags and categories can be attached by matching them against the terms that already exist. After that, the comments for each post and page can be imported by matching the post ID or page ID. The links should be added last, since they depend least on the other objects.
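
The ordering above can be sketched as follows. This is a simplified Python illustration with in-memory dictionaries standing in for the WordPress tables; the data layout and the function name import_blog are hypothetical. The key point is the map from old post IDs to new ones, which lets comments find their parent post after the IDs have changed.

```python
def import_blog(export):
    """Import in dependency order and return (new_blog, old_to_new_post_ids)."""
    blog = {"terms": {}, "posts": {}, "comments": [], "links": []}
    post_id_map = {}

    # 1. Tags and categories first, so that posts can refer to them.
    for name in export["tags"] + export["categories"]:
        blog["terms"].setdefault(name, len(blog["terms"]) + 1)

    # 2. Posts and pages, attached to the terms that now exist.
    for post in export["posts"]:
        new_id = len(blog["posts"]) + 1
        blog["posts"][new_id] = {
            "title": post["title"],
            "term_ids": [blog["terms"][name] for name in post["terms"]],
        }
        post_id_map[post["id"]] = new_id

    # 3. Comments, matched to their post through the ID map.
    for comment in export["comments"]:
        new_post_id = post_id_map.get(comment["post_id"])
        if new_post_id is None:
            continue  # parent post was not imported, skip the comment
        blog["comments"].append({"post_id": new_post_id, "text": comment["text"]})

    # 4. Links last, since nothing else depends on them.
    blog["links"].extend(export["links"])
    return blog, post_id_map


EXPORT = {
    "tags": ["intro"],
    "categories": ["News"],
    "posts": [{"id": 41, "title": "Hello", "terms": ["intro", "News"]}],
    "comments": [{"post_id": 41, "text": "Nice"}, {"post_id": 99, "text": "orphan"}],
    "links": ["http://example.org/"],
}
blog, id_map = import_blog(EXPORT)
```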

The order of import is open to discussion and can be improved for a better and faster process. This is only my initial thought.

User Interface

A friendly user interface is necessary for good software, as are import options that users can customize. Ajax is popular on blogs and well suited to building such an interface. As for the options, a good deal of discussion is needed to decide which ones should be available to users, considering both flexibility and robustness.

Form of the Tool

One thing that can be settled in advance is that the software is meant to exist as an importer tool, just like the others stored in wp-admin/import/. It would not be loaded when users log in, and they can call it whenever they want. The tool can also be installed or uninstalled smoothly without any change to the database or other code. I think this is the best choice for keeping the WordPress core code efficient and lightweight.

Challenges and Risks

Although I have tried to consider the project comprehensively, several problems are still left to be solved.

  1. The first one is missing information. With XML-RPC we can get well-formed contents and insert them into the database conveniently, but some information is unavoidably lost. This issue is mainly rooted in the design of the API. For instance, the response of wp.getComments does not include the date of the comments, although the date_gmt is returned. Such gaps would not obstruct the import process much, but I still hope to find a solution for the missing information.
        
  2. The second one is that media files cannot be imported by XML-RPC. In the idea of Moving WordPress, this is an important task alongside the contents import. In my opinion, it can be achieved by using the PHP filesystem functions and searching for certain types of links. We can first import the posts, and then use regular expressions to find specific types of links in them, such as links to pictures or mp3 files. Next, we use PHP functions like file_get_contents() to download the files to the new blog. Lastly, we rewrite all the links to match the new domain name and the new file locations, which is also an important task for Moving WordPress. However, many problems are likely to occur because of the changeable Internet environment and each website’s settings. This is only my initial thought, and more discussion and study are needed before this problem can be solved.

    Another way to transfer the media files is FTP. I saw many proposals for Moving WordPress discuss this method. The media files on the old blog should be zipped before the transfer. An intermediate server is an option in case of a slow connection. An MD5 check can be used to verify the files’ integrity. I am not sure whether this approach fits WordPress Import/Export. I would like more discussion and study to figure out which way is proper and how to achieve it in code.
          

  3. The last one is interruption recovery. For a blog with lots of posts and comments, the import may take a long time to complete. There is a good chance that the process is interrupted after some changes have already been made to the database. The recovery system should resume correctly from the unfinished work and finally import all the contents properly. This is not easy, given the various causes of interruption. However, I would like to try to achieve it during this Summer of Code.
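
The link-finding and link-rewriting steps of the media import (point 2 above) can be sketched as follows. This is a rough Python illustration rather than the PHP that the tool itself would use. The host names, the list of file extensions and the function names are assumptions, and the download step is left out.

```python
import re
from urllib.parse import urlsplit

# Absolute URLs ending in a few media extensions (an illustrative, incomplete list).
MEDIA_URL = re.compile(r'https?://[^\s"\'<>]+\.(?:jpe?g|png|gif|mp3)', re.IGNORECASE)


def find_media_urls(html, old_host):
    """List the media URLs in a post that are hosted on the old blog."""
    return [url for url in MEDIA_URL.findall(html) if urlsplit(url).netloc == old_host]


def rewrite_media_urls(html, old_host, new_base):
    """Point media URLs from the old host at the new blog, keeping the file path."""
    def swap(match):
        url = match.group(0)
        parts = urlsplit(url)
        if parts.netloc != old_host:
            return url  # media hosted elsewhere is left untouched
        return new_base.rstrip("/") + parts.path
    return MEDIA_URL.sub(swap, html)


POST = (
    '<p><img src="http://old.example/wp-content/uploads/2010/03/cat.jpg"> '
    '<a href="http://other.example/song.mp3">song</a></p>'
)
found = find_media_urls(POST, "old.example")
rewritten = rewrite_media_urls(POST, "old.example", "http://new.example")
```

Each URL in found would be downloaded before the post content is rewritten. Only files on the old blog's own host are touched, so media hosted on other sites keeps working as it is.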

Potential mentors

No preference.

Summary

All of the above is my proposal for the WordPress Import/Export project of GSoC 2010. The two lists below show what I have already done and what remains to be done.

Already done:

  • Posts import
  • Comments import
  • Links import
  • Tags import
  • Categories import
  • Pages import

To do:

  • Improvement and integration of the work already done
  • User interface design
  • Interruption recovery mechanism
  • Handling of missing information
  • Media files import

This is my first year applying for GSoC, and I chose this project out of personal interest. I have been working on this problem for my own needs, during which I developed a great passion for WordPress and gained some insight into it. Therefore, I wish to complete my work and make it more practical. I am looking forward to participating in GSoC so that, with your help, I can learn more about WordPress and finish the project smoothly. What’s more, I hope that my effort can contribute to the improvement of WordPress.
