Import DokuWiki pages into WordPress

I used to maintain a wiki full of personal ideas using DokuWiki, but at some point I gave up on wiki syntax entirely. A few months ago, while moving all of my sites and content over to a new server and consolidating things as much as possible, I decided to import all of that old content into a WordPress install (actually a single site within the same Multisite install that runs Dented Reality). I ended up writing the following super-rough script to scrape the contents of the pages and throw them into WordPress. Scraping the rendered pages meant that I got the actual output of all plugins etc., plus working links between pages.

The script requires minimal configuration, and then you run it from the command line (php the-file-you-saved-it-as.php). It loops through the index pages of your wiki, downloads each page and parses out the actual content (I have only used it on the default template, so I don’t know how it’ll go with anything else). It strips out a few things (table of contents, edit links) and attempts to fix up internal links by dropping the /wiki/ prefix and replacing _ with -. It also attempts to maintain parent/child relationships of pages based on their namespaces, one level deep. It worked for me, but YMMV. Have at it:

<?php
/**
* Super rough and tumble WordPress import script for Dokuwiki.
* Based on a very old DW install that was using the default theme. Probably won't work for anything else.
* You will want to change some things if your wiki is installed anywhere other than /wiki/.
* Also check out the wp_insert_post() stuff to see if you want to change it.
*/

require 'wp-load.php';
require_once ABSPATH . 'wp-admin/includes/post.php';

// List of Index URLs (one for each namespace is required)
// These will be crawled, all pages will be listed out, then crawled and imported
$indexes = array(
	'http://urltodokuwiki.com/wiki/index?do=index',
	'http://urltodokuwiki.com/wiki/?idx=namespace',
);

$author = 1; // The user_ID of the author to create pages as

function dokuwiki_link_fix( $matches ) {
	return '<a href="/' . str_replace( '_', '-', $matches[1] ) . '" class="wikilink1"';
}

$imported_urls = array(); // Stuff we've already processed

$created = 0;
foreach ( $indexes as $index ) {
	echo "Crawling $index for page links...\n";
	$i = file_get_contents( $index );

	if ( !$i )
		die( "Could not download $index\n" );

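	// Get index page and parse it for links
	if ( ! preg_match( '!<ul class="idx">(.*)</ul>!sUi', $i, $matches ) ) {
		// No index list in the markup (different template?), so nothing to crawl here
		echo "  No page index found in $index, skipping.\n";
		continue;
	}
	preg_match_all( '!<a href="([^"]+)" class="wikilink1"!i', $matches[0], $matches );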

	$bits = parse_url( $index );
	$base = $bits['scheme'] . '://' . $bits['host'];

	// Now we have a list of root-relative URLs, let's start grabbing them
	foreach ( $matches[1] as $slug ) {
		$url = $page = $raw = '';

		if ( in_array( $slug, $imported_urls ) )
			continue;
		$imported_urls[] = $slug; // Even if it fails, we've tried once, don't bother again

		// The full URL we're getting
		$url = $base . $slug;
		echo "  Importing content from $url...\n";

		// Get it
		$raw = file_get_contents( $url );
		if ( !$raw )
			continue;

		// Parse it -- DokuWiki conveniently HTML-comments where it's outputting content for us
		preg_match( '#<!-- wikipage start -->(.*)<!-- wikipage stop -->#sUi', $raw, $matches );
		if ( !$matches )
			continue;

		$page = $matches[1];

		// Need to clean things up a bit:
		// Remove the table of contents
		$page = preg_replace( '#<div class="toc">.*</div>\s*</div>#sUi', '', $page );

		// Strip out the Edit buttons/forms
		$page = preg_replace( '#<div class="secedit">.*</div></form></div>#sUi', '', $page );

		// Fix internal links by making them root-relative
		$page = preg_replace_callback(
			'#<a href="/wiki/([^"]+)" class="wikilink1"#si',
			'dokuwiki_link_fix',
			$page
		);

		// Grab a page title -- first h1 or convert the slug
		if ( preg_match( '#<h1.*</h1>#sUi', $page, $matches ) ) {
			$page_title = strip_tags( $matches[0] );
			$page = str_replace( $matches[0], '', $page ); // Strip it out of the page, since it'll be rendered separately
		} else {
			$page_title = str_replace( '/wiki/', '', $slug );
			$page_title = ucwords( str_replace( '_', ' ', $page_title ) );
		}

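		// Get last modified from raw content (fall back to now if the footer isn't in the expected format)
		if ( preg_match( '#Last modified: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})#i', $raw, $matches ) ) {
			$last_modified = $matches[1];
		} else {
			$last_modified = current_time( 'mysql' );
		}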

		// Resolve parent if we're in a namespace
		$slug = str_replace( '/wiki/', '', $slug );
		if ( stristr( $slug, '/' ) ) {
			$parts = explode( '/', $slug );
			$slug = $parts[1];
			$parts[0] = str_replace( '_', '-', $parts[0] );
			$parent = get_posts( array( 'post_type' => 'page', 'post_status' => 'publish', 'name' => $parts[0] ) );
			if ( $parent ) {
				$parent = $parent[0]->ID;
			}
			else {
				// No parent found -- create a placeholder (will be an empty page with
				// the same last modified as the page we're working with).
				$post = array(
					'post_status'   => 'publish',
					'post_type'     => 'page',
					'post_author'   => $author,
					'post_parent'   => 0,
					'post_content'  => '',
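					// wp_insert_post() takes post_modified from post_date on insert, so pass both
					'post_date'     => $last_modified,
					'post_modified' => $last_modified,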
					'post_title'    => ucwords( str_replace( '-', ' ', $parts[0] ) ),
					'post_name'     => $parts[0],
				);

				$parent = wp_insert_post( $post );
				$created++;
				echo "    Created parent page for $url using $parts[0]\n";
			}
		} else {
			$parent = 0; // top-level page
		}

		$post = array(
			'post_status'   => 'publish',
			'post_type'     => 'page',
			'post_author'   => $author,
			'post_parent'   => $parent,
			'post_content'  => $page,
			'post_title'    => $page_title,
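			// wp_insert_post() takes post_modified from post_date on insert, so pass both
			'post_date'     => $last_modified,
			'post_modified' => $last_modified,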
			'post_name'     => str_replace( '_', '-', $slug ),
		);

		wp_insert_post( $post );
		$created++;
	}
}

echo "\nDone! Created $created pages in WordPress, based on your Dokuwiki install.\n";
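
If you want to sanity-check the link and namespace handling before pointing the script at a real wiki, here's a rough standalone sketch of those two steps with no WordPress required. The function names dokuwiki_rewrite_links() and dokuwiki_split_slug() are made up for this example and are not part of the script above. Like the script, it assumes the wiki lives under /wiki/ and that namespaces are only one level deep.

```php
<?php
// Hypothetical helper: rewrite internal wiki links the same way the import
// script does, dropping the /wiki/ prefix and swapping _ for -.
function dokuwiki_rewrite_links( $html ) {
	return preg_replace_callback(
		'#<a href="/wiki/([^"]+)" class="wikilink1"#si',
		function ( $m ) {
			return '<a href="/' . str_replace( '_', '-', $m[1] ) . '" class="wikilink1"';
		},
		$html
	);
}

// Hypothetical helper: split a root-relative wiki path into the parent page
// slug (null for a top-level page) and the page's own slug.
function dokuwiki_split_slug( $path ) {
	$slug   = str_replace( '/wiki/', '', $path );
	$parent = null;

	if ( false !== strpos( $slug, '/' ) ) {
		$parts  = explode( '/', $slug );
		$parent = str_replace( '_', '-', $parts[0] );
		$slug   = $parts[1]; // Only the first namespace level is handled
	}

	return array(
		'parent' => $parent,
		'name'   => str_replace( '_', '-', $slug ),
	);
}

echo dokuwiki_rewrite_links( '<a href="/wiki/projects/my_big_idea" class="wikilink1">Idea</a>' ), "\n";
print_r( dokuwiki_split_slug( '/wiki/projects/my_big_idea' ) );
print_r( dokuwiki_split_slug( '/wiki/start' ) );
```

Running that against a few of your own page paths should tell you quickly whether the resulting slugs and parents look right, or whether you need to adjust the /wiki/ prefix first.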
    • Beau Lebens said:

      This is a complete standalone script. It needs to exist at the root level of your WordPress install and then you can either run it from the command line (as mentioned in the post), or you can just request that file via your web browser. Make sure you delete it once done. And do it on a test blog first.
