I decided to do a real hack on the WordPress WXR file first: insert the CDATA escapes so that XML parsers don’t croak. That actually worked. With a 4.7MB XML file, it took three minutes to parse the resulting XML (Ruby REXML), with one core pegged at 100%. Firefox can parse it too, but only slightly faster.
That explains why the WP developers don’t use an XML parser. Alarm gnomes at shared hosts wake up and grouch when too much CPU or clock time is given to one process. I know mine did, and it was using the faster PHP method. Fact is, the damn WXR file is too large to suck in.
I’m writing a WXR splitter. It’ll be handy to have around, just in case. Yet another state machine.
Time to exhale in little pants of breath. Creating multiple small files is next. Testing the little files involves uploading to my test webserver and dumping and creating databases from across the net, and that goes much slower.
I’ll share the script, no license required, when it’s working well enough for me. Your mileage may vary. The real-life WXR I’m testing with reports 79 lines of preamble WXR stuff, 2 lines of post-amble stuff and 1594 posts of varying length and line count. The 79 and 2 are correct. I know the data set, and 1594 is about right. Probably exactly right, but I have to test the schuff first.
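For readers curious what such a state machine might look like, here is a minimal sketch in Ruby. It is not the actual SplitWXR code. The names `scan_wxr` and `split_wxr` are made up for illustration, and it assumes the `<item>` and `</item>` tags each sit on their own line and that no post body contains a literal `</item>`.

```ruby
# Hypothetical sketch, not the real SplitWXR script.
# Walks the lines once and sorts them into three buckets:
# preamble (before the first <item>), posts, and post-amble.
def scan_wxr(lines)
  state = :preamble
  preamble = []
  items = []
  tail = []      # lines seen since the last </item>
  current = nil  # lines of the post being collected

  lines.each do |line|
    if state != :in_item && line.include?("<item>")
      # Anything collected between two posts rides along with the next post.
      current = tail + [line]
      tail = []
      state = :in_item
    elsif state == :in_item
      current << line
    elsif state == :preamble
      preamble << line
    else
      tail << line
    end

    if state == :in_item && line.include?("</item>")
      items << current
      current = nil
      state = :between
    end
  end

  # Whatever is left in tail after the last post is the post-amble.
  [preamble, items, tail]
end

# Builds the text of each small file: preamble + some posts + post-amble.
def split_wxr(lines, posts_per_file)
  preamble, items, postamble = scan_wxr(lines)
  items.each_slice(posts_per_file).map do |group|
    (preamble + group.flatten + postamble).join
  end
end
```

Calling something like `split_wxr(File.readlines("export.xml"), 200)` (file name and count are placeholders) would return one string per output file, each a complete WXR document with the shared preamble and post-amble wrapped around its share of the posts.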
[Update Nov 22, 2008]
The SplitWXR app is available here. Pick the one with the highest version number, OK? Pick the one for Windows (.exe), Mac OS X (.dmg) or Linux (.run). It’s unlikely you’ll have the Shoes GUI framework and the correct version of Ruby, so the app will download and install them if you allow it. It does not do a system-wide install. It’s known to work on Windows XP and Ubuntu Linux 8.04. It should work on OS X. I’d be happy to hear that it does. (There’s also the command line script ‘splitWxr.rb’ for those who don’t want a GUI.)
It’s not the prettiest application, but you’ll probably only use it once. Output files will be in the same directory (folder) as the input file. There are options for picking the size of each output file and the partial output filename (and there are reasonable defaults for each). It’s likely that it’ll split the file faster than the GUI status pane will refresh.
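As a rough illustration of those two options, the sketch below groups posts so each output file stays under a size limit and builds output names next to the input file. It is not taken from SplitWXR. The helper names and the `prefix-NNN.xml` naming pattern are assumptions made up for this example.

```ruby
# Hypothetical sketch, not the real SplitWXR code.
# Groups post strings so each group totals at most max_bytes.
# A single post larger than max_bytes still gets a group of its own.
def group_by_size(items, max_bytes)
  groups = [[]]
  used = 0
  items.each do |item|
    if used + item.bytesize > max_bytes && !groups.last.empty?
      groups << []
      used = 0
    end
    groups.last << item
    used += item.bytesize
  end
  groups.reject(&:empty?)
end

# Builds an output path in the same directory as the input file.
# The "prefix-003.xml" pattern is an assumed naming scheme.
def output_path(input_path, prefix, index)
  File.join(File.dirname(input_path), format("%s-%03d.xml", prefix, index))
end
```

The size limit here counts only the posts themselves. A real splitter would also want to leave room for the preamble and post-amble that get copied into every file.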
Once you’ve got your new files uploaded to your website, you can safely delete the SplitWXR/Shoes/Ruby if you find the folder (it varies by OS so I don’t know where it is for you).
Hi there. I would LOVE to have a script to split the WXR. Care to share it?
Tal.
I guess I’m the one who posted a comment here. I’m not a spammer who leaves comments for crappy backlinks. May I know why my comments are gone?
I downloaded your Shoes (CMIIW) application before… but it didn’t work anyway…
If there’s a chance for me to try and test it, that would be nice, I think…
regards.
Anak,
There is an updated version of the Shoes scripts that puts the output files in the correct folder.
http://www.mvmanila.com/public
VOILA… The great and fine things I got from there…
Many thanks, mate. And sure… I will publish it on my blog so it will be easier for anyone who needs a tool for splitting their WXR…
thanks.
Thank you for helping test the Shoes script and reporting errors.
OMG… thank you Cecil, your “SplitWXR” was a lot… lot… lot of help to me… as I’ve been trying for 2 months to upload my freakin 4MB WXR file… thanks to you… I’m now able to get all my files together again.
Thank you…and god bless you.