
Hello, PHP-free world


All my websites are PHP-free now. They are based on Hugo and hosted on S3, with the only dynamic content being URL rewrites done by CloudFlare Worker scripts.

Quick braindump on how to export a WordPress site:

  • first, run the export from the CLI and download the created ZIP
  • extract articles to relevant Hugo path
  • remove drafts and perform some basic cleanups (like ‘category: no category’)
  • to convert the mess my set of image plugins left behind, I wrote the script below; it fixes one exported markdown article at a time:
# converts lines like:
# [<img decoding="async" loading="lazy" src="" alt="rondo1" width="300" height="225" class="alignnone size-medium wp-image-191" srcset=" 300w, 640w" sizes="(max-width: 300px) 100vw, 300px" />][1]
# into plain Markdown images: ![alt](src)

import re, sys

file = open(sys.argv[1], "r")
lines = file.readlines()
file.close()

for i in range(0, len(lines)):
    l = lines[i]
    html_matches = re.search('(<img .*? />)', l)

    if html_matches:
        html = html_matches.groups()[0]

        alt = ""
        alt_matches = re.search('alt="(.*?)"', html)
        if alt_matches:
            alt = alt_matches.groups()[0]

        src = None
        srcset_matches = re.search('srcset="(.*?)"', html)
        src_matches = re.search('src="(.*?)"', html)
        if srcset_matches:
            # pick the shortest URL in the srcset: that is the original
            # image, without the -WxH size suffix WordPress appends to
            # resized thumbnail variants
            srcset = srcset_matches.groups()[0]
            src_descs = srcset.split(", ")
            src = src_descs[0].split(" ")[0]
            for src_desc in src_descs:
                candidate_src = src_desc.split(" ")[0]
                if len(candidate_src) < len(src):
                    src = candidate_src
        elif src_matches:
            src = src_matches.groups()[0]

        if src is None:
            continue

        md = "![" + alt + "](" + src + ")"
        lines[i] = l.replace(html, md)

file = open(sys.argv[1], "w")
file.writelines(lines)
file.close()
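To illustrate the srcset heuristic above: WordPress names resized variants with a `-WxH` suffix, so the original file is simply the shortest URL in the set. A minimal sketch (the file names below are made up for illustration):

```python
# hypothetical srcset string, mimicking WordPress output
srcset = "/uploads/rondo1-300x225.jpg 300w, /uploads/rondo1.jpg 640w"

# each entry is "URL WIDTHw"; keep the shortest URL, i.e. the original
src_descs = srcset.split(", ")
src = src_descs[0].split(" ")[0]
for src_desc in src_descs:
    candidate = src_desc.split(" ")[0]
    if len(candidate) < len(src):
        src = candidate

print(src)
```

This prints the un-suffixed original, `/uploads/rondo1.jpg`.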
  • after that, replace all URLs pointing to the old site (full URIs with FQDN) with local ones (e.g. using gsed -i 's||/wp-content/uploads/|g' *)
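If gsed is not at hand, the same absolute-to-relative rewrite can be sketched in Python; the old-site domain below is a hypothetical stand-in for the real FQDN:

```python
# hypothetical old-site prefix; substitute the real FQDN here
OLD_PREFIX = "https://old.example.com/wp-content/uploads/"

def localize(text):
    # rewrite absolute upload URLs to site-relative ones
    return text.replace(OLD_PREFIX, "/wp-content/uploads/")

print(localize("![x](https://old.example.com/wp-content/uploads/2020/01/a.jpg)"))
```

Applied to each exported file in place, this does the same job as the sed one-liner.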
  • locate all files referenced from markdown; the script I wrote for that parses one file at a time, converts the markdown to XML, and extracts all image sources and link targets:

import sys
import markdown
from lxml import etree

file = open(sys.argv[1], "r")
lines = file.readlines()
file.close()

urls = []

for line in lines:
    # render each markdown line to HTML; wrap it in a root element so
    # empty lines and multiple top-level tags still parse as XML
    doc = etree.fromstring("<root>" + markdown.markdown(line) + "</root>")
    for a in doc.xpath('//a'):
        # print('LINK:', a.text, a.get('href'))
        urls.append(a.get('href'))
    for img in doc.xpath('//img'):
        # print('IMG: ', img.get('alt'), img.get('src'))
        urls.append(img.get('src'))

# print only local upload paths, ready for sort | uniq in the shell
for url in urls:
    if url and url.find('/wp-content/uploads/') == 0:
        print(url)
  • run that against all files: for file in *.md; do python3 $file; done | sort | uniq | tee urls.txt
  • for all missing files attempt copy from hugo-export:
# start in dir with posts

cd ../../../
mkdir -p 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
for y in *; do mkdir -p $y/01 $y/02 $y/03 $y/04 $y/05 $y/06 $y/07 $y/08 $y/09 $y/10 $y/11 $y/12; done
cd -

for file in `cat urls.txt`; do cp -rv "../../../../../exports/attemp_01/hugo-export/$file" ../../../$file ; done | tee log.txt
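The copy loop above can also be sketched in Python so that only files actually missing from the target tree are touched; the source and destination roots below are illustrative placeholders, not the real paths:

```python
import os, shutil

# illustrative paths; adjust to the real export and site layout
SRC_ROOT = "hugo-export"
DST_ROOT = "site"

def copy_missing(urls):
    # copy each /wp-content/uploads/... path from the export tree
    # into the site tree, skipping anything already present
    copied = []
    for url in urls:
        src = SRC_ROOT + url
        dst = DST_ROOT + url
        if os.path.exists(src) and not os.path.exists(dst):
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(url)
    return copied
```

Fed with the contents of urls.txt, it reports which attachments were actually restored.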
  • OLD:

and move them from the exported path to ./static/wp-content/uploads/; then check for broken links, for example using htmltest on the Hugo-generated public/ dir, or something like:

curl | grep '<loc>' | sed 's/.*<loc>//g' | sed 's/<\/loc>//g' | sort | uniq > links.txt
sed -i 's|localhost||g' links.txt
for link in `cat links.txt | grep '/20[0-9][0-9]'`; do blc $link  --host-requests 512 --requests 512 --exclude --exclude --exclude  --exclude --exclude --exclude --exclude --exclude --exclude; done | tee ../../../../../results.txt 
cat ../../../../../results.txt | grep "BROKEN" | awk '{print $2}' | grep "" | sed 's|||g' > missing.txt
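The curl/sed pipeline that pulls the <loc> entries out of the sitemap can also be sketched with the Python stdlib XML parser; the sitemap snippet below is a hypothetical example of what Hugo generates:

```python
import xml.etree.ElementTree as ET

# hypothetical sitemap snippet; in practice fetch it with curl or urllib
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://localhost/2020/01/hello/</loc></url>
  <url><loc>http://localhost/about/</loc></url>
</urlset>"""

root = ET.fromstring(sitemap)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# strip the host, mirroring the sed 's|localhost||g' step above
links = sorted({loc.text.replace("http://localhost", "")
                for loc in root.findall(".//sm:loc", ns)})
print(links)
```

This yields the same deduplicated, host-stripped link list as the shell pipeline writes to links.txt.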