Hello, PHP-free world
All my websites (skowron.ski, skowronski.pro and skowronski.tech, which replaces blog.dsinf.net) are PHP-free. They are built with Hugo and hosted on S3, with the only dynamic content being the plausible.io rewrite handled by Cloudflare Worker scripts.
A quick braindump on how to export WordPress:
- first, run https://github.com/SchumacherFM/wordpress-to-hugo-exporter from the CLI and download the ZIP it creates
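If the plugin sits in the usual plugins directory, the CLI invocation looks roughly like this (a sketch only; the exact path and entry point are assumptions, check the repo's README):

```bash
# assumed layout: plugin unpacked into wp-content/plugins/, run from the WordPress root
php wp-content/plugins/wordpress-to-hugo-exporter/hugo-export-cli.php
```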
- extract the articles to the relevant Hugo path
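A minimal sketch of that step, assuming the exported ZIP keeps articles under posts/ and the Hugo site uses content/posts/ (both paths are assumptions, adjust to your layout):

```bash
# unpack the exporter's ZIP and drop the articles into the Hugo content tree
unzip hugo-export.zip -d hugo-export/
cp -v hugo-export/posts/*.md content/posts/
```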
- remove drafts and perform some basic cleanups (like ‘category: no category’)
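For the drafts part, a minimal sketch, assuming the exporter marks drafts with draft: true in the front matter (the category line matches the example above):

```bash
# delete exported drafts, then strip the placeholder category line in place
grep -l 'draft: true' *.md | xargs rm -v
gsed -i '/category: no category/d' *.md
```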
- to convert the mess my set of image plugins made, I wrote the script below; it fixes one exported markdown article at a time:
# example of the exported markdown this script converts:
# [<img decoding="async" loading="lazy" src="http://blog.dsinf.net/wp-content/uploads/2012/08/rondo1-300x225.png" alt="rondo1" width="300" height="225" class="alignnone size-medium wp-image-191" srcset="https://blog.dsinf.net/wp-content/uploads/2012/08/rondo1-300x225.png 300w, https://blog.dsinf.net/wp-content/uploads/2012/08/rondo1.png 640w" sizes="(max-width: 300px) 100vw, 300px" />][1]
import re, sys

with open(sys.argv[1], "r") as file:
    lines = file.readlines()

for i in range(len(lines)):
    l = lines[i]
    # grab the whole <img ... /> tag if the line has one
    html_matches = re.search('(<img .*? />)', l)
    if html_matches:
        html = html_matches.groups()[0]
        print(html)
        alt_matches = re.search('alt="(.*?)"', html)
        if alt_matches:
            alt = alt_matches.groups()[0]
        else:
            alt = ""
        srcset_matches = re.search('srcset="(.*?)"', html)
        src_matches = re.search('src="(.*?)"', html)
        if srcset_matches:
            # pick the shortest URL from srcset - that's the original file
            # without the WxH thumbnail suffix WordPress appends
            srcset = srcset_matches.groups()[0]
            src_descs = srcset.split(", ")
            src = src_descs[0].split(" ")[0]
            for src_desc in src_descs:
                candidate_src = src_desc.split(" ")[0]
                if len(candidate_src) < len(src):
                    src = candidate_src
        elif src_matches:
            src = src_matches.groups()[0]
        else:
            continue
        # replace the HTML tag with plain markdown image syntax
        md = "![" + alt + "](" + src + ")"
        l = l.replace(html, md)
        lines[i] = l
        print(l)

with open(sys.argv[1], "w") as file:
    file.writelines(lines)
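To run it over every exported article, the same loop pattern as later with extract_links.py works; fix_images.py is just a placeholder for whatever name you save the script under:

```bash
for file in *.md; do python3 fix_images.py "$file"; done
```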
- after that, replace all URLs pointing to the old site (full URIs with FQDN) with local ones (e.g. using
gsed -i 's|https://blog.dsinf.net/wp-content/uploads/|/wp-content/uploads/|g' *
)
- locate all files referenced from markdown; the script I wrote for that parses one file at a time from markdown to XML and extracts all image sources and link targets:
# https://stackoverflow.com/a/30737066/12297075
import sys
import markdown
from lxml import etree

with open(sys.argv[1], "r") as file:
    lines = file.readlines()

urls = []
for line in lines:
    try:
        # render a single markdown line to HTML and parse it as XML
        doc = etree.fromstring(markdown.markdown(line))
        for a in doc.xpath('//a'):
            urls.append(a.get('href'))
        for img in doc.xpath('//img'):
            urls.append(img.get('src'))
    except Exception:
        # lines that don't render to one well-formed element are skipped
        pass

# keep only local upload paths; skip elements that had no href/src at all
for url in urls:
    if url and url.startswith('/wp-content/uploads/'):
        print(url)
- run extract_links.py against all files:
for file in *.md; do python3 extract_links.py "$file"; done | sort | uniq | tee urls.txt
- for all missing files, attempt a copy from hugo-export:
# start in dir with posts
cd ../../../
mkdir -p 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
for y in *; do mkdir -p $y/01 $y/02 $y/03 $y/04 $y/05 $y/06 $y/07 $y/08 $y/09 $y/10 $y/11 $y/12; done
cd -
for file in `cat urls.txt`; do cp -rv "../../../../../exports/attemp_01/hugo-export/$file" "../../../$file" ; done | tee log.txt
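To see which referenced files still didn't make it after that copy, a quick check of urls.txt against the disk helps (paths mirror the copy loop above; still-missing.txt is an ad-hoc name):

```bash
# list every referenced upload that still has no file on disk
for url in `cat urls.txt`; do
  [ -f "../../../$url" ] || echo "$url"
done | tee still-missing.txt
```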
- OLD: and move them from the exported path to ./static/wp-content/uploads/
- finally, verify there are no broken links, for example using htmltest on the Hugo-generated public/ dir, or https://github.com/stevenvachon/broken-link-checker and something like:
curl http://127.0.0.1:1313/sitemap.xml | grep '<loc>' | sed 's/.*<loc>//g' | sed 's/<\/loc>//g' | sort | uniq > links.txt
gsed -i 's|localhost|127.0.0.1|g' links.txt
for link in `cat links.txt | grep '/20[0-9][0-9]'`; do blc $link --host-requests 512 --requests 512 --exclude https://skowron.ski/ --exclude https://skowronski.pro/ --exclude https://linkedin.com/ --exclude https://blog.dsinf.net/ --exclude https://foto.dsinf.net/ --exclude https://keybase.io/dskowronski --exclude https://gohugo.io/ --exclude https://nunocoracao.github.io/blowfish/ --exclude https://github.com/danielskowronski/; done | tee ../../../../../results.txt
cat ../../../../../results.txt | grep "BROKEN" | awk '{print $2}' | grep "http://127.0.0.1" | sed 's|http://127.0.0.1:1313/wp-content/uploads/||g' > missing.txt
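With missing.txt in hand, one option is to feed it back into the same copy loop; a sketch, assuming it runs from the Hugo site root and reuses the export layout from above:

```bash
# missing.txt holds paths relative to wp-content/uploads/, so rebuild both sides
for f in `cat missing.txt`; do
  cp -v "exports/attemp_01/hugo-export/wp-content/uploads/$f" "static/wp-content/uploads/$f"
done
```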