I write an extensive personal knowledge base using markdown, code-server and a variety of other tools. Originally, in 2021, I wanted to have something like Obsidian Publish but self-hosted, so I created it.
Over time my knowledge base evolved more into a second brain, tracking not only my technical notes and journal, but also things like recipes and hikes. Along with it, my publishing pipeline, and the script at its core, grew in a multitude of ways.
The pipeline
The CI pipeline doing the publishing is rather simple:
On every new commit GitLab CI runs the publishing script, placing a subset of my notes into a hugo folder structure. That hugo folder then gets built using hugo and published on GitLab Pages.
Things to note
- Because the publishing script needs the full git history of each file, I had to explicitly tell GitLab CI to clone the full repository.
- As suggested by the GitLab Docs, I pre-compress the generated site using `gzip`.
- I split the publishing script and hugo into separate jobs, since that makes it easier to use specific python & hugo container images. Using `artifacts` and `needs`, the two jobs are then tied together to make it all work.
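
To make the pre-compression step more concrete, here is a minimal Python sketch of what the `find ... -exec gzip -f -k` call in the job does: write a `.gz` copy next to every compressible file and keep the original. The function name `precompress` and the `COMPRESSIBLE` set are made up for this illustration; the pipeline itself only uses the shell command.

```python
import gzip
import shutil
from pathlib import Path

# File extensions worth compressing, mirroring the regex used in the CI job
COMPRESSIBLE = {".htm", ".html", ".txt", ".text", ".js", ".css"}


def precompress(site_root):
    """Write a .gz sibling next to every compressible file, keeping the original."""
    written = []
    # sorted() materialises the file list first, so new .gz files are not revisited
    for path in sorted(Path(site_root).rglob("*")):
        if path.is_file() and path.suffix in COMPRESSIBLE:
            target = path.with_name(path.name + ".gz")
            with open(path, "rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

Calling `precompress("public")` after the hugo build would leave `index.html` in place and add `index.html.gz` beside it, which is the layout GitLab Pages looks for when serving compressed assets.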
Full configuration
collect_notes:
  stage: build
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  image: python:3.9-bullseye
  variables:
    # Workaround to get full repository history, see https://gitlab.com/gitlab-org/gitlab/-/issues/292470#note_599414923
    GIT_STRATEGY: clone
    GIT_DEPTH: 0
  script:
    # - pip3 install -r .ci-requirements
    - pip install python-frontmatter
    - python .scripts/publish_brain.py --notes --hikes
  artifacts:
    paths:
      - .publish/content/notes
      - .publish/static/img/notes
      - .publish/content/hikes
      - .publish/static/static/hikes

pages:
  stage: deploy
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  needs:
    - job: collect_notes
      artifacts: true
  image: registry.gitlab.com/pages/hugo:0.114.1
  variables:
    GIT_SUBMODULE_STRATEGY: recursive
  script:
    - cd .publish
    - hugo
    - mv public ../
    - cd ..
    # Generate compressed versions of all relevant resources
    # See https://docs.gitlab.com/ee/user/project/pages/introduction.html#serving-compressed-assets
    - find public -type f -regex '.*\.\(htm\|html\|txt\|text\|js\|css\)$' -exec gzip -f -k {} \;
    # TODO: Make brotli available in this image
    # - find public -type f -regex '.*\.\(htm\|html\|txt\|text\|js\|css\)$' -exec brotli -f -k {} \;
  artifacts:
    paths:
      - public
The publishing script
I added the full publishing script below. At its core it works like this:
For a specific set of folders, find all notes marked `published: True` in frontmatter, then transform and enrich their frontmatter data, like title, modification date and so on.
Also find links to embedded files, mostly images, place those files in the right place for publication, and rewrite their paths in the markdown.
Finally, write out the transformed markdown files for publishing.
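
The link rewriting step is probably the least obvious one, so here is a small standalone sketch of it. It uses the same image-embed pattern as the script, but the helper name `rewrite_asset_links` and the sample paths are made up for this example, and unlike the real script it does not check whether the referenced file exists.

```python
import re

# Same pattern the script uses for markdown image embeds: ![alt](target)
ASSET_PATTERN = re.compile(r"!\[.*?\]\((.*?)\)")


def rewrite_asset_links(content, note_dir, uri_base):
    """Return (new_content, local_refs) with local embeds pointed at uri_base."""
    local_refs = []
    for ref in ASSET_PATTERN.findall(content):
        # Assets hosted elsewhere are left untouched
        if ref.startswith("http"):
            continue
        # Published location: URI base + note folder + original relative reference
        published = f"{uri_base}{note_dir}/{ref}"
        content = content.replace(f"({ref})", f"({published})")
        local_refs.append(ref)
    return content, local_refs


sample = "Intro ![map](img/map.png) and ![logo](https://example.com/logo.png)"
print(rewrite_asset_links(sample, "hikes/2023", "/static/"))
# → ('Intro ![map](/static/hikes/2023/img/map.png) and ![logo](https://example.com/logo.png)', ['img/map.png'])
```

Only the local embed is rewritten and recorded; the remote one passes through unchanged.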
Things to note
- The class `NotePublisher` at the core of the script is quite configurable, because I plan to use it for more things in the future. Its instantiation is also the only place with any hard-coded values.
- The script expects files to be tracked by git. It will probably break if they are not, see `def set_dates(note, note_file):`.
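
If the git dependency ever becomes a problem, one possible alternative is to call git without a shell and treat empty output as "no dates available". The sketch below is an assumption about how that could look, not what the script does: `parse_dates` and `git_dates` are hypothetical names, git is assumed to be on the `PATH`, and unlike the script it reports the newest commit as the modification date even for files that were only ever added.

```python
import subprocess


def parse_dates(log_output):
    """Return (created, last_modified) from newest-first `git log --format=%as` output."""
    dates = [line.strip() for line in log_output.splitlines() if line.strip()]
    if not dates:
        # Untracked file or no history: the caller can decide what to do
        return None, None
    return dates[-1], dates[0]


def git_dates(note_file):
    """Look up a file's first and latest commit dates, or (None, None) if git has none."""
    result = subprocess.run(
        # Passing a list avoids the shell, so spaces in file names are harmless
        ["git", "log", "--follow", "--format=%as", "--", str(note_file)],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_dates(result.stdout)
```

Splitting the parsing from the git call keeps the date logic testable without a repository.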
Full script
#!/usr/bin/env python3
"""Extract a subset of notes based on frontmatter,
then transform and place them to be published by hugo

Required packages:
- python-frontmatter
"""
import argparse
import os
import re
import shutil
from pathlib import Path

import frontmatter as fm


def set_title(note):
    """Set the note's title"""
    # Find the first h1, remembering its text and position for later
    content_list = note.content.splitlines()
    title = ""
    title_index = None
    for index, line in enumerate(content_list):
        if line.startswith("# "):
            # Strip only the leading marker, so a "# " inside the title survives
            title = line[2:]
            title_index = index
            break

    # Remove the h1 from the note
    if title_index is not None:
        del content_list[title_index]
        note.content = "\n".join(content_list)

    # If no title is set in frontmatter take the h1 from earlier and set it as the title
    if not note.metadata.get("title"):
        note.metadata["title"] = title

    return note


def set_dates(note, note_file):
    """Set the note's date & lastmod"""
    # Add date to metadata based on git information if it is not set
    if not note.metadata.get("date"):
        # This is extremely hacky and would break if dates get a character longer
        # The path is quoted so file names containing spaces survive the shell
        creation_date = os.popen(
            f"git log --diff-filter=A --follow --format=%as -- '{note_file}' | tail -1"
        ).read()[0:10]
        if len(creation_date) >= 10:
            note.metadata["date"] = creation_date

    # Add lastmod to metadata based on git information if it is not set
    if not note.metadata.get("lastmod"):
        # This is extremely hacky and would break if dates get a character longer
        modification_date = os.popen(
            f"git log --diff-filter=M --follow --format=%as -- '{note_file}' | head -1"
        ).read()[0:10]
        if len(modification_date) >= 10:
            note.metadata["lastmod"] = modification_date
            print(f"Added lastmod date: {modification_date}")

    return note


class NotePublisher:
    def __init__(self, note_folders, destination_folder, asset_folder, asset_uri_base, dynamic_type=True, asset_properties=None):
        self.note_folders = note_folders
        self.destination_folder = destination_folder
        self.asset_folder = asset_folder
        self.asset_uri_base = asset_uri_base
        self.dynamic_type = dynamic_type
        self.asset_properties = asset_properties if asset_properties else list()

        os.makedirs(self.destination_folder, exist_ok=True)
        os.makedirs(self.asset_folder, exist_ok=True)

    def publish(self):
        """Publish all notes"""
        for note_path, note in self.get_notes():
            # Construct the final location for the file in the destination
            note_destination = self.destination_folder / str(note_path)

            # Create the destination folder recursively if it is missing
            os.makedirs(note_destination.parent, exist_ok=True)

            # Dump the processed note to the destination
            with open(note_destination, "w+") as destination:
                destination.write(fm.dumps(note))

            # Process required assets
            for asset_src in note.metadata.get("required_assets", []):
                asset_dst = self.asset_folder / asset_src
                os.makedirs(asset_dst.parent, exist_ok=True)
                shutil.copyfile(asset_src, asset_dst)

    def get_notes(self):
        """Generator yielding paths and processed notes for publishing"""
        for note_folder in self.note_folders:
            # Match only files with an .md extension
            for item in note_folder.rglob("*.md"):
                if not item.is_file():
                    continue

                with open(item) as f:
                    note = fm.load(f)

                # Skip notes that should not be published
                if not note.metadata.get("published", False):
                    continue

                print(f"Publishing {item}")

                # Find the h1 and remove it from the file
                note = set_title(note)

                # Add a dynamic type so we can filter what notes are shown on the index by hugo
                # This dynamic type is the first folder name relative to the project
                if self.dynamic_type:
                    note.metadata["type"] = item.parts[0]

                # Add an empty summary on hike types as a workaround for not showing text on the list view
                # .get() avoids a KeyError for notes that have no type at all
                if note.metadata.get("type") == "hike":
                    note.metadata["summary"] = " "

                note = set_dates(note, item)

                # Extract and rewrite asset links
                note.metadata["required_assets"] = list()
                asset_refs = re.findall(r"!\[.*?\]\((.*?)\)", note.content)
                for ref in asset_refs:
                    # If the asset is hosted elsewhere we don't touch it
                    if ref.startswith("http"):
                        continue

                    # Construct a full path to the referenced asset
                    asset_path = item.parent / ref

                    # Skip if the asset does not exist
                    if not asset_path.is_file():
                        print(f"Skipping asset, not found at {asset_path}")
                        continue

                    # Record the asset's full source path in metadata
                    note.metadata["required_assets"].append(str(asset_path))

                    # Update the path to the asset in the note to what it will be when published
                    old_ref = f"({ref})"
                    new_ref = f"({self.asset_uri_base + str(asset_path)})"
                    print("Rewriting asset path", old_ref, new_ref)
                    note.content = note.content.replace(old_ref, new_ref)

                # Rewrite properties containing assets
                for asset_property in self.asset_properties:
                    ref = note.metadata.get(asset_property, None)

                    # If the property is not defined, skip it
                    if not ref:
                        continue

                    # If the asset is hosted elsewhere we don't touch it
                    if ref.startswith("http"):
                        continue

                    # Construct a full path to the referenced asset
                    asset_path = item.parent / ref

                    # Skip if the asset does not exist
                    if not asset_path.is_file():
                        print(f"Skipping asset, not found at {asset_path}")
                        continue

                    # Record the asset's full source path in metadata
                    note.metadata["required_assets"].append(str(asset_path))

                    # Update the path to the asset in the property
                    new_ref = self.asset_uri_base + str(asset_path)
                    print("Rewriting asset path in property", asset_property, ref, new_ref)
                    note.metadata[asset_property] = new_ref

                yield item, note


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--notes", action="store_true", default=False, help="Process notes subset")
    parser.add_argument("--hikes", action="store_true", default=False, help="Process hikes subset")
    args = parser.parse_args()

    # All paths assume our current working directory is at the root of the project
    BASE_PATH = Path(".")

    if args.notes:
        print("--- Publishing notes ---")
        publisher = NotePublisher(
            note_folders=(
                BASE_PATH / "technology",
                BASE_PATH / "misc",
            ),
            destination_folder=BASE_PATH / ".publish/content/notes/",
            asset_folder=BASE_PATH / ".publish/static/img/notes/",
            asset_uri_base="/img/notes/",
            dynamic_type=True,
        )
        publisher.publish()

    if args.hikes:
        print("--- Publishing hikes ---")
        publisher = NotePublisher(
            note_folders=(BASE_PATH / "hikes",),
            destination_folder=BASE_PATH / ".publish/content/",
            asset_folder=BASE_PATH / ".publish/static/static/",
            asset_uri_base="/static/",
            dynamic_type=False,
            asset_properties=("gpx",),
        )
        publisher.publish()
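
The doubled `static/static` in the hikes configuration looks odd at first. The sketch below traces one asset through the same path arithmetic the script uses, to show where the file gets copied and under which URI the note then references it. `asset_targets` is a hypothetical helper and `hikes/2023/tour.md` a made-up note path; `PurePosixPath` is used only so the output is the same on every platform.

```python
from pathlib import PurePosixPath


def asset_targets(note_path, ref, asset_folder, asset_uri_base):
    """Return (copy_destination, published_uri) for one asset referenced by a note."""
    # The asset path is relative to the project root, like the note itself
    asset_path = PurePosixPath(note_path).parent / ref
    # Where the file is copied to inside the hugo folder structure
    destination = PurePosixPath(asset_folder) / asset_path
    # What the note links to once published
    uri = asset_uri_base + str(asset_path)
    return str(destination), uri


print(asset_targets("hikes/2023/tour.md", "track.gpx", ".publish/static/static", "/static/"))
# → ('.publish/static/static/hikes/2023/track.gpx', '/static/hikes/2023/track.gpx')
```

Hugo serves the contents of its `static` folder from the site root, so a file under `.publish/static/static/hikes` ends up at `/static/hikes`, which matches both the `asset_uri_base` and the artifact path in the CI configuration.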