My Publishing Pipeline


I write an extensive personal knowledge base using markdown, code-server and a variety of other tools. Originally, in 2021, I wanted to have something like Obsidian Publish but self-hosted, so I created it.

Over time my knowledge base evolved into more of a second brain, tracking not only my technical notes and journal but also things like recipes and hikes. Along the way my publishing pipeline, and the script at its core, was extended in a multitude of ways.

The pipeline

The CI pipeline doing the publishing is rather simple:

On every new commit GitLab CI runs the publishing script, placing a subset of my notes into a hugo folder structure. That hugo folder then gets built using hugo and published on GitLab Pages.

Things to note

  • Because the publishing script needs the full git history of each file, I had to explicitly tell GitLab CI to clone the full repository.
  • As suggested by the GitLab Docs, I pre-compress the generated site using gzip.
  • I split the publishing script and hugo into separate jobs, since that makes it easier to use specific python & hugo container images. The two jobs are then tied together using artifacts and needs.
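
The pre-compression step could also be sketched in Python instead of find and gzip. This is only an illustration, not what the pipeline runs: precompress is a hypothetical helper, and the extension list is copied from the find command in the pages job.

```python
import gzip
import shutil
from pathlib import Path

# Extensions mirror the regex used by the find command in the pages job
COMPRESSIBLE = {".htm", ".html", ".txt", ".text", ".js", ".css"}


def precompress(root):
    """Write a .gz copy next to every compressible file below root, keeping the original."""
    written = []
    # sorted() materialises the listing first, so newly written .gz files are not revisited
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in COMPRESSIBLE:
            target = path.with_name(path.name + ".gz")
            with open(path, "rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

Calling precompress("public") after the hugo build would leave index.html and index.html.gz side by side, which is the layout GitLab Pages looks for when serving compressed assets.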

Full configuration

collect_notes:
  stage: build
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  image: python:3.9-bullseye
  variables:
    # Workaround to get full repository history, see https://gitlab.com/gitlab-org/gitlab/-/issues/292470#note_599414923
    GIT_STRATEGY: clone
    GIT_DEPTH: 0

  script:
    # - pip3 install -r .ci-requirements
    - pip install python-frontmatter
    - python .scripts/publish_brain.py --notes --hikes
  artifacts:
    paths:
      - .publish/content/notes
      - .publish/static/img/notes
      - .publish/content/hikes
      - .publish/static/static/hikes

pages:
  stage: deploy
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  needs:
    - job: collect_notes
      artifacts: true
  image: registry.gitlab.com/pages/hugo:0.114.1
  variables:
    GIT_SUBMODULE_STRATEGY: recursive
  script:
    - cd .publish
    - hugo
    - mv public ../
    - cd ..
    # Generate compressed versions of all relevant resources
    # See https://docs.gitlab.com/ee/user/project/pages/introduction.html#serving-compressed-assets
    - find public -type f -regex '.*\.\(htm\|html\|txt\|text\|js\|css\)$' -exec gzip -f -k {} \;
    # TODO: Make brotli available in this image
    # - find public -type f -regex '.*\.\(htm\|html\|txt\|text\|js\|css\)$' -exec brotli -f -k {} \;
  artifacts:
    paths:
      - public

The publishing script

I added the full publishing script below. At its core it works like this:

For a specific set of folders, find all notes marked published: True in frontmatter, then transform and enrich their frontmatter data, such as title and modification date. Also find links to embedded files, mostly images, place those files in the right location for publication, and rewrite their paths in the markdown. Finally, write out the transformed markdown files for publishing.
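
The link rewriting step can be sketched in isolation. This is a simplified illustration under a few assumptions: rewrite_asset_links is a hypothetical helper, paths are plain strings rather than pathlib objects, and the check that the asset actually exists on disk is left out.

```python
import re

# Matches markdown image embeds like ![alt](path) and captures the path
ASSET_LINK = re.compile(r"!\[.*?\]\((.*?)\)")


def rewrite_asset_links(content, note_dir, asset_uri_base):
    """Return the rewritten content and the list of local asset paths it needs."""
    required = []
    for ref in ASSET_LINK.findall(content):
        # Assets hosted elsewhere are left untouched
        if ref.startswith("http"):
            continue
        # The asset path is relative to the folder the note lives in
        asset_path = f"{note_dir}/{ref}"
        required.append(asset_path)
        # Point the link at the location the asset will have once published
        content = content.replace(f"({ref})", f"({asset_uri_base}{asset_path})")
    return content, required
```

For a note in technology that embeds img/a.png, calling rewrite_asset_links(content, "technology", "/img/notes/") turns the link into /img/notes/technology/img/a.png and reports technology/img/a.png as a file to copy.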

Things to note

  • The class NotePublisher at the core of the script is quite configurable, because I plan to use it for more things in the future. Its instantiation is also the only place with any hard-coded values.
  • The script expects files to be tracked by git. For untracked files the git lookups in set_dates(note, note_file) return nothing, so date and lastmod are not set unless the frontmatter already provides them.
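
A sturdier variant of the git lookup could avoid the shell pipeline entirely. The following is a sketch, not the script's actual code: pick_date and git_date are hypothetical names, and the only assumption about git is that --format=%as prints one date per line, newest commit first.

```python
import subprocess


def pick_date(log_output, oldest=False):
    """Pick the oldest or newest date from git log output (newest first)."""
    dates = log_output.split()
    if not dates:
        # Untracked file or no matching commit
        return None
    return dates[-1] if oldest else dates[0]


def git_date(note_file, diff_filter, oldest=False):
    """Ask git for the dates of commits that added (A) or modified (M) note_file."""
    result = subprocess.run(
        ["git", "log", f"--diff-filter={diff_filter}", "--follow",
         "--format=%as", "--", str(note_file)],
        capture_output=True,
        text=True,
    )
    return pick_date(result.stdout, oldest)
```

Here git_date(path, "A", oldest=True) plays the role of the tail -1 lookup and git_date(path, "M") the role of head -1. Passing the arguments as a list means file names with spaces or quotes need no escaping, and a None result makes the untracked case explicit.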

Full script

#!/usr/bin/env python3
"""Extract a subset of notes based on frontmatter,
then transform and place them to be published by hugo

Required packages:
- python-frontmatter
"""
import argparse
import frontmatter as fm
import os
import re
import shutil

from pathlib import Path

def set_title(note):
    """Set the notes title"""
    # Find and remove the first h1, remembering it for later
    content_list = note.content.splitlines()
    title = ""
    for index, line in enumerate(content_list):
        if line.startswith("# "):
            # Strip only the leading marker so a later "# " in the title survives
            title = line[2:].strip()
            # Remove the h1 from the note by position
            del content_list[index]
            break

    note.content = "\n".join(content_list)

    # If no title is set in frontmatter take the h1 from earlier and set it as the title
    if not note.metadata.get("title"):
        note.metadata["title"] = title

    return note

def set_dates(note, note_file):
    """Set the notes date & lastmod"""

    # Add date to metadata based on git information if it is not set
    if not note.metadata.get("date"):
        # Hacky: relies on %as printing a 10 character YYYY-MM-DD date.
        # The path is quoted so file names containing spaces survive the shell.
        creation_date = os.popen(
            f"git log --diff-filter=A --follow --format=%as -- '{note_file}' | tail -1"
        ).read()[0:10]
        if len(creation_date) >= 10:
            note.metadata["date"] = creation_date

    # Add lastmod to metadata based on git information if it is not set
    if not note.metadata.get("lastmod"):
        # Same caveats as above; head -1 picks the most recent modification
        modification_date = os.popen(
            f"git log --diff-filter=M --follow --format=%as -- '{note_file}' | head -1"
        ).read()[0:10]
        if len(modification_date) >= 10:
            note.metadata["lastmod"] = modification_date
            print(f"Added lastmod date: {modification_date}")

    return note

class NotePublisher:

    def __init__(self, note_folders, destination_folder, asset_folder, asset_uri_base, dynamic_type=True, asset_properties=None):
        self.note_folders = note_folders
        self.destination_folder = destination_folder
        self.asset_folder = asset_folder
        self.asset_uri_base = asset_uri_base
        self.dynamic_type = dynamic_type
        self.asset_properties = asset_properties if asset_properties else list()

        os.makedirs(self.destination_folder, exist_ok=True)
        os.makedirs(self.asset_folder, exist_ok=True)

    def publish(self):
        """Publish all notes"""
        for note_path, note in self.get_notes():
            # Construct the final location for the file in the destination
            note_destination = self.destination_folder / str(note_path)

            # Create the destination folder recursively if it is missing
            os.makedirs(note_destination.parent, exist_ok=True)

            # Dump the processed note to the destination
            with open(note_destination, "w+") as destination:
                destination.write(fm.dumps(note))

            # Process required assets
            for asset_src in note.metadata.get("required_assets", []):
                asset_dst = self.asset_folder / asset_src
                os.makedirs(asset_dst.parent, exist_ok=True)
                shutil.copyfile(asset_src, asset_dst)

    def get_notes(self):
        """Generator yielding paths and processed notes for publishing"""
        for note_folder in self.note_folders:
            for item in note_folder.rglob("*.md"):
                if item.is_file():
                    with open(item) as f:
                        note = fm.load(f)
                        # Skip notes that should not be published
                        if not note.metadata.get("published", False):
                            continue

                        print(f"Publishing {item}")

                        # Find the h1 and remove it from the file
                        note = set_title(note)

                        # Add a dynamic type so we can filter what notes are shown on the index by hugo
                        # This dynamic type is the first folder name relative to the project
                        if self.dynamic_type:
                            note.metadata["type"] = item.parts[0]

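                        # Add an empty summary on hike types as a workaround for not showing text on the list view
                        # Use .get() because "type" is missing when dynamic_type is off and frontmatter does not set it
                        if note.metadata.get("type") == "hike":
                            note.metadata["summary"] = " "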

                        note = set_dates(note, item)

                        # Extract and rewrite asset links
                        note.metadata["required_assets"] = list()
                        # Raw string so the regex escapes are not treated as string escapes
                        asset_refs = re.findall(r"!\[.*?\]\((.*?)\)", note.content)
                        for ref in asset_refs:
                            # If the asset is hosted elsewhere we don't touch it
                            if ref.startswith("http"):
                                continue

                            # Construct a full path to the referenced asset
                            asset_path = item.parent / ref

                            # Skip if the asset does not exist
                            if not asset_path.is_file():
                                print(f"Skipping asset, not found at {asset_path}")
                                continue

                            # Record the assets full source path in metadata
                            note.metadata["required_assets"].append(str(asset_path))

                            # Update the path to the asset in the note to what it will be when published
                            old_ref = f"({ref})"
                            new_ref = f"({self.asset_uri_base + str(asset_path)})"
                            print("Rewriting asset path", old_ref, new_ref)
                            note.content = note.content.replace(old_ref, new_ref)

                        # Rewrite properties containing assets
                        for asset_property in self.asset_properties:
                            ref = note.metadata.get(asset_property, None)

                            # If the property is not defined, skip it
                            if not ref:
                                continue

                            # If the asset is hosted elsewhere we don't touch it
                            if ref.startswith("http"):
                                continue

                            # Construct a full path to the referenced asset
                            asset_path = item.parent / ref

                            # Skip if the asset does not exist
                            if not asset_path.is_file():
                                print(f"Skipping asset, not found at {asset_path}")
                                continue

                            # Record the assets full source path in metadata
                            note.metadata["required_assets"].append(str(asset_path))

                            # Update the path to the asset in property
                            new_ref = self.asset_uri_base + str(asset_path)
                            print("Rewriting asset path in property", asset_property, ref, new_ref)
                            note.metadata[asset_property] = new_ref

                    yield item, note

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--notes", action="store_true", default=False, help="Process notes subset")
    parser.add_argument("--hikes", action="store_true", default=False, help="Process hikes subset")

    args = parser.parse_args()

    # All paths assume our current working directory is at the root of the project
    BASE_PATH = Path(".")

    if args.notes:
        print("--- Publishing notes ---")
        publisher = NotePublisher(
            note_folders=(
                BASE_PATH / "technology",
                BASE_PATH / "misc",
            ),
            destination_folder=BASE_PATH / ".publish/content/notes/",
            asset_folder=BASE_PATH / ".publish/static/img/notes/",
            asset_uri_base="/img/notes/",
            dynamic_type=True,
        )
        publisher.publish()

    if args.hikes:
        print("--- Publishing hikes ---")
        publisher = NotePublisher(
            note_folders=(BASE_PATH / "hikes",),
            destination_folder=BASE_PATH / ".publish/content/",
            asset_folder=BASE_PATH / ".publish/static/static/",
            asset_uri_base="/static/",
            dynamic_type=False,
            asset_properties=("gpx",),
        )
        publisher.publish()
