A Brief History of My Readwise Enrichment Pipeline
I read a lot. Like. A lot a lot.
And most of it is through an app called Readwise Reader. (Not an affiliate link. Possibly not even a recommendation. More on that maybe later?)
Without getting too deep down the rabbit hole, I think it’s the best currently available solution to the me-shaped set of problems, but I don’t find it delightful and I would abandon them the instant a better app came along.
That said, a better app hasn’t come along, doesn’t appear to be on the horizon, and the Reader app has been slowly getting better at my specific use cases for a couple years now. So it’s probably where I’m going to be for the foreseeable future.
I’m going to hand wave a lot of complexity about how I actually use Readwise because it’s not the point of the present conversation. The important thing to know is that eventually, through dark and internecine processes, some subset of documents come to their final rest in my Readwise archive.
I was going to do a header thing based on Genesis, but I think I’ve already done that at least twice. But you know what? It’s my blog. Ain’t nobody gonna tell me to stop making Bible jokes. So anyway…
First, there was Step Functions, and it was… Okay?
Wait.
fuck.
Is this going to make any sense if I don’t explain what it was I was even trying to do when I built the thing?
Okay. So for realsie reals. First, there was the “what was I even trying to do, actually?”
My first Read it Later app was Instapaper. And it was also fine. In some ways it was better. And I was glad to pay like, what, $20/yr? Then they decided to more than double the price to add nebulous “AI Features” at a time when I was not yet sold on the value of AI as a thing I would pay double for.
(Side note, I still won’t pay double for it. This may be hubris but I suspect I am better at building AI features than you, for any company whose main thing isn’t building AI features. Just give me an API or at least a way to bring my own keys and stop trying to charge me double for things I didn’t ask to buy.)
So, incensed, I migrated to Readwise Reader. Who also had AI features. Which were legitimately awful at the time. (Sorry Readwise folks. It’s true. And I emailed you about it. And your features got better, which I appreciate. And you let me bring my own key, which I also appreciate. So thank you.)
And the important thing about that is how Instapaper worked: everything I read ended up in my Instapaper archive, which meant at least a few years of backlogged “I looked at this, it was interesting” was sitting in there. Would I ever look at any of it again? Turns out, with the benefit of hindsight, no. I wouldn’t. But all of that baggage came along when I exported Instapaper into Readwise.
How on earth was I going to separate the signal from the noise?
So I built an Enrichment Pipeline
The basic idea was simple. Take the archived file, get ChatGPT to generate a summary of it. Use the summaries to determine what I wanted to keep. Because ain’t no way I was ever going to re-read all 500+ documents to do that myself. (narrator: this is foreshadowing.)
I explained a lot of this here.
V1 ran on AWS Step Functions and output markdown to S3, which I then moved into my Obsidian vault with rclone. And it was fine, except that I did a horrible job of ever actually looking at any of those markdown files. Not never. But not as often as I intended.
Turns out I had a lot of other mountains to move before I would be in a space (emotionally, mentally, physically) that would let me start to get value from them.
And then I got there. And then I realized: dumping all that into my Obsidian vault wasn’t actually the play. But I needed to do it, and to work with those files, to understand my real needs and my real constraints. So it was a valuable learning experience.
This is a recurring theme, and a methodology I stand by. Stop worrying about building the right thing. If it’s not obvious what that is, build anything, and the right thing will eventually make itself known.
A brief aside on Digital Sovereignty
I ain’t want Google to own all my data. There. I said it. Or Apple. Or Amazon. Or any of them what thinks they built the future.
They didn’t build it. Kids with beards in computer labs built it, and then FAANG stole it and Google sits on a throne of lies which would be awesome if they were still don’t be evil about it.
So you know what? I’m taking all my data and I’m GOING HOME. You want to play tetherball? You get your own pole, Sergey. You can’t come over and watch cartoons and eat pop tarts with me anymore. And also my dad could beat up your dad. At math.
And that’s why I don’t want to play in Amazon’s back yard
Which meant I needed to get my enrichment pipeline off Step Functions.
It was fine. It basically worked. But I wanted it to live on my own infrastructure.
Which Begat V2, n8n + git
I migrated my python + lambda + step function monstrosity to n8n, which has a perfectly cromulent self-hosted version. Except what I don’t have (even though I know I could) is a self-hosted equivalent to S3. I was going to need to rethink the rclone part. How would I get the enriched files from n8n to my computer?
Well. You know what else I’m self hosting? Forgejo!
So I set up a repo for my Readwise enrichment, figured out how to get Forgejo to commit the newly-enriched files to a branch and open a pull request for me to review.
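In case anyone wants to reproduce the V2 setup: Forgejo speaks the Gitea-compatible API, so the “open a pull request for me to review” part can be a single HTTP call out of n8n. A minimal sketch, where the host, repo path, and branch name are placeholders, not my actual setup:

```python
# Sketch: open a review PR on a self-hosted Forgejo instance.
# Forgejo exposes the Gitea-compatible API, so the endpoint is
# POST /api/v1/repos/{owner}/{repo}/pulls.
import json
from urllib import request

FORGEJO = "https://git.example.com"   # placeholder: your Forgejo host
REPO = "me/readwise-enrichment"       # placeholder: the enrichment repo


def pull_request_payload(branch: str, title: str) -> dict:
    """Build the JSON body for Forgejo's create-pull-request endpoint."""
    return {"base": "main", "head": branch, "title": title}


def open_pull_request(branch: str, title: str, token: str) -> None:
    """Open a PR merging `branch` into main, authenticated by API token."""
    body = json.dumps(pull_request_payload(branch, title)).encode()
    req = request.Request(
        f"{FORGEJO}/api/v1/repos/{REPO}/pulls",
        data=body,
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )
    request.urlopen(req)  # Forgejo responds with the created PR as JSON
```

The nice part of doing it this way is that the n8n workflow stays dumb: commit files to a branch, fire one request, and let the PR be the human-review checkpoint.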
And it worked great for about four days, until I noticed that sometimes, for reasons I still don’t understand, Readwise will consider a document “recently updated” and trigger a re-enrichment. And in so doing, it created a bunch of difficult-to-reconcile changes to files I was potentially actively editing in Obsidian.
However, because the universe has a sense of irony, I had a long weekend precipitated by a canceled work trip, which meant lots of unplanned time to rethink the entire thing. From scratch.
Which Begat V3, n8n + Readwise API
Remember that part where I said the Readwise Reader product has slowly gotten better? One of the ways it got better in the last six months is that they have webhooks now. I didn’t need to poll their API nightly to look for newly archived documents; I could respond to them nearly immediately.
You know what else they have? A native Obsidian plugin. That’ll just do the export directly. And is pretty good, actually, assuming that the data you want to enrich actually lives in Readwise. Which it didn’t. Until now.
You see, the thing I had originally started doing with my unexpectedly free long weekend was finally working with some of those files the V2 pipeline was creating. Putting my own thoughts and ideas in them.
As I was doing that, I developed a sense of what parts of the enrichment were actually useful. And it turns out, if I tilt my head slightly to the side and squint, I could cram that data back directly into the Reader data model.
When a document hits my archive, it triggers a webhook, which causes my self-hosted n8n to pick the document up and ship it to OpenAI to extract the key points. I find this super useful: a bulleted list of 3-5 main points goes a long way in a document that might take an hour to read.
It would be strictly better if I generated that list myself, but since I’m usually reading on an iPad mini in situations where it wouldn’t make sense to have a keyboard, the UI/UX on typing that list sucks, actually, and AI is good at this kind of summarization. So I let ChatGPT take a first stab at it.
I cram that back into the “document note” that every Reader document gets, which I’ve otherwise never used. Again, because the typing experience on the iPad sucks. Seriously. Great reading device. Do not ask me to input data. No. Shan’t.
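If you squint, the “key points into the document note” step is just two small functions. Here’s a hedged sketch: the note format is my own convention, and the Reader update endpoint and the `notes` field name are assumptions based on my reading of the v3 API docs, so verify against the real documentation before copying:

```python
# Sketch: render extracted key points and write them back to the
# Reader "document note". The key points themselves come from an
# earlier OpenAI call (omitted here).
import json
from urllib import request


def to_document_note(key_points: list[str]) -> str:
    """Render 3-5 extracted key points as a bulleted document note."""
    return "Key points:\n" + "\n".join(f"- {p}" for p in key_points)


def update_reader_note(document_id: str, note: str, token: str) -> None:
    """PATCH the note onto a Reader document. Endpoint and field name
    are assumptions about the v3 API, not documented gospel."""
    body = json.dumps({"notes": note}).encode()
    req = request.Request(
        f"https://readwise.io/api/v3/update/{document_id}/",  # assumed
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
    )
    request.urlopen(req)
```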
Then, if the document has highlights (and most of them do), I send each individual highlight back to OpenAI to generate a single one- or two-word tag based on just the highlight, with no additional document context. I attach that tag to the document, because the Readwise Reader API won’t let me attach it to the highlight itself. I suspect I could through the Readwise API, but the Readwise API appears to use a different ID to reference highlights than the Reader API does (at least based on the documentation), and bridging that gap looked difficult, so I don’t.
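One practical wrinkle with the tagging step: the model doesn’t always honor “one or two words,” so it helps to normalize whatever comes back before attaching it. This little normalizer is purely my own convention, nothing Readwise requires:

```python
# Sketch: normalize an AI-generated tag to a consistent shape
# (lowercase, punctuation stripped, at most two words, hyphenated)
# before attaching it to the Reader document.
def normalize_tag(raw: str, max_words: int = 2) -> str:
    """Lowercase, strip edge punctuation, keep at most two words."""
    words = [w.strip(".,;:!?\"'") for w in raw.lower().split()]
    words = [w for w in words if w][:max_words]
    return "-".join(words)
```

So “Machine Learning.” becomes `machine-learning`, and a rambling five-word answer gets truncated to its first two words instead of polluting the tag list.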
Hey. Readwise team. You know what would be cool? If the IDs are meant to be the same between the two APIs, document that somewhere. If they’re different, give me an API that will let me turn one into the other. Thx.
The other thing I do is generate a “summary” of each article, but that’s actually something Reader’s built-in AI will do, as long as you give it an OpenAI key. So I do, and it does a good job.
Now, all I have to do is export the summary, document note, and tags via the Readwise Obsidian plugin, and I get my enriched data with arguably fewer moving parts. And it works great.
Except…
Okay so it’s a webhook now. Which means it’s event driven. Which means the whole process only runs when… I… Archive… Oh no.
So that’s the other thing I’ve been doing all weekend. According to my time tracking, I spent at least six hours yesterday alone going through each and every document in my Reader archive and CULLING.
Remember the not re-reading all 500 documents? Y’all, I’ve culled so hard.
To be fair, that process mostly isn’t re-reading the entire document. But in many cases it’s at least opening it and reading the first and last few paragraphs to remind myself what it was about.
Fortunately, lots of stuff could just go. At least several tens of documents about COVID that are no longer especially relevant. Stuff from early AI that’s since been proven or disproven. Interests I’ve moved on from. Let’s say of the 500, probably 75-100 were easy first-round deletes.
The rest?
Open the document. Does it have highlights? Do I find those relevant and interesting? Send it back to the archive.
Do I remember what it’s about? Does it seem interesting? Maybe send it back for a re-read.
Delete it if there’s any question. My to-be-read is long, and I trust that things will find me again if I need them.
I’ve got about a third of the original documents left to go, and a decent system in place. The enrichment pipeline has been working great, and the exports are now landing in my Obsidian vault in a format that I really like, with data I find useful.
Not bad for a couple extra unplanned weekend days, huh?