My previous two blog posts — here and here — made the front page of Hacker News, driving over 20,000 new visitors to this blog. Welcome! I picked up a few new tricks (and spotted some mistakes I’d made) during the ensuing discussion, so I thought I’d share some of these here. Some of them might trigger mini side-investigations into...
I previously experimented with one-click LLM finetuning providers and now is a good time to return to the core of the matter: evaluating how well all these finetuned models and experiments are faring. I have a gut feeling that my finetuned models did pretty well, but we’re not in the business of gut feelings, so I’m hoping to be able to put some...
The last post in this series showed that finetuning an LLM needn’t be particularly difficult. I used axolotl to produce finetuned versions of Llama3, Mistral and TinyLlama models. During the course we were given a bunch of credits by various companies in the LLM and finetuning space. Among those were credits from some finetuning-as-a-service...
I am excited to announce the release of a new dataset on the Hugging Face Hub: the Afghanwire Dataset. This dataset is a comprehensive collection of translated Afghan media articles from May 2006 to September 2009, created by the Afghanwire media agency, which I co-founded with Felix Kuehn. During the years that Afghanwire...
If you’re reading this blog, you’ve probably visited the Hugging Face website and you’ve almost certainly tried out one of their ‘Spaces’. These are deployed mini-applications hosted on Hugging Face infrastructure. I’ve created Spaces of my own, and at work I added a way for people to quickly deploy a ZenML server as a ‘Space’. I love browsing all...
Yesterday I published two datasets to the Hugging Face Hub and I wanted to briefly add some context about them and what they might be useful for. TL;DR: I wrote a paper in 2011 that used international military forces’ press releases about military operations in Afghanistan to gain an understanding of what was happening on the ground. The paper was...
Yesterday I wrote about my MathsPrompt tool which serves up questions to help me practice new skills I’m learning as part of my mathematics degree. Today I realised that all the data (both autogenerated and copy-pasted) is being stored in my database and that I hadn’t really given much thought yet to ensuring a long life for that data. I put...
TL;DR: I built a little app in Rust to help me revise and practice mathematics exercises. I input questions I’ve already studied or completed, and the app autogenerates more questions via OpenAI’s GPT-4 API. All of this populates a database which is then queried to show me the questions and topic areas where I’m least confident. It’s just v0.1...
When working with Terraform code, you can take in user input at the time you apply whatever you’ve defined. To take a perhaps needlessly simple example, you might write a definition that allows you to deploy a new S3 bucket, but you probably wouldn’t want to hardcode the name of the new bucket; instead, you’d rather take that...
This is just a collection of various links and observations I came across while learning about tokenisation during the past week, which would otherwise have no home. NLTK and CLTK are two other NLP libraries from the pre-deep-learning era. CLTK has a focus on classical languages, but my sense is that NLTK maybe hasn’t kept pace as...
In this blog post I want to walk through how I trained my first tokenizer(s) on a small Balochi language corpus. I used the Hugging Face Tokenizers library, along with fastai and spaCy, to get a sense of the interfaces involved. There’s also some naive pre-processing I did to get the corpus into a format that the tokenizer could handle. I’m not sure if this is...
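To give a flavour of the interface, here’s a minimal sketch of training a BPE tokenizer with the Hugging Face Tokenizers library. The corpus filename, vocabulary size and special tokens below are placeholder assumptions for illustration rather than the exact settings from the post:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-pair-encoding tokenizer trained on a single plain-text file of
# Balochi sentences (the filename and settings here are assumptions).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=8000,  # assumed: a modest vocabulary for a small corpus
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(["balochi_corpus.txt"], trainer=trainer)
tokenizer.save("balochi-bpe.json")

print(tokenizer.get_vocab_size())  # quick sanity check on the learned vocabulary
```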
For the types of machine learning that involve neural networks, training generally means passing data and a set of weights into a function and then iteratively optimising those weights. We hope that by showing lots of examples of the right way to do things (as per our data and annotations) we’ll emerge with a model...
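To make that loop concrete, here’s a minimal PyTorch sketch, with a toy linear model and synthetic data standing in for a real network and dataset (both are my own illustrative assumptions): data and the current weights go into a function, we measure how wrong the output is, and we nudge the weights to reduce that error.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic "examples of the right way to do things": inputs and their targets.
X = torch.randn(256, 10)
y = X.sum(dim=1, keepdim=True)
dataloader = DataLoader(TensorDataset(X, y), batch_size=32)

model = nn.Linear(10, 1)                   # the weights we want to optimise
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):
    for xb, yb in dataloader:
        pred = model(xb)                   # pass data through the current weights
        loss = loss_fn(pred, yb)           # how far off are the predictions?
        optimizer.zero_grad()
        loss.backward()                    # gradients of the loss w.r.t. the weights
        optimizer.step()                   # nudge the weights to reduce the loss
```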
I’m working on building out some language models and utilities for the Balochi language. (Read previous posts in this series for the full context.) Even though there are an estimated 8-10 million speakers, it certainly counts as a ‘low-resource’ language. Many (most?) things that you’d take for granted when working with...
In thinking about my work to put together a language model or some utilities for the Balochi language, I spent a fair bit of time asking whether I should even start. At a very high level, we can look at the general risks that come from language models, as highlighted in the 2022 DeepMind paper entitled “Taxonomy of Risks posed by Language Models”...
Large Language Models are all the rage, but what do you do when the language you want to model is essentially unrepresented in the public datasets used for training? I have a few months before the start of my next maths module and I’d like to use the time in part to dive into the ins and outs of training your own language models from scratch. The...
I completed and submitted the final exam for the MU123 module that I’ve been studying since October last year. This module is the first step on my journey towards a BSc Mathematics degree from the Open University, something I’m really happy I have the time to do on the side of my full-time job. Mathematics was always something I enjoyed studying...