Large language models and Search

Microsoft’s made an aggressive and delightful splash in the search market by deeply integrating the guts of the large language model behind OpenAI’s ChatGPT with Bing search. There’s an impressive interview by Joanna Stern on the topic (hat tip to Daring Fireball for the link). There’s potential there that’s amazing, and potential that’s truly frightening.

The part that’s super-cool to me is that the model, under the covers, is a sort of translation from a text string to a point in a multi-dimensional space (_really_ big vectors of numbers), one that uses the surrounding text to build context and make a guess at the concept the word is representing. But most interestingly, these models aren’t built for only a single language. They can be constructed from concepts across a LOT of languages (which is how you get the amazing translation capabilities these days). One of the benefits of this, in the context of “trying to find something”, is that you don’t have to be constrained to content in a single language. A sort of multi-language WordNet database, which would be super-cool, if challenging.
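To make the “point in a multi-dimensional space” idea concrete, here’s a toy sketch. The hand-made three-dimensional vectors below are stand-ins (real models emit vectors with hundreds of dimensions, and you’d get them from a multilingual embedding model, not from a lookup) – but the mechanics of comparing them are the same: measure the cosine of the angle between the vectors.

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """How closely two embedding vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In a multilingual model, sentences with the same meaning land near each
# other regardless of language. These are pretend embeddings for illustration:
en = [0.9, 0.1, 0.3]          # "a fast HTTP client"
zh = [0.88, 0.12, 0.31]       # the same phrase, written in Mandarin
unrelated = [0.1, 0.9, -0.4]  # "a recipe for soup"

print(cosine_similarity(en, zh))         # near 1.0 — same concept
print(cosine_similarity(en, unrelated))  # much lower — different concept
```

The whole trick of embedding-based search is that “close in this space” approximates “similar in meaning”, and with a multilingual model that closeness holds across languages.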

I’ve no idea if Bing (or Google’s vague-ish LaMDA/Bard response) does this at all, but it was something even I briefly looked into. When I was doing a bit of work to help improve search at Swift Package Index, one of the things I noticed was that while _most_ of the documentation and READMEs were in English, there was a notable set in Mandarin. For the most part, those packages published a snippet or two in English to assist with any sort of “Is this what I want?” research, but a LOT of the detail was buried and opaque – you pretty much needed to cut and paste a bunch of content into a translation service to have a hope of even a loose translation. I took a bit of time and dug around in the multi-language models to see if there was something desktop-class usable that I could use to transform that content into something indexable that might be searched. Most of the latest hotness in language models are the kinds of things that take clusters of computers to train, let alone encode content with or run inference on. It turns out that’s a field of active research, but I didn’t find anything obviously available for use in that kind of format.
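The shape of what I was poking at looks roughly like this sketch – to be clear, this is not the Swift Package Index implementation, just the idea: embed each document once, store the vectors, and answer a query by nearest-neighbor lookup. The `embed` function and its `FAKE_EMBEDDINGS` table are stand-ins for real model inference, rigged so that the English and Mandarin descriptions of the same concept land near each other, the way a multilingual embedding model would place them.

```python
from math import sqrt

# Stand-in for a multilingual embedding model: same concept in different
# languages maps to nearby vectors. Entirely fabricated for illustration.
FAKE_EMBEDDINGS = {
    "a networking library for HTTP requests": [0.9, 0.1, 0.2],
    "用于 HTTP 请求的网络库": [0.88, 0.12, 0.22],  # same concept, in Mandarin
    "a charting and plotting toolkit": [0.1, 0.9, -0.3],
    "http networking": [0.85, 0.15, 0.25],          # a query
}

def embed(text: str) -> list[float]:
    # Hypothetical: a real version would run model inference here.
    return FAKE_EMBEDDINGS[text]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def search(query: str, docs: list[str]) -> str:
    """Return the indexed document whose embedding sits closest to the query's."""
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

docs = [
    "a networking library for HTTP requests",
    "用于 HTTP 请求的网络库",
    "a charting and plotting toolkit",
]
print(search("http networking", docs))  # matches an HTTP doc, English or Mandarin
```

The appealing part is that the query never gets translated at all – the Mandarin README can match an English query purely because both sit near the same point in the space.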

I’m not nearly as much of a fan of the synthesis based on a query, though. The results, well – they can lie with extreme confidence, and if you’re not doing any critical thinking about them, you’re right in a pile of muck. And critical thinking is something a whole bunch of people don’t seem to be too adept at these days. Suffice it to say, I think that’s a bit of an existential failure. And fundamentally, it feels like glorified computer-assisted plagiarism that’s somehow been anointed as acceptable.

So hopefully the good parts will be retained and used – and with any luck, there’ll be a corpus and model published somewhere, by someone, that’s not JUST in the hands of a mega-corporation, and that can be used to bolster search and information finding in some of these smaller corner areas. I’d love to see something like that working to help me find relevant libraries for a specific kind of task or technique within the Swift Package Index.

Published by heckj

Developer, author, and life-long student. Writes online at https://rhonabwy.com/.