
Most of us are learning about AI on the fly and just got started in the past year or two.
Paco Nathan has been working with AI since the 1980s and has been doing digital business nearly as long.
His background in both the technical and commercial sides of artificial intelligence gives him a unique perspective on the field that can help newcomers like me and you get oriented to this new landscape.
We talked about:
- his extensive history in the AI field, including work with some of the earliest chatbots
- how graphs can serve as a way to ground and contextualize unstructured content
- how content that is structured properly can help users and drive action
- the tech stack underlying the current generation of AI tools
- two technologies at the base level of the stack: sequence-to-sequence and diffusion
- the benefits of SSMs, small specialized models, over LLMs
- his take on the impact of LLM chat agents on content and editorial practice
- four take-homes from his recent immersion in AI conferences and gatherings:
- the superiority of small, specialized models (SSMs) over LLMs
- the issue of losing domain knowledge as experts age and retire
- the importance of using your own data sets
- the need for detailed task analysis as you begin building any AI model
- the contrasts and interplay between AI developments at large, well-funded entities like Alphabet, Meta, and Microsoft and the smaller, more diverse ecosystem around open-source AI projects
Paco’s bio
Paco Nathan is Managing Partner at Derwen, Inc., and author of Latent Space, along with other books, plus popular videos and tutorials about machine learning, natural language, and related topics. Known as a “player/coach”, with 40+ years of tech industry experience, ranging from Bell Labs to early-stage start-ups. Werner Herzog is his spirit animal.
Board member for Argilla.io; Advisor for KUNGFU.AI. Lead committer on PyTextRank, kglab. Formerly: Director, Community Evangelism for Apache Spark at Databricks.
Long, long ago, when the world was much younger, Paco led a media collective / indie bookstore / performance art space / large online community called FringeWare. Beginning in 1992, this was one of the first online bookstores and likely the first commercial use of a chat bot on the Internet.
Connect with Paco online
Video
Here’s the video version of our conversation:
Podcast intro transcript
This is the Content and AI podcast, episode number 2. You’d think from news stories and social media that AI is mostly about large language models like ChatGPT and big companies like Microsoft and Google. In fact, there’s a large, well-established community of open-source AI projects and a variety of technologies in addition to LLM-based chat agents. With more than 40 years of experience in artificial intelligence and in the tech business world, Paco Nathan is uniquely qualified to orient us in the current AI landscape.
Interview transcript
Larry:
Hey everyone, welcome to episode number two of the Content and AI Podcast. I’m really happy today to welcome to the show Paco Nathan. Paco, we could talk literally for 20 hours about this stuff we’re going to talk about today. But what Paco and I are going to talk about today will just kinda get you grounded in making sense of the AI ecosystem. Paco’s been doing this stuff forever. He studied AI back in, what, the 80s or something like that. Anyhow, welcome, Paco. Oh, and one last quick thing. Paco is the managing partner of Derwen.ai, his consulting company. So welcome, Paco. Tell the folks a little bit more about your work at Derwen.ai and some of your discoveries around AI lately.
Paco:
Oh, fantastic. Thank you very much, Larry, I really appreciate it. Yeah, Derwen, we’re really focused on open-source integration to support machine learning in general. But we focus a lot on natural language and graph technologies. And for what it’s worth, I got into doing graph work, which is how we met, I got into that because of natural language. I was working with a family of algorithms. There’s some research that had come out of, basically, taking raw text and being able to start to put structure into it, and turn it into a graph by using natural language.
Paco:
So we ended up using, like I say, these kinds of technologies, mostly in open source, for enterprise customers. Really, to help power, help them build knowledge graph applications, and now large language models have become very popular. And we’ve been doing this for a while.
Paco:
One of the projects I’m involved with, there is an open-source project called Argilla, based in Spain. I’m in Spain right now, actually. We started six years ago in natural language, when some of the first open-source large language models were coming out, LLMLP templates, things like that. Argilla has been doing a lot of those open-source integration paths with spaCy and other kinds of NLP projects. But using them in enterprise, like I say, for the past six years. It’s been interesting because, five years ago, we decided, we made a business decision to focus on large language models in enterprise business use cases. Back then, people would be like, “Language models? That seems very narrow. Why do you want to focus on this?”
Larry:
Yeah, but you’re not an “I told you so” kind of guy, but you still must have some fun with it. Yeah. Well, that’s great.
Larry:
And what you were just saying, too, about the natural language stuff and the graph stuff, there’s this … One of the things we’ve seen lately is the large language models, which are notoriously hallucination prone and not very bright, being merged with techniques like retrieval augmented generation to access a knowledge source, like a knowledge graph or something like that.
Paco:
Like a knowledge graph.
Larry:
Yeah. So that gets into the ecosystem part of this. It’s not ChatGPT all the time. As you just said, you’ve been doing this way before they came on the scene. I’d love to get just your quick overview of that ecosystem. There’s the natural language, ML flavored stuff, the knowledge representation stuff. What else is relevant, especially in terms of content practice, do you think?
Paco:
Well certainly, we can also talk about the ecosystem more, but let’s first focus on where the building blocks are. Obviously, a lot of people are interested in chat, we’ll touch on that later. You know, I’ve been working with chat apps for a long time. Going back to the early 80s, that’s what we used for our class projects. I used to TA a course that Andrew Ng eventually took over and made popular. But we would teach ELIZA to people doing chatbots, back in the early 80s.
Paco:
Yeah. Not all the world is chat. There is a lot of the world that has to do with text, and images, and video. And a lot of the text is structured in ways … For instance, we work a lot with manufacturers here in Europe. And you might think that manufacturing data is all about process controls, and factories, and automation. It’s not. The stuff that we work with, so much of the important data, is all PDF documents. Because you’ve got patent applications from your possible competitors, you’ve got environmental impact reports where your competitors might disclose things. You’ve got scientific papers being published that you have to keep up on. You’ve got your regulatory norms that you’re publishing to the European Commission, or whatever. And, just on, and on, and on. You end up with hundreds of millions of PDF documents.
Paco:
To be able to use those, you can’t really do much with that in a data lake, you have to process it. So you need to use NLP to extract out the information and the relationships. And then, the next step is gosh, this is all linked. The scientific papers are referencing things that are also in the patent applications, and that has a lot to do with our competitors’ factories. And by the way, if we’ve got thousands of vendors in a network, at the end of the day, you end up with a very large graph. This is how you make sense of it. This is how you rationalize it, is by grounding in a graph. Which is, like you say, with retrieval augmented generation, yeah, people are realizing, “A knowledge graph might be good for grounding our data.”
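To make the pipeline Paco describes a bit more concrete, here is a toy sketch in Python: extract entities from documents and link the ones that co-occur into a graph. Everything here is invented for illustration (the document text, the entity list, and the co-occurrence heuristic); a real pipeline would use an NLP library such as spaCy for entity extraction and a proper relation-extraction model.

```python
from collections import defaultdict

# Toy stand-in for a real NLP component: in practice you'd run an
# entity-recognition model rather than substring matching.
def extract_entities(text, known_entities):
    """Return the known entities that appear in this document."""
    return [e for e in known_entities if e.lower() in text.lower()]

def build_knowledge_graph(documents, known_entities):
    """Link entities that co-occur in the same document."""
    graph = defaultdict(set)
    for doc in documents:
        entities = extract_entities(doc, known_entities)
        for i, a in enumerate(entities):
            for b in entities[i + 1:]:
                graph[a].add(b)
                graph[b].add(a)
    return graph

docs = [
    "Patent EP-1234 covers a new electrolysis process from Acme GmbH.",
    "A recent paper benchmarks the Acme GmbH process against Beta Corp's furnace.",
]
kg = build_knowledge_graph(docs, ["Acme GmbH", "Beta Corp", "EP-1234"])
```

From a graph like this you can start to pull the thread: which patents, papers, and vendors connect to a given competitor.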
Larry:
Yeah. And the way you just talked about that, too. So much of the fuss in the content world is around generative AI, and just creating content. But you just talked about one use case for understanding what you’ve got already. There’s huge power in that. In fact, I just saw a paper the other day where somebody … I don’t know exactly which technology was at play there, but a lot of the technologies just take random 500-character or 500-word chunks of a document.
Paco:
Right.
Larry:
And this was a new technique to take the inherent, implied semantic meaning of headings in a PDF doc and do the chunking that way to get better results.
Paco:
Oh, yeah.
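As a rough illustration of the heading-aware chunking Larry mentions, here is a minimal Python sketch that splits a Markdown-style document at its headings instead of at arbitrary 500-word boundaries. The sample document and the heading convention are assumptions made for the sake of the example; the paper Larry describes worked on PDF headings.

```python
import re

def chunk_by_headings(markdown_text):
    """Split a document into chunks at heading boundaries, so each
    chunk keeps its own semantic context (vs. fixed-size windows)."""
    chunks = []
    current = []
    for line in markdown_text.splitlines():
        # Start a new chunk whenever a heading line appears.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Safety\nWear gloves.\n# Maintenance\nCheck the furnace weekly."
chunks = chunk_by_headings(doc)
```

Each chunk now carries the semantic label its author gave it, which tends to retrieve better than an arbitrary slice.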
Larry:
But that’s just an optimization of this kind of stuff. So there’s both the analytical understanding part of what you’ve got. When I look at a lot of those PDFs I’m like, “Oh, for crying out loud. Why didn’t you hire me 10 years ago?” I think this might get to what you’re talking about with those knowledge graphs, that having structure, and meaning, and semantic attributes of stuff – before you create it, that’s just a pet project of mine. How can AI help with that kind of stuff? Like workflows, maybe, around that.
Paco:
Yeah. Well, it really cuts both ways. And actually, I’m here in A Coruña doing a talk about this. It’s about this intersection of graphs and language for industry AI applications. So it’s really cutting both ways.
Paco:
On the one hand, it’s good to be using graphs to organize things because, when you think of ChatGPT or any of the chatbots, it’s kind of a flat experience. It’s just text in, text out. There’s a lot more structure. And especially if you’re in business, you would like to have your content lead to, for instance, maybe some conversions or some sort of action. That’s typically why we do content marketing. So you don’t want just this flat experience. If you’re generating something to use, it would be nice to have a lot of links behind that, and understand how they play together. What you’re trying to drive the audience toward might be some sort of conversion on a website, but there might also be some background information that’s useful to help them.
Paco:
So when you’re using AI tools to come up with some text, rather than just generating some flat text, why don’t you generate something that’s actually chunked out into a graph? And that way, they can pull the thread, and follow it for a lot more information later on. And that’s useful both for public messaging, but also for internal, for people understanding how to use work documents for the thing you need to do.
Paco:
So we’re seeing a shift there. There’s been a lot of really good info coming out of LlamaIndex. We’ll talk about them some more. But they’ve got some of the best tutorials online for, really, “Here’s 20 lines of Python that you can do a retrieval augmented generation with the knowledge graph as the grounding.” I think, more and more, we’re seeing a lot of that as being a path forward.
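Since most readers won’t be writing those 20 lines of LlamaIndex themselves, here is a from-scratch toy in plain Python that shows the shape of graph-grounded retrieval augmented generation: pull the facts connected to the question out of a small knowledge graph, then put them into the prompt so the model’s answer is grounded. The triples and entity names are invented for illustration, and a real system would hand the resulting prompt to a language model.

```python
# A tiny knowledge graph as (subject, predicate, object) triples.
knowledge_graph = {
    ("Acme GmbH", "filed", "Patent EP-1234"),
    ("Patent EP-1234", "covers", "electrolysis process"),
}

def retrieve_facts(question, graph):
    """Pull triples whose subject or object appears in the question."""
    return sorted(
        f"{s} {p} {o}" for (s, p, o) in graph
        if s.lower() in question.lower() or o.lower() in question.lower()
    )

def build_grounded_prompt(question, graph):
    """Ground the model by stuffing retrieved facts into the prompt."""
    facts = retrieve_facts(question, graph)
    context = "\n".join(f"- {f}" for f in facts)
    return f"Use only these facts:\n{context}\n\nQuestion: {question}"

prompt = build_grounded_prompt("What did Acme GmbH file?", knowledge_graph)
```

Libraries like LlamaIndex automate the retrieval and graph-building steps, but the underlying move is this: constrain generation to facts you can trace back to your own data.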
Larry:
Yeah. Yeah, and I think … Actually, as we talk about this, because I think there’s very few of my … I know one guy, a former poet, who now writes Python code in his content design practice. But that’s a small subset of my audience, I think, especially of content practitioners. So I think mostly, or very often, you’re going to be higher up in that stack. I think we’re both grounded in this stack, but I want to share it with the audience, too.
Larry:
There’s these technical underpinnings, the deep, underlying technologies that drive all the various kinds of AI. And there’s a layer above that, which is what, I guess business practices, and integration, and orchestration, and whatever.
Larry:
And then, the model building and all that stuff. And then, there’s the application layer, which I think most of my folks will probably be at that application layer, I’m thinking. But, I want to understand, and I hope others want to understand as well, what do we got under the hood there?
Paco:
Yeah. Let’s build up the stack real quick. So for the current generation of AI, when people talk about large language models, or diffusion, or a lot of what you’re seeing, both for the text based side, and for the image based side, and video, and now music and others, all of that, there’s really two fundamental things going on in terms of technology.
Paco:
One of those is we’ve suddenly become much, much better at what we would call sequence-to-sequence. So I’ve got a sequence of input, maybe it’s a mixture of text and video, or sound, or however. And then, I’ve got a sequence of output. And I can train so that, if you give me a partial on the input, I can fill out the details. If I start out a sentence, and I type two or three words, here’s the next most likely four ways of concluding that sentence. You can think of that, but it gets a lot more interesting if you start to mix media together. Like you have some images, and some sounds, and some text, and it’s like, “Okay, give me a complete picture. What am I actually looking at here? And describe it, break it down.”
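A crude way to see what “given a partial input, predict the continuation” means is a bigram model: count which word follows which, then greedily extend a prefix. This is many orders of magnitude simpler than the models Paco is describing, but it illustrates the sequence-to-sequence idea; the tiny corpus is invented for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count which word follows which -- the crudest possible
    'predict the continuation' model."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def complete(prefix, counts, max_words=3):
    """Greedily extend the prefix with the most likely next word."""
    words = prefix.lower().split()
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

corpus = ["the furnace needs maintenance", "the furnace runs hot"]
counts = train_bigrams(corpus)
```

Modern sequence-to-sequence models condition on the whole input (and on mixed media), not just the previous word, but the training objective has the same flavor.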
Paco:
So you can think of sequence-to-sequence as being this machine learning approach, but then it becomes really interesting when you start to say, “Okay, let’s break this into steps, and explain to me at each step, what are the exemplars? What are you referencing? What’s the prior thing that made a decision in step one?” That might be some really interesting evidence for the case that you’re making in whatever you’re writing. And then, chain step-by-step together to come out with your end result. So that’s interesting because it can be used to develop policy, for instance, and backup why you’re saying, “Take these steps.”
Paco:
There’s something called chain of thought prompting, which allows you to break out the answers of sequence-to-sequence into steps. Princeton did an iteration on that called tree of thought, like, “Let’s make a decision tree out of it.” And then, Zurich came out with something called graph of thoughts, where they’re like, “Let’s do contingency planning. Let’s look at the cost of calling the APIs needed for this, and let’s build out a graph and try to optimize the cost of generating our results. But if we get stuck at any point, we know we can backtrack and have a contingency.” I just saw another one from Stanford this morning that’s called abstract reasoning. No, no, no. Analogous reasoning. Where, in the prompt that you’re putting in, the model’s input, you’re actually embedding, “And here’s some comments about how I want you to structure the output.”
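Chain-of-thought prompting itself is just prompt construction, so it can be sketched without any model at all. The function below assembles a prompt from worked step-by-step examples followed by the real question; the example content is made up, and a real system would send the resulting string to a language model.

```python
def chain_of_thought_prompt(question, examples):
    """Build a prompt that shows worked step-by-step examples before
    the real question, nudging the model to reason in steps."""
    parts = []
    for q, steps, answer in examples:
        worked = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
        parts.append(f"Q: {q}\n{worked}\nAnswer: {answer}")
    # End with the real question and the classic step-by-step cue.
    parts.append(f"Q: {question}\nLet's think step by step.")
    return "\n\n".join(parts)

prompt = chain_of_thought_prompt(
    "If a furnace runs 3 shifts of 8 hours, how many hours is that?",
    [("What is 2 shifts of 8 hours?",
      ["Each shift is 8 hours.", "2 x 8 = 16."], "16 hours")],
)
```

Tree-of-thought and graph-of-thought approaches generalize this by branching over several candidate step sequences instead of a single chain.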
Paco:
So there’s a lot of ways of leveraging sequence-to-sequence to come out with more structured results. So it’s not just a matter of generating chat text, it can be much more substantial.
Paco:
But, there’s another thing also that’s been going on. It started out with GANs, generative adversarial networks, and now it’s being done with large language models. It’s called diffusion.
Paco:
So the simple idea is, suppose you have a picture, something that’s interesting, some kind of image that you need. You know, take and knock out a few pixels at random, just with noise. And then, do that again. And do it in a sequence of noise-ification, if you will, until you’ve just got random noise. And now, turn that around so that you can start from a noisy state, and then see the trajectory of getting back to a complete picture. Now, if you repeat that hundreds of millions of times, and train a large model based on that, what that means is if you have a really noisy, staticky image, you can take and sharpen it up, and make it into something that actually is coherent.
So if you think of those two ideas, of being able to do diffusion, sort of run the noise backwards, and be able to do sequence-to-sequence, which is planning out a strategy and executing on it, at the base level, that’s what’s going on.
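The forward half of that diffusion idea can be sketched in a few lines: repeatedly add a little random noise to a clean signal until it is mostly noise. A real diffusion model trains a neural network to reverse this trajectory; the toy below only shows the noising schedule, on a short list of numbers instead of image pixels, with made-up parameters.

```python
import random

def forward_noise(signal, steps, noise_scale=0.5, seed=0):
    """Forward diffusion: add Gaussian noise step by step, recording
    the whole trajectory. A model is trained on the *reverse* of it."""
    rng = random.Random(seed)
    trajectory = [list(signal)]
    current = list(signal)
    for _ in range(steps):
        current = [x + rng.gauss(0, noise_scale) for x in current]
        trajectory.append(list(current))
    return trajectory

def distance(a, b):
    """Squared distance from the clean signal, to watch degradation."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

clean = [0.0, 1.0, 0.0, 1.0]
traj = forward_noise(clean, steps=10)
```

Training on hundreds of millions of such trajectories, run backwards, is what lets a model turn static into a coherent image.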
Larry:
That’s so interesting. As you talk about that, conceptually those are simple enough ideas. Well, not simple, but easy enough ideas to get your head around. The math underlying that just breaks my brain. My sister does GAN stuff in her work and I can’t even talk to her about it. But, anyhow. That points, again, back to that stack. This is all going on underneath. How do you think this’ll manifest as applications that content folks will be likely to be interacting with?
Paco:
Well, I’ve just come off some talks, and we can summarize about that later. I’ve come off some other conferences in AI. But, one of the things that we see out of industry, I mean heavy industry where they’re making really serious applications of these kind of technologies, one of the things that they’re saying is they’re very interested in SSMs. Small, specialized models.
Paco:
The idea of large language models, yes, theoretically very important. But, when you actually go to use stuff, you probably shrink it down. You probably start with a smaller model than what might be considered state-of-the-art at Google. And then, you train it on your own data. And then you probably have a lot of these running together, for a given application, fact checking each other if you will. So small, specialized models.
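A toy way to picture small, specialized models “fact checking each other” is a majority vote: run several narrow models on the same input and keep the answer most of them agree on. The “models” below are trivial keyword classifiers with invented names; real SSMs would be fine-tuned neural models trained on your own data.

```python
from collections import Counter

def majority_vote(models, document):
    """Run several small, specialized models on the same input and
    keep the answer most of them agree on."""
    votes = Counter(model(document) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical specialized "models": trivial keyword classifiers
# standing in for small fine-tuned models.
detects_patent = lambda d: "patent" if "patent" in d.lower() else "other"
detects_filing = lambda d: "patent" if "filed" in d.lower() else "other"
always_other = lambda d: "other"

label = majority_vote([detects_patent, detects_filing, always_other],
                      "Acme filed a patent application in 2021.")
```

An ensemble like this lets one model's mistake be outvoted by the others, which is part of the appeal over betting everything on a single large model.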
Paco:
The idea is that, to make use of the kinds of technology we talked about, like sequence-to-sequence and diffusion, you can think of applications where you’re using a number of different specialized models that you’ve built for your application, and it might be a range of things. It might be, gosh, I’ve seen this used for things as disparate as, say, steel mills and fishing in Japan. There’s just a range of different kinds of applications.
Paco:
One of the most fascinating ones I saw was going out on the boats, individual fishermen, most of whom are fairly old, who know every little nook and cranny off the coast of Japan. They know, on a given day in the year, at a given time, what kind of fish species they’re going to haul up, and how old each one will be, and whether or not that’s good for the catch. Maybe some of these, we want to let them grow first, because it’s bad for business in the long run not to use sustainable practices. It was amazing to me, at this one conference, this industrial conference, where you have a lot of people from different parts of the world. When you start talking about fishing in Japan, the folks from Japan get very passionate about that. Also, I love sushi, so I’ll second it.
Larry:
Well, that-
Paco:
So there’s a lot of different areas where this can be used.
Larry:
Yeah. And that one, as you’re talking about that, that’s actually quite … I’ll add a third use case to that, another disparate thing you can do there, is that a pretty common thing in various kinds of content design and content strategy practice is to consult subject matter experts and try to ensconce their expertise in different ways in your content systems. It’s not hard to imagine, from what you just said, you talk to the guys on the fishing boats and they’re like, “Oh yeah, you do that this way.” You could do that through business rules, or however you ensconce that in your systems. That seems super powerful. Can you think of other examples like that? Are there other-
Paco:
Oh, sure. Well, I think that, getting back to your point here for content, a lot of what I see is yeah, ChatGPT came out, it was a demo. It’s going to gain a lot of buzz and headlines. And yes, you can use chat to generate text, so gosh, all the writers can be put out of business, and then a lot of hand wringing. That was all noise, it was so spurious. Specious, really.
Paco:
The way these models are being used is much more to the point of, say, editorial than it is just generating content. As for generating a lot of content, I used to work in publishing, I used to manage editors, and I’m also an author, so I know for a fact that the world has way too much content already. It’s just not the good stuff. Generating more bad stuff is really not an issue, we have no problem doing that already. Just go check any of the movies that are flying through Hollywood, there’s a lot of really bad stuff at the top of the pipeline.
Paco:
But being able to take something that’s really important, and then really hone it down. Like, cut your word count down, and really tie it to message and, “Here are the talking points we’re trying to hit on.” Historically, because we’re an organization and we’ve been doing this, and we have a strategy that we’re executing on, those kinds of editorial decisions are where I see this AI technology coming in. And it’s exactly your point. The subject matter, the domain experts are critical.
Paco:
And in fact, when you look at manufacturing, I could boil down four points of what I’ve heard over the past month at the conferences. One of the main ones I talked about already was, yeah, large language models are interesting, but in practice you’re using a lot of small, specialized models. That’s one point. And even the high-end empirical analysis coming out of Berkeley on how to use this stuff is saying the same thing. They’re saying, “You can beat the price point by using a bunch of small, specialized models as opposed to using one large model.” Anyscale just published something about that.
Paco:
The second point feeds into it. Well, if you’ve got specialized models, that means your domain experts are pretty important, right? And getting the domain experts in to help build up your datasets is crucial. There’s a snag, though, and you see it very much in manufacturing. You see it especially in Asia, but worldwide when I talk with people. The thing that industry’s concerned about with AI is that the actual SMEs, the domain experts, are aging out of the population rapidly. People my age are thinking, “I’d like to retire.” I’m looking here at Northwest Spain and thinking, “This would be a nice place to retire.” The thing is, it’s the same if you go to Japanese fishing fleets, or to steel mills in Pittsburgh. They’re going to tell you, “The most crucial problem that we face, that keeps our executives up at night, is that the people who actually know how to keep this plant running are in their early 60s, and they’re not going to be around at our company much longer.”
Paco:
They absolutely are crucial to make AI work. How can we get them engaged and get that domain expertise in there? They’re really not the people who are worried about having their jobs replaced because yeah, they’ll probably retire before that would happen. But the other side of it, too, is we’re talking about jobs that can’t be articulated in a handbook. To be able to know when to shut down a furnace in a steel factory, it takes 10 years. 10 years of standing 10 feet away from molten steel to be able to hear when there’s a problem that would cause you to have to do maintenance on a furnace. And if you do it wrong, you lose millions of dollars a day. But if you don’t do that maintenance when it’s needed, you might lose hundreds of millions of dollars.
Paco:
It’s that domain expertise that is just so hard to train somebody from a manual, it takes years and years of experience. That’s the crucial part of AI. It’s not some chatbot. We see this in every industry, I’ve heard it everywhere. Sometimes, in places where you would think, “They’ve got a chemical factory. What’s their big problem?” Their big problem is actually the people who know how to merge the databases together, because you’ve got generations of mergers and acquisitions, and each time they’ve come in with different databases. Somebody has to spend 10 years just to figure out how to trace down the product SKUs. That’s actually their biggest problem.
Larry:
Yeah. I just want to say that you’re reminding me that there’s the domain expertise in the domain of all things content. Content design, engineering, management, operations. There’s probably content ops people, like Rahel Bailie, and people like that, who are just so attuned.
Paco:
Yeah.
Larry:
In our job, maybe that’s part of our job is there’s a whole meta content operations thing by getting that knowledge out of their brain and into the system. But hey, those were two. The LLMs versus SSMs, and the domain expertise. What are the other two take-homes you’ve had?
Paco:
Okay. Well, one big one is that your data, the datasets that you curate, is extremely important, because your data is going to be used. Maybe you’re going to train a model, if you have enough. If you have enough money, as in hundreds of millions of dollars, to train a large model. Probably more likely, you’re going to take an existing model and then fine-tune it. Again, if you have enough data. If you don’t, then you probably would use a technique like few-shot learning. But in any case, you’re definitely going to need to evaluate whatever you’re building, so you need your data for that.
Paco:
And this point about hallucination. The idea is, if you can ground the results in your actual data, your embeddings from your own data, as opposed to some public dataset, then what gets generated is going to be more faithful to what you need.
Paco:
So there are at least five reasons for having high-quality work on your own datasets from your domain experts. And having that be a central focus, because if you’re going to use this in business, you absolutely have to have your data tuned. That’s the thing that we really landed on with Argilla. Actually, our contact is argilla.io. We’re open source, and we provide tooling to be able to bring in your domain experts. When you find problems in what the models are doing, you can trace back to where to fix it, and then go in and get that human-in-the-loop feedback. Tuning your datasets is super crucial.
Larry:
Yeah, the way it-
Paco:
The fourth-
Larry:
Oh, I’m sorry. Go ahead. You have the fourth thing. Yeah, that’s right.
Paco:
Well, the fourth point, just before I miss it there. And I’ve heard this from business, not so much from the research side. But from everybody who’s making ROI on AI applications, I keep hearing the same thing reiterated. What you do is task analysis upfront, and you focus very narrowly on tasks that you understand very well, that your people are repeating over and over, and you look at how to augment those tasks. Because by definition then, you will be able to have data to decide if you’re doing it well or not. And if you don’t keep that narrow task focus, this all becomes just so amorphous. It goes sideways in a hurry.
Paco:
I’m referencing Brynjolfsson from Stanford HAI. There’s a really interesting character there, somebody who’s an economist at Stanford, but also has hands-on experience building AI models. You would think, “If a person were that central, what would they do?” Brynjolfsson has launched a startup called Workhelix, which does task analysis for enterprise. Before you go in and do your AI applications, let’s chart this out. Where are the touchpoints? That’s been my own practice, too, a really narrow focus on tasks. To recap: task analysis, dataset quality, domain experts, and small, specialized models.
Larry:
Yeah. Especially on that last point, there’s a common practice in information architecture a lot of content people do called top task analysis that this guy, Gerry McGovern, came up with. But that’s for our clients, and now we need to do that for ourselves as well.
Paco:
It’s also a process.
Larry:
It can help in both. I’m sure that we can do this … Hey, Paco, I can’t believe we’re coming up … I knew this would happen.
Paco:
I know, I know.
Larry:
It always goes way, way, way, way too fast. Yeah, I hope we can connect again. I’m sure I want to have you on the show 20 times, if you’re game. Hey, before we wrap though, is there any last thing? Anything that we haven’t gotten to or that you just want to make sure we share before we wrap up this episode?
Paco:
Okay, I mentioned contact info. We’re Derwen.ai. D-E-R-W-E-N.ai. And then, Argilla is A-R-G-I-L-L-A.io. That’s contact for a couple things I mentioned here I’m involved with.
Paco:
The one big takeaway I have is I think, on one side, you’ve got the arms race. Microsoft has really good stuff going on, especially in product definition, they’re really good. Research, though, they were falling behind Alphabet and Meta. It was very clear the AI research was falling behind and that seems to be a lot of the motivation for them to prop up OpenAI. Really, I think it’s pointed at Alphabet. That’s my own cynical takeaway is most of what they’re doing is taking shots at the leaders. Having said that, Google has really amazing technology internally. We’re not going to see most of it, it’s not going to see the light of day right away. But, there’s an arms race going on.
Paco:
On the other side, though, there’s this enormous ecosystem of open source. And Hugging Face is really the central player there that has been propping up the open source ecosystem in AI. What they appear to be doing behind the scenes is they will analyze and find out where’s the friction in the open source AI model ecosystem, and then they provide services to alleviate that friction. So behind the scenes, they’re orchestrating. There’s a lot of building blocks. You go to Hugging Face, there’s 350,000 models on Hugging Face. You can go run them on Hugging Face Spaces. There’s a lot of building block components, we really didn’t get into a lot of the nuts and bolts.
Paco:
But using models to augment a special task, like for instance, “I’ve got a PDF. Can I pull out the main entities that are being talked about in my PDF?” That’s a task. Once I’ve got two entities, can I say, “What are the relationships between them?” Here is King Edward, here is King Edward III. Is there a grandfather thing going on between them? These are ideas of tasks that you can apply to your data and get some really interesting semantic kinds of information out of it. There are a couple of open-source projects on Hugging Face that do those things. If you look at SpanMarker or OpenNRE, they would do those things respectively.
Paco:
In this whole realm of text, and more structured text, and images and videos, music, on, and on, and on, we’re seeing really amazing things to do with this kind of sequence-to-sequence and with the diffusion played backwards. We can’t really talk about it all in 30 minutes, but I hope we touched on some.
Larry:
That’s why I want to hold an open invitation to have you back, because it all changes so quickly, too. Actually, I’m kind of struck that yeah, it all changes, but Paco was there six years ago doing this stuff. It’s interesting that there are some underlying threads. So I want to keep stitching those threads back together as all this stuff unfolds. Well, thank you so much, Paco. This was a super fun conversation.
Paco:
Thank you, Larry. I really enjoyed talking with you. This was great.