早在1958年,汉斯·彼得·卢恩在他的论文《文献摘要的自动创建》中提出,"文章中词语出现的频率可以作为词语重要性的有用度量",这一观点直到现在仍然是信息检索科学中最重要的观点之一,并被所有知名的大小搜索引擎使用,从谷歌和雅虎到自定义搜索解决方案,如ElasticSearch和Manticore Search。
卢恩假设的重要性不言而喻,许多后续的信息检索科学研究都基于这一假设,尽管并非所有研究都明确提及这一点,可能是因为随着时间的推移,这已经成为公认的事实。
在本文中,我将尝试通过非常清晰的例子,展示TF(词频)及其对应的IDF(逆文档频率)如何帮助找到你想要的内容。
让我们以5个文档为例:
1. My dog doesn't like all my pets, but me and my wife do love them all. A dog cannot be a cat lover
2. Cats and dogs do not like each other, but my cat does like dogs
3. Walking a dog is a good start of the day
4. Not all cats like walking, but some cats do
5. All dogs like walking, but mine doesn't like. It's like so weird
并尝试对不同查询进行排名(即找出最相关和最不相关的文档,以及它们的顺序)。
权重 = 如果词语出现则为1,否则为0
查询:'like'。
请注意,我们不知道查询中的'like'是什么意思:是像"A像B"还是"我喜欢某物"。那么我们如何对文档进行排名?
首先想到的是,如果文档根本不包含'like'这个词,那么它可能相关性较低。因此排序可以是:
1. My dog doesn't <b>like</b> all my pets, but me and my wife do love them all. A dog cannot be a cat lover
2. Cats and dogs do not <b>like</b> each other, but my cat does <b>like</b> dogs
4. Not all cats <b>like</b> walking, but some cats do
5. All dogs <b>like</b> walking, but mine doesn't <b>like</b>. It's <b>like</b> so weird
3. Walking a dog is a good start of the day
我们将文档#3放在最后,因为它不包含'like'。至于其他文档,由于它们都包含'like',所以没有改变它们的顺序。这种排名的质量够好吗?我不这么认为,因为在前4个文档中,#5和#2似乎比其他文档更相关,因为它们提供了更多与查询词'like'相关的事实(记住我们不知道'like'具体指什么),但它们并不是前两名。那么我们该怎么办?
权重 = 词频
查询:'like'。
如前所述,根据卢恩的假设,包含更多查询词出现次数的文档可能更相关。让我们按照词频对文档进行排名(注意,在信息检索领域,"频率"通常只是指计数,而不是像物理学中那样除以某个值):
5. All dogs <b>like</b> walking, but mine doesn't <b>like</b>. It's <b>like</b> so weird | tf = 3
2. Cats and dogs do not <b>like</b> each other, but my cat does <b>like</b> dogs | tf = 2
1. My dog doesn't <b>like</b> all my pets, but me and my wife do love them all. A dog cannot be a cat lover | tf = 1
4. Not all cats <b>like</b> walking, but some cats do | tf = 1
3. Walking a dog is a good start of the day | tf = 0
这解决了问题,现在文档#5和#2位于顶部。
但现在让我们尝试另一个查询 - 'my day':
1. <b>My</b> dog doesn't like all <b>my</b> pets, but me and <b>my</b> wife do love them all. A dog cannot be a cat lover | tf = 3
2. Cats and dogs do not like each other, but <b>my</b> cat does like dogs | tf = 1
3. Walking a dog is a good start of the <b>day</b> | tf = 1
4. Not all cats like walking, but some cats do | tf = 0
5. All dogs like walking, but mine doesn't like. It's like so weird | tf = 0
文档#1和#2获得最高位置,但这正确吗?它们对'day'只字未提,只提到了'my':我的宠物,我的妻子,我的猫,这可能对查询很重要,但很可能不是。文档#3的位置较低,而它似乎是最相关的。我们如何将它提高?
让我们思考'my'与'day'和其他词的不同之处。'day'似乎更具体,而'my'几乎可以用于任何事物,并且在文档中'my'出现的频率远高于'day'和其他更具体的词。这里就涉及到了"文档频率" - 术语权重公式的另一个重要部分和第二个基石。
1972年,英国计算机科学家卡伦·斯帕克·琼斯在她的一篇学术论文中说:"文档描述的详尽性是其包含的术语数量,而术语的特异性是它所涉及的文档数量"。
我们已经涵盖了前一部分,现在让我们看看如果计算后一部分,文档顺序会如何变化。
权重 = 词频 * (文档总数 / 文档频率)
查询:'my day'。
'My'出现在2个文档中,'day'只出现在1个,所以'my'的IDF(逆文档频率)是5(文档总数)/ 2 = 2.5,'day'是5/1 = 5。
1. <b>My</b> dog doesn't like all <b>my</b> pets, but me and <b>my</b> wife do love them all. A dog cannot be a cat lover
Weight = 3(tf of 'my') * 2,5(idf of 'my') + 0(tf of 'day') * 5 (its idf) = 7,5
3. Walking a dog is a good start of the <b>day</b>
Weight = 0 * 2,5 + 1 * 5 = 5
2. Cats and dogs do not like each other, but <b>my</b> cat does like dogs
Weight = 2,5
4. Not all cats like walking, but some cats do
Weight = 0
5. All dogs like walking, but mine doesn't like. It's like so weird
Weight = 0
借助IDF,我们可以将相关文档"遛狗是一天的好开始"提高一个位置,但它仍然不在最顶部。原因是什么?那是因为'my'出现得太频繁,这抵消了'day'的IDF重要性。我们如何解决?我们需要降低词频(TF)的贡献。多年来,人们提出并研究了许多TF归一化的变体,从简单的log(TF)到所谓的"词频饱和曲线"。为简单起见,我们就采用对数变体。
权重 = log(1+TF) * (N/df)
对数平滑的整体思想是,参数越大,增长越慢,即对于较小的x增长很快,对于较大的x增长较慢。函数中的'1+x'是为了避免词频为0时参数为0,因为log(0)是未定义的,而log(0+1)是0,这正是我们对tf=0所需要的。
让我们看看它是否解决了问题:
查询:'my day'。
3. Walking a dog is a good start of the <b>day</b>
Weight = log(1+0) * 2,5 + log(1+1) * 5 = 0 + 3,47 = 3,47
1. <b>My</b> dog doesn't like all <b>my</b> pets, but me and <b>my</b> wife do love them all. A dog cannot be a cat lover
Weight = log(1+3) * 2,5 + log(1+0) * 5 = 3,47 + 0 = 3,47
2. Cats and dogs do not like each other, but <b>my</b> cat does like dogs
Weight = 1,73
4. Not all cats like walking, but some cats do
Weight = 0
5. All dogs like walking, but mine doesn't like. It's like so weird
Weight = 0
正如我们所见,最相关的文档最终位于第1位,但它与另一个权重相同的文档共享这个位置。这对不对?这是个难以确定的问题,因为我们永远无法确切知道请求者认为什么是相关的。有可能在询问'my day'时,他或她也会认为第二个文档是相关的,因为它为'my'提供了多个事实,而第一个文档只说了'day'的一个事实。所以它们获得相同权重可能是相当公平的。
正因为这种不确定性,搜索引擎通常不返回单一结果,而是返回10个或20个结果,让用户自己做最终决定。
结论
在本文中,我以简要的方式展示了TF-IDF权重公式思想的演进,证明了它的有效性,并且每个组成部分都很重要。
There's a continuation of the TF-IDF called BM25 which is based on the previously mentioned 'term frequency saturation curve' in particular and just much more probabilistic approaches to defining the weighting fomula (remember again we never know what is really relevant, hence the probabilistic approach). In general it still remains TF-IDF (+ normalization by document length), but its roots are completely different. It will be covered in another article. Stay with us and subscribe to our newsletters to not miss that.