Evaluation, Topic Search



Our analysis of the Bing search log from September 2007 to June 2009 shows that about 62% of the queries contain at least one concept term. More detailed analysis revealed that common web queries fall into a number of distinct patterns. The following five basic patterns account for the majority of Bing queries during that period:

1. Single Entity (E)
2. Single Concept (C)
3. Single Entity + Attributes (E+A)
4. Single Concept + Attributes (C+A)
5. Single Concept + Keywords (C+K)

These patterns can be combined to form more complex patterns. In the paper, we focus on one of them:

Concept + Keywords + Concept (C+K+C)

To evaluate the performance of online query processing, we created a set of benchmark queries that contain concepts, entities, and attributes, for example "politicians commit crimes" (C+K+C), "large companies in chicago" (C+K), and "president washington quotes" (E+A). The first six tables list the queries we used, 10 for each pattern. The E, C, E+A, and C+A queries were randomly selected from Bing's search log, so the Ranking column in those tables shows each query's rank by frequency.
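The pattern labels can be read off mechanically from the annotations used in the benchmark tables on this page, where parentheses mark an entity, square brackets a concept, and angle brackets an attribute. The following Python sketch only illustrates that notation; it is not the query processing described in the paper. The function name pattern_of and the rule that adjacent unmarked words collapse into a single keyword run (K) are assumptions made for this example.

```python
import re

# One alternative per annotation: (entity), [concept], <attribute>,
# and a fallback for any unmarked word, which is treated as a keyword.
_TOKEN = re.compile(r"\(([^)]*)\)|\[([^\]]*)\]|<([^>]*)>|([^\s()\[\]<>]+)")


def pattern_of(annotated_query: str) -> str:
    """Return a label such as 'C+K+C' for an annotated benchmark query."""
    labels = []
    for match in _TOKEN.finditer(annotated_query):
        if match.group(1) is not None:
            kind = "E"
        elif match.group(2) is not None:
            kind = "C"
        elif match.group(3) is not None:
            kind = "A"
        else:
            kind = "K"
        # Assumption: consecutive keywords count as one K.
        if kind == "K" and labels and labels[-1] == "K":
            continue
        labels.append(kind)
    return "+".join(labels)


print(pattern_of("[politicians] commit [crimes]"))    # C+K+C
print(pattern_of("(president washington) <quotes>"))  # E+A
```

Because the sketch preserves token order, a keyword-first query such as "las vegas [outdoor activities]" comes out as "K+C", although the benchmark lists it under C+K.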

One assumption we make in the paper is that the association between an entity term and a keyword can be estimated using simple two-way word association; the paper describes this in detail. Here we take the 10 benchmark C+K queries, substitute entities for the concept in each of them to generate a set of E+K queries, and highlight the pivot word we found in some of these E+K queries. The last table shows the results.
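The substitution step can be sketched as follows. This is a minimal Python illustration, assuming a hypothetical dictionary entities_of that maps a concept to some of its entities; the function name expand_concept is likewise made up for this example, and the entities shown are the ones listed for CK-9 in the last table. How entities are actually chosen for each concept, and how pivot words are found, is not covered by this sketch.

```python
import re

# Matches the first [concept] annotation in a C+K query.
_CONCEPT = re.compile(r"\[([^\]]+)\]")


def expand_concept(ck_query, entities_of):
    """Replace the [concept] in a C+K query with each (entity) in turn."""
    match = _CONCEPT.search(ck_query)
    if match is None:
        return []
    concept = match.group(1)
    prefix = ck_query[:match.start()]
    suffix = ck_query[match.end():]
    # Hypothetical lookup: unknown concepts simply yield no E+K queries.
    return [f"{prefix}({entity}){suffix}" for entity in entities_of.get(concept, [])]


# Hypothetical concept-to-entity mapping, filled with the CK-9 entities.
entities_of = {"astronauts": ["john glenn", "neil armstrong", "michael collins"]}

for ek_query in expand_concept("[astronauts] fly to the moon", entities_of):
    print(ek_query)
# (john glenn) fly to the moon
# (neil armstrong) fly to the moon
# (michael collins) fly to the moon
```

The keywords on either side of the concept are kept verbatim, so keyword-first queries such as "las vegas [outdoor activities]" are handled the same way.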

  • Single Entity (E)

    These 10 queries were randomly selected from Bing's two-year search log. Freq. is the query's frequency and Ranking is its rank by frequency; the same applies to the tables below.

    #     E Queries               Freq.    Ranking
    E-1   (house beautiful)       27285      83899
    E-2   (borland)               15628     146134
    E-3   (witco)                  2366     911523
    E-4   (hicksville)             1408    1490513
    E-5   (alan taylor)             751    2691832
    E-6   (condobolin)              654    3060796
    E-7   (george low)              216    8569491
    E-8   (pigmy love circus)       199    9235273
    E-9   (still bill)              117   14965256
    E-10  (kip hanrahan)             99   17425678

  • Single Concept (C)

    #     C Queries                 Freq.    Ranking
    C-1   [cars]                  2286411        528
    C-2   [online services]         23729      96622
    C-3   [american artists]        12421     183182
    C-4   [boutiques]               10815     209807
    C-5   [classic fairy tales]      7154     314209
    C-6   [british authors]          3311     662012
    C-7   [red sox players]          1404    1493864
    C-8   [italian composers]         716    2813231
    C-9   [media conglomerates]       297    6384077
    C-10  [mainstream movies]         105   16564335

  • Single Entity + Attributes (E+A)

    #      E+A Queries                      Freq.    Ranking
    EA-1   (pedro infante) <music>            658    3043603
    EA-2   (franklin) <time>                   84   20166776
    EA-3   (assurant) <employees>              70   23591037
    EA-4   (king arthur) <music>               58   27870316
    EA-5   (phil harris) <age>                 30   50799136
    EA-6   (david beckham) <quote>             28   53243346
    EA-7   (chennai) <place>                   17   85110446
    EA-8   (president washington) <quotes>     16   87769744
    EA-9   (aaron neville) <album>             10  132744902
    EA-10  (peanuts) <artist>                  10  132744902

  • Single Concept + Attributes (C+A)

    #      C+A Queries                        Freq.    Ranking
    CA-1   [famous people] <birthdays>         2116    1012793
    CA-2   [common allergies] <symptoms>       1424    1474277
    CA-3   [movies] <quotes>                    853    2389110
    CA-4   [professional boxers] <champions>    145   12310179
    CA-5   [violinists] <tool>                   28   52969723
    CA-6   [films] <soundtracks>                 21   68772582
    CA-7   [images] <cameras>                    21   68772582
    CA-8   [herbal supplements] <energy>         13  109177365
    CA-9   [nba players] <position>              12  113769727
    CA-10  [funds] <home page>                   11  126590746

  • Single Concept + Keywords (C+K)

    #      C+K Queries
    CK-1   [east asian countries] with nuclear capability
    CK-2   [american cities] sigmod
    CK-3   [large companies] in chicago
    CK-4   las vegas [outdoor activities]
    CK-5   [international organizations] focus on environmental protection
    CK-6   horse [medical conditions]
    CK-7   [name brands] in chinese market
    CK-8   [football players] own goal
    CK-9   [astronauts] fly to the moon
    CK-10  [famous people] bribery

  • Concept + Keywords + Concept (C+K+C)

    #       C+K+C Queries
    CKC-1   [companies] buy [tech companies]
    CKC-2   [politicians] commit [crimes]
    CKC-3   [extreme sports] in [asian countries]
    CKC-4   [database conferences] in [european cities]
    CKC-5   [presidents] graduated from [universities]
    CKC-6   [rivers] flow into [seas]
    CKC-7   [actors] marry [actresses]
    CKC-8   [football stars] join [football teams]
    CKC-9   [cars] owned by [celebrities]
    CKC-10  [peoples] believe in [religions]

  • Pivot words

    #      Pivot Word in Each Query
    CK-1   (pr china) with nuclear capability
           (republic korea) with nuclear capability
           (north korea) with nuclear capability
    CK-2   (san diego) sigmod
           (washington dc) sigmod
           (ann arbor)¹ sigmod
    CK-3   (general electric) in chicago
           (american express) in chicago
           (microsoft corp) in chicago
    CK-4   las vegas (american football)
           las vegas (water sport)
           las vegas (nascar race)
    CK-5   (family health international) focus on ...
           (world health organization) focus on ...
           (the pan american health organization) focus on ...
    CK-6   horse (high blood pressure)
           horse (low blood sugar)
           horse (skin cancer)
    CK-7   (new balance) in chinese market
           (calvin klein) in chinese market
           (hewlett packard) in chinese market
    CK-8   (paul robinson) own goal
           (graham alexander) own goal
           (jonathan woodgate) own goal
    CK-9   (john glenn) fly to the moon
           (neil armstrong) fly to the moon
           (michael collins) fly to the moon
    CK-10  (james brown) bribery
           (george bush) bribery
           (bill clinton) bribery

    ¹ Both are pivot words.