Re: Search engines



> One question: how and where can one find out about the reliability,
> and more generally about the workings, of the various "search
> engines"?
There are a few comparative overviews of search engines on the net,
most of which have probably already been mentioned on this list. None
of them can be called complete or of scholarly standard. Several
library science institutes are, however, currently carrying out such
studies.

> For example, Lycos treats an input A B as A "or" B. By what
> criteria, then, are the hits sorted for display? Sometimes I have had
> the impression that the hits shown first are the ones that would
> match a search for A "and" B. Is that right?
Lycos works mainly with "probabilistic retrieval", but recently it has
also become possible to run Boolean searches from the search form.
See the excerpt below (Search software:) from my February compilation
on some important search engines.
Only a few search engine developers publish the exact workings of
their software or their ranking algorithms.
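
To illustrate why hits containing both A and B can surface first even
when the query is evaluated as A "or" B, here is a minimal sketch of
ranked retrieval. It is not the Lycos algorithm, whose details are not
published. The function names `score` and `rank` and all weights are
assumptions, chosen only to reflect the preferences named in the
excerpt below (more matched terms, more occurrences, earlier
occurrences, a score threshold).

```python
def score(query_terms, doc_words):
    """Toy relevance score. All weights are illustrative assumptions."""
    total = 0.0
    for term in query_terms:
        positions = [i for i, w in enumerate(doc_words) if w == term]
        if not positions:
            continue
        total += 10.0                      # another distinct query term matched
        total += min(len(positions), 5)    # more occurrences, capped at 5
        total += 1.0 / (1 + positions[0])  # bonus for an early first occurrence
    return total


def rank(query, docs, threshold=0.0):
    """Return document names sorted by descending score, above the threshold."""
    terms = query.lower().split()
    scored = [(score(terms, text.lower().split()), name)
              for name, text in docs.items()]
    hits = [(s, name) for s, name in scored if s > threshold]
    return [name for s, name in sorted(hits, key=lambda h: (-h[0], h[1]))]


docs = {
    "both": "library catalog search",
    "only_a": "library library library hours",
    "only_b": "the catalog",
    "none": "weather report",
}
print(rank("library catalog", docs))
```

With these assumed weights a single term can contribute at most 16
points, while two matched terms always contribute more than 20, so a
document matching both terms outranks any document matching only one.
That reproduces the observed effect without a Boolean "and".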

> Or does the frequency with which a URL is "used" play a role?
No.

> Second question: is my assumption correct that Lycos "evaluates"
> every word of every file?
"What is indexed: document titles, headings, links, content: 100 most
"weighty" words (using an algorithm which considers word placement and
frequencies, among other factors) from the documents; first 20 lines; size in
bytes and number of words." (see below)
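
As a rough illustration of what selecting the "most weighty" words
could look like, here is a minimal sketch. The weighting formula, the
function name `weighty_words` and the parameter `title_boost` are
assumptions; the quoted description only says that word placement and
frequency are among the factors.

```python
from collections import Counter


def weighty_words(title, body, n=100, title_boost=3.0):
    """Return the n highest-weighted words. The weighting is an assumption."""
    weights = Counter()
    for word in title.lower().split():
        weights[word] += title_boost          # placement: word appears in the title
    for pos, word in enumerate(body.lower().split()):
        # frequency: +1 per occurrence; placement: bonus shrinking with position
        weights[word] += 1.0 + 1.0 / (1 + pos)
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


print(weighty_words("Lund Library", "library opening hours library", n=2))
```

A real indexer would presumably also strip punctuation and ignore
very common words; the sketch leaves that out to stay short.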

---
From: Searching the Web - Systematic overview over indexes
http://www.ub2.lu.se/tk/websearch_systemat.html
(The original also contains the links to Lycos's own documentation.)

1 Indexes for all types of Web resources 

A Singular search engines:

A1 Spider/robot/wanderer/crawler based indexes:

Title: Lycos. The Catalog of the Internet
Publisher: Carnegie Mellon University's Center for Machine Translation
History: Announced June 1994
Volume: Feb. 17th Lycos catalog (1.89 million unique URLs found between
Nov. 21 and Feb. 16th, including 318,464 documents actually retrieved) (cf.
Lycos News). 338,236 URLs and their links, 3039 WWW servers, 573
WWW Home Pages, 4158 Gopher servers, 547 FTP servers 
Type coverage: random choice from URL references to HTTP, Gopher and
FTP in the start resources; registering and deleting pages possible;
preferences for documents with multiple links into them and with shorter
URLs
Geographic coverage: see above
Subject coverage: see above
What is indexed: document titles, headings, links, content: 100 most
"weighty" words (using an algorithm which considers word placement and
frequencies, among other factors) from the documents; first 20 lines; size in
bytes and number of words
Search entries: Keywords. Boolean queries (from Sept 1995).
Phrases/adjacency not yet implemented. (cf. Lycos Search Language)
Indexing software: Lycos web explorer, can bring in 5,000 documents per
day
Search software: Lycos 0.9beta10. The Pursuit search engine provides
probabilistic retrieval from the catalog, taking a user's query and returning
a sorted list of hits (the list is sorted by match score, and only documents
with scores above the threshold are retrieved). The searcher will prefer
documents that match more of your search terms, that match your terms
more closely (glow matches glows better than glowworm), that have more
occurrences of any one term, and occurrences earlier in the document.
Appending a period (.) to a term forces an exact match on that term.
Negation and prefix matching possible to influence the results and weights.
Result displayed: all indexed information (link, document outline, keyword
list, excerpt, size). Possible to set the number of hits in the forms-based
search.
Update period: The index is updated weekly.
Performance: 5 computers handling up to 106,000 users per week. Good times
to search are before 11am EST, or after 6pm. Heavy usage.
Remarks: largest, most up-to-date, good harvesting and indexing
principles. Server lists and frequency statistics available in Lycos Results.
Criticism: old versions of pages, fetched since Nov 21st, are retained
Description: December/Randall: "WorldWideWeb Unleashed". SAMS.
(1994) pp.399-400; Documentation and Frequently Asked Questions etc. at
the Lycos home page. "Web-Agent Related Research at the CMT",
Mauldin & Leavitt, Proceedings of the ACM Special Interest Group on
Networked Information Discovery and Retrieval (SIGNIDR-94), August
1994 (slides for SIGNIDR-94 talk).
"New Spiders Roam the Web", John December, Computer-Mediated
Communication Magazine, 1(5), Sep. 1, 1994.

---
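
The term-matching rules in the excerpt above (a closer match scores
higher, a trailing period forces an exact match) can be sketched as
follows. This is only an illustration: the function name `term_match`,
the assumption that unmarked terms match as prefixes, and the closeness
formula (query length divided by word length) are mine, not taken from
the Lycos documentation.

```python
def term_match(query_term, word):
    """Return a closeness score between 0 and 1; 0 means no match.

    Assumptions: unmarked terms match as prefixes, and closeness is
    len(query term) / len(word).
    """
    exact = query_term.endswith(".")
    stem = query_term[:-1] if exact else query_term
    if not stem or not word:
        return 0.0
    if exact:
        # A trailing period allows only the identical word.
        return 1.0 if word == stem else 0.0
    if word.startswith(stem):
        return len(stem) / len(word)
    return 0.0


for word in ("glow", "glows", "glowworm"):
    print(word, term_match("glow", word), term_match("glow.", word))
```

Under these assumptions "glow" scores 0.8 against "glows" and 0.5
against "glowworm", while "glow." matches only "glow" itself.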

Best regards,
Traugott Koch
+---------------------------------------------------------------------+
| TRAUGOTT KOCH,    Electronic information services librarian         |
|  LUND UNIVERSITY LIBRARY, Development Department NetLab,            |
|  P.O. Box 3.  S-221 00  Lund, Sweden                                |
| Tel: int+46 46 2229233    Fax: int+46 46 2223682 or 2224422         |
| E-mail (Internet): traugott.koch _at__ ub2.lu.se                          |
| URL:<A HREF="http://www.ub2.lu.se/person_tk.html">Traugott Koch</A>,|
|<A HREF="http://www.ub2.lu.se/">"Lund Univ. Electronic Library"</A>  |
+---------------------------------------------------------------------+



List information at http://www.inetbib.de.