
Search engines: Google indexes PowerPoint, PostScript, etc.



Dear Inetbib members,

some news from the search engine Google. In addition to PDF files,
Google now also indexes PowerPoint slides, Excel and Word files,
files in RTF format, and PostScript files. As a result, more and more
areas of the so-called "invisible web" are becoming visible, that is,
searchable.
Google has already indexed about 22 million PDF files.

In the result list these files are marked much like PDF documents,
namely with the file extension in square brackets in front of the page
title. Below that, the file format is stated once more in plain text.
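
As a rough illustration of that listing layout, here is a minimal
Python sketch. The extensions are the ones named in this message; the
dictionary FORMAT_NAMES, the function format_result, and the exact
wording of the plain-text line are assumptions made for the example,
not Google's actual output.

```python
# Hypothetical mapping from file extension to a readable format name.
# The extensions come from the message; the wording of the names is assumed.
FORMAT_NAMES = {
    "pdf": "Adobe PDF",
    "doc": "Microsoft Word",
    "xls": "Microsoft Excel",
    "ppt": "Microsoft PowerPoint",
    "rtf": "Rich Text Format",
    "ps": "PostScript",
}


def format_result(title: str, extension: str) -> str:
    """Return a two-line listing: bracketed extension and title, then the format name."""
    ext = extension.lower().lstrip(".")
    # Unknown extensions fall back to the upper-cased extension itself.
    name = FORMAT_NAMES.get(ext, ext.upper())
    return f"[{ext}] {title}\nFile format: {name}"


print(format_result("Search engines overview", ".PPT"))
```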

You can search directly for such files with the "filetype:" field
followed by the file extension (ppt, xls, doc, rtf, ps). For example,
the query
suchmaschinen filetype:ppt
(http://www.google.com/search?hl=de&q=suchmaschinen+filetype%3Appt&lr=)
returns a respectable 308 hits.
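
For anyone who wants to generate such links automatically, the
following Python sketch builds the same kind of URL. The parameter
names (hl, q, lr) are simply copied from the link above; the function
name google_filetype_url is made up for this example, and whether
Google accepts other parameter combinations is not something this
message can confirm.

```python
from urllib.parse import urlencode


def google_filetype_url(terms: str, extension: str, lang: str = "de") -> str:
    """Build a search URL that restricts results to one file extension."""
    query = f"{terms} filetype:{extension}"
    # Parameter names are taken from the example URL in the message;
    # lr is left empty, as it is there.
    params = {"hl": lang, "q": query, "lr": ""}
    return "http://www.google.com/search?" + urlencode(params)


print(google_filetype_url("suchmaschinen", "ppt"))
```

urlencode turns the space into "+" and the colon into "%3A", so the
printed link matches the one quoted above.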

I am forwarding the complete story from the "Search-Day-Newsletter":

<zitat>
Google has quietly extended the scope of its web index, for the first
time including a number of file formats that are all but ignored by
other search engines.  These file formats make up a small but important
part of the Invisible web, and Google's effort to make them searchable
is a noteworthy advance in search engine technology.

The new file types indexed by Google include Microsoft Word, Excel and
PowerPoint formats, as well as Rich Text Format and PostScript files,
according to Google spokesperson Cindy McCaffrey. Search engines
traditionally have snubbed these file types in favor of those in the far
more common HTML format, which is widely accepted as the universal
standard for web pages.

Google was the first major search engine to tackle non-HTML web content
in a large way, when it began indexing Adobe Portable Document (PDF)
files in January and February of this year.  Google's index now contains
more than 22 million PDF files.

Result listings for the new file types look similar to PDF results,
prefaced by a bracketed label to the left of the document title
indicating its file type.  These new labels are straightforward, simply
using the document's extension to indicate type.  The labels are [doc]
for Word documents, [xls] for Excel spreadsheets, [ppt] for PowerPoint
presentations, [rtf] for Rich Text Format documents, and [ps] for
PostScript documents.

For many types of searches, you may not see any results that include
these file types, for a number of reasons.  Google is gradually rolling
out the capability to its data centers around the world.  The new file
formats are available at two of Google's data centers immediately, and
will be fully accessible worldwide by early next week, according to
Google spokesperson David Krane.

Another reason that the new formats may not show up in results is that
relatively few files in these formats exist on the web, compared to the
HTML files that make up the majority of the overall Google index.
 
Although Google declined to release specifics about how many documents
in the new file formats it has indexed, informal testing suggests that
the new formats represent just a fraction of the 1,610,476,000 total
pages Google currently claims are accessible in its index.  And since
non-HTML files traditionally haven't been regarded by most users as web
documents, it's highly unlikely that many of these documents have links
pointing to them from other web pages -- a key factor in how Google
determines relevance.

The best way to find information in the new document formats is to
restrict your search to a particular file type, using Google's
"filetype" operator.  For example:

zamboni filetype:doc
"2000 census" filetype:xls
"investment strategy" filetype:ppt

Google offers two methods for viewing the new file types.  You can view
a document in its native format by clicking its title in the result
list.  If you're running Internet Explorer, the document will open
directly in your browser window.  If you're running Netscape Navigator
or another browser, a pop-up box will ask you whether you want to open
the document or save it to disk.  In either case, you run a possible
security risk by opening documents that might be infected with a virus
or worm.

To help users avoid this safety risk, results for these file types
include a "View as HTML" option.  Clicking this link opens a bare-bones
copy of the document that has been stored on Google's servers, one that
carries no risk of infecting your computer.

Don't expect the titles of documents in search results to always make
sense.  Although Word, Excel and PowerPoint documents have options for
specifying titles, few document authors bother using them.  If Google
doesn't find a document title in the document's properties, it tries to
extract a title from the first lines of the document.  If Google can't
sensibly determine the title, it simply uses the URL of the file.

Although we're still a long way from being able to use search engines to
find information in online databases -- the mother lode of the Invisible
web -- Google's addition of these new file types is another welcome step
along the path of being able to find information of virtually any type
in the huge expanses of the web.

Google Does PDF & Other Changes
http://www.searchenginewatch.com/sereport/01/02-google.html
Google now includes listings of Adobe PDF files from across the web, a
first for any major search engine and a feature long overdue for them to
offer.

How Google Works
http://www.searchenginewatch.com/subscribers/google.html
A detailed look under the hood at all aspects of Google's operation.

(A longer, more detailed version of this article is available to Search
Engine Watch members at
<http://searchenginewatch.com/subscribers/articles/0110-google-filetypes.html>)
</zitat>
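
The title fallback described in the quoted article (document
properties first, then the first lines of the document, then the URL)
can be sketched in Python roughly as follows. The function choose_title
and the test for what counts as a "sensible" title are assumptions for
illustration; the article does not say how Google actually decides.

```python
from typing import Optional, Sequence


def choose_title(properties_title: Optional[str],
                 first_lines: Sequence[str],
                 url: str) -> str:
    """Pick a display title: document properties, then first lines, then the URL."""
    if properties_title and properties_title.strip():
        return properties_title.strip()
    for line in first_lines:
        candidate = line.strip()
        # Assumed heuristic: a usable title is non-empty and
        # contains at least one letter or digit.
        if candidate and any(ch.isalnum() for ch in candidate):
            return candidate
    # Nothing usable was found, so fall back to the file's URL.
    return url


print(choose_title(None, ["", "---", "Budget 2001"], "http://example.org/b.xls"))
```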

Best regards

Sebastian Wolf

-- 
-------------------------------------
- Sebastian Wolf                    -
- UB Bielefeld ; Internet-Gruppe    -
- Tel.: 0521 / 106-4032             -
- E-Mail: wolf _at__ ub.uni-bielefeld.de  -
-------------------------------------



List information at http://www.inetbib.de.