
Re: Google images -- what won't they index?

  • Subject: Re: Google images -- what won't they index?
  • From: David Dyer-Bennet <dd-b@xxxxxxxx>
  • Date: 18 Sep 2005 20:25:56 -0500
  • Newsgroups: alt.www.webmaster
  • References: <874q8jpob4.fsf@gw.dd-b.net> <dgjmoc$kat$1@godfrey.mcc.ac.uk>
  • User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.4
  • Xref: news.mcc.ac.uk alt.www.webmaster:289341
Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx> writes:

> __/ [David Dyer-Bennet] on Saturday 17 September 2005 19:00 \__
> 
> > I don't mean content, I mean what technical things about the way an
> > image is presented on a page will prevent them from indexing them?
> > And I mean in images.google.com, NOT the basic web search.
> 
> 
> You can use metadata to prevent crawling. You must ensure that crawlers have
> no path via which they descend to images. Image files themselves have little
> or no information associated with them (magic number et cetera?).

And since I *want* these indexed (well, not desperately; but would
prefer), I have avoided all these things that would block it.
robots.txt doesn't block those directories, there's no robots meta tag
with "noindex" (I generally just use robots.txt), etc.
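
One way to double-check that robots.txt really leaves a directory open is
to run the rules through a parser. The sketch below is Python rather than
the Perl the site uses. The robots.txt content is a made-up example (the
real file is not shown in this thread), and is_allowed is a hypothetical
helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


gallery_url = "http://dd-b.lighthunters.net/gallery/macro-sept-2005/"
# Googlebot-Image is the user agent Google uses for image crawling.
print(is_allowed(ROBOTS_TXT, "Googlebot-Image", gallery_url))
```

Running the same check against the thumbnail directory and the picpage
URLs would show whether either one is excluded by accident.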

Actually, image files can carry a lot of text information these
days -- EXIF and IPTC data in particular.

> > No money rides on this (for me, I mean; I'm sure some people have
> > business models that depend on google indexing their images), I'm just
> > curious.
> > 
> > My online images are mostly presented in thumbnail pages (static html;
> > generated from a script just once, not on-demand), with the thumbnails
> > linking to a script that produces a page with the full-size image and
> > other associated information (caption, tech info, navigation links to
> > walk through the gallery without returning to the thumbnail;
> > conventional stuff for a photo gallery).
> 
> 
> Thumbnails will often get indexed before the full-sized image equivalents. I
> think that's rather intuitive. Just as a viewer sees thumbnails first, so
> will a crawler (bot).
> 
> Wait until more crawling is completed. That would be my advice. The Images
> index in Google gets refreshed 2 or 3 times a year, but crawling remains
> persistent all year.

This ongoing issue goes back many years for me. 

> > And the full-size images mostly don't get indexed.  In a few places
> > where I've put up a static page with inline images, they *do* get
> > indexed -- suggesting that Google is willing to index images on my
> > site.  And the logs show that google does spider me pretty regularly.
> 
> 
> How big are the full-sized images? I have seen galleries with 4 MB JPEGs
> that are barely compressed.

Mostly under 60k, a very few up to 200k.  

> > I've been poking slowly at my problem here for several years now.  I
> > suspect that google is unhappy with some of the headers my script for
> > the full-size image page produces.  I've been playing with those,
> > trying to get them as vanilla as possible...
> 
> 
> I suggest that you don't intervene until you get definite answers. You might
> be throwing away valuable time, damaging your site in the process.

But if I *don't* ever play with things, I can be sure things *won't*
ever improve.  As I say, I've been chipping away slowly at this for
the last several years.  Some of these galleries originated on the web
in their original form in 1994 or 1993.  

> What is the nature of these scripts?

There's basically one script that's currently in question, which I
call "picpage" (it'll be useful, perhaps, to be able to refer to it
specifically by name if the discussion continues).  It's invoked when
a visitor clicks on a thumbnail on the index page for the gallery, and
it sends a page containing some navigation, some information about the
image from various sources, and the "big" (screen resolution) image
inlined.  

For example, <http://dd-b.lighthunters.net/gallery/macro-sept-2005/>
is the thumbnail page for a recent set of macro photos I took.  (That
page is static HTML sitting on the server; that HTML was generated by
a script when I created this gallery, but I don't think that's
relevant -- the HTML is generated by the perl CGI library, and looks
ordinary enough to me.)  

If you click on the "Bee shadow on morning glory" thumbnail, the URL
fetched is
<http://dd-b.lighthunters.net/gal/picpage/gallery/macro-sept-2005?id=ddb%2020050915%20010-048>.
This invokes the picpage script ("gal" is configured as a mod_perl
directory), the additional path information specifies which directory,
and the id parameter specifies which image in that directory.  Picpage
generates and sends a page to display that one photo, with caption and
other information, and with navigation links, in particular to the
previous and next image in that gallery. 
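
The split between script name, extra path information, and the id
parameter can be sketched as follows. This is Python, not the mod_perl
code picpage actually uses, and split_picpage_url is a hypothetical
helper. The "/gal/picpage" prefix is taken from the URL above.

```python
from urllib.parse import parse_qs, urlsplit


def split_picpage_url(url, script_prefix="/gal/picpage"):
    """Split a picpage URL into (path_info, image_id).

    path_info is the part of the path after the script name (the gallery
    directory); image_id is the decoded value of the "id" parameter, or
    None if the parameter is missing.
    """
    parts = urlsplit(url)
    if not parts.path.startswith(script_prefix):
        raise ValueError("not a picpage URL: " + url)
    path_info = parts.path[len(script_prefix):]
    ids = parse_qs(parts.query).get("id", [])
    image_id = ids[0] if ids else None
    return path_info, image_id


url = ("http://dd-b.lighthunters.net/gal/picpage/gallery/macro-sept-2005"
       "?id=ddb%2020050915%20010-048")
print(split_picpage_url(url))
```

Note that the %20 escapes decode to spaces, so the image id the script
sees is "ddb 20050915 010-048".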

> > ...In particular I've made
> > sure that I generate a reasonable "last-modified:", and I've put in a
> > "cache-control: public" (I have no reason to believe google pays
> > attention to that, but a number of cache strategies don't cache
> > dynamic pages unless explicitly told to, so it's the right thing for
> > me to do on these pages).  And that the script responds correctly to
> > an if-modified-since header (and in fact I've got in the log a recent
> > example of googlebot receiving a 304 response on the URL of one of
> > these pages, so googlebot does send if-modified-since).
> > 
> > And google clearly indexes stuff in dynamic photo albums -- it's easy
> > to find examples.  I've looked at the headers that Gallery, for
> > example, generates, and I don't see anything "wrong" with mine, but
> > there are certainly differences.
> 
> I use Gallery too and Google appears to index it properly. The mistake I
> once made was that I changed album names (slugs). Even 4 months later, I
> still get many 404's as a result. I don't think there is something
> inherently bad with Gallery and its interaction with bots.

Yes, though it seems to me that Gallery *doesn't* encourage caching.  When I hit
the browser back button from viewing an image to get to the Gallery
thumbnail page, it's clearly regenerating and resending the page.
This makes the performance suck, and loads down the server.  

(I've run Gallery for some users on my server, and it's certainly a
pig from the server end!  And it's much better than my scripts for
some people for some uses -- notably for ignorant users.  I'm not
trying to please anybody but myself on the gallery-maintainer end of
my own scripts, luckily!)
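
The header handling described in the quoted text above (a sensible
Last-Modified, "Cache-Control: public", and a 304 reply to a matching
If-Modified-Since) can be sketched roughly like this. It is a Python
illustration under stated assumptions, not the picpage code:
conditional_response is a hypothetical function, and the caller is
assumed to pass the page's modification time as a timezone-aware UTC
datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at last_modified.

    last_modified must be a timezone-aware datetime in UTC.
    if_modified_since is the raw request header value, or None.
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        "Cache-Control": "public",
    }
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: treat as unconditional
        if since is not None and since.tzinfo is not None:
            if last_modified <= since:
                return 304, headers  # client copy is still current
    return 200, headers


page_time = datetime(2005, 9, 15, 12, 0, tzinfo=timezone.utc)
print(conditional_response(page_time, "Thu, 15 Sep 2005 12:00:00 GMT"))
```

An unparseable or missing If-Modified-Since falls through to a normal
200, which is the safe default for a crawler that sends odd headers.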

> > Has anybody worked through this issue and has a clear characterization
> > of what Google images (images; not the main google web search) will
> > and won't index?  And is willing to share?
> 
> 
> Here are a few observations:
> 
> -Google binds image descriptions (keywords), or vice versa, to the page title
> and headers, probably using some word density tests and finding captions.
> They also appear to be using the name (filename) of the image. I have no
> evidence to suggest that the alt attribute gets used much, if at all, in
> the assignment of keywords.

It's interesting that they appear to prefer my (few) inline galleries,
where half a dozen images are displayed on a single page, rather than
the picpage pages where each photo is alone on the page and both the
page title and a DIV directly above the image give a short
description.  In fact that's part of why I've been suspecting that the
problem is somehow a result of the dynamic serving of the pages,
rather than of the HTML content. 

> -Google Images prefers to return large and clear images to its users. It
> does not return thumbnails too often. Moreover, it might be able to detect
> the presence of smaller/larger versions and make a sensible choice,
> removing duplicates in the process.
> 
> 
> > There's some slight indications that my latest round of script changes
> > is, maybe, now something they're willing to index, but it's only been
> > up a few days so no results are in the actual index yet.
> 
> 
> Experimentation like that needs to account for many more factors. You cannot
> just isolate one. It's like a high-dimensional problem with so many
> parameters (PageRank, algorithm changes and so forth). At best you can fall
> under a self-imposed illusion.

Yes, I'm vividly aware of that.  I don't *think* I have any definite
beliefs about what they do that are wrong -- but that's mostly because
I have so few really *definite* beliefs about what they do :-).

I know the turnaround time on experiments is long, and that it's not
always easy to even *tell* when the result of an experiment is now
visible (obviously, if the experiment changes the search results,
that's easy to detect, but many unsuccessful things one tries *don't*
change the results). 

> As I said before, it may take quite a few months for the index to get
> modified. The past Google Images update was around 1-2 months ago. I can
> clearly remember announcing it in alt.internet.search-engines.

The information on how infrequently they update the image database in
particular is *immensely* useful, thanks very much!  I wasn't exactly
*assuming* they updated as frequently as the text search side, but I
didn't know it was that big a difference.

And the other observations at least help me believe I'm not off on a
completely blind alley or an insane quest; though they don't open up
any new areas for me to consider.

And that sounds like a newsgroup I need to cast an eye over, thanks!

> Hope it helps,

Yes, greatly appreciated.
-- 
David Dyer-Bennet, <mailto:dd-b@xxxxxxxx>, <http://www.dd-b.net/dd-b/>
RKBA: <http://noguns-nomoney.com/> <http://www.dd-b.net/carry/>
Pics: <http://dd-b.lighthunters.net/> <http://www.dd-b.net/dd-b/SnapshotAlbum/>
Dragaera/Steven Brust: <http://dragaera.info/>
