I recently moved my personal blog off of Blogger and — for some odd reason — reset the comment control on this blog to only allow members of the blog (me) to post comments. Sorry about that.
Rather than try to repost comments to my previous article (Google’s fictitious clicks are more myth than fact), let me just respond to a couple of people here.
First, Shuman Ghosemajumder wrote to share the following comment he was unable to post:
Michael, unfortunately you’ve misinterpreted the point a bit here. The back button itself does not cause a page reload on all web pages. A page reload is caused on many web sites which utilize dynamic pages or nocache directives (which includes many advertising and commercial sites). And code placed on an advertiser’s landing page expressly for the purpose of tracking visits to that page, usually has a nocache directive to prevent that code (e.g. an image or JavaScript tracker) from getting cached. So in those cases the back button almost always reloads the tracker, and thus generates another entry in the log.
If you’d like to try the actual experiment, do what you did above but with a dynamically generated page and then see if you can tell the difference between the original page load (the one that happens after the ad click) and subsequent reloads using any additional browser information. You can’t - and that’s where the actual tracking problems arise.
Now, for the record, I personally have been analyzing raw Web server logs for many years and, frankly, I wouldn’t trust a third-party analytical tool farther than I can throw it. I’m not set up right now to play with dynamic pages on a site that gets little enough traffic for me to easily capture single-user click data.
But in formal logic, as you may be well aware, every argument fails at the first flaw. I generally stop looking for problems in people’s presentations when I find a single flaw. Blame my math professors in college for tossing back reams of proofs because I miswrote one little thing. Still, I’ve looked at the fictitious clicks issue in more than one way and I find more than one flaw.
That said, there are many ways to tell a browser not to cache your page. Unfortunately, I’ve found that a lot of people have trouble telling browsers not to cache data. I doubt any one method is used by a majority of all PPC advertisers, but that’s just my gut feeling based on ignorance. I’ll come back to this point further on, but it would be nice to know how many “many advertising and commercial sites” really is (and what percentage of all advertising and commercial sites using PPC that is).
Also, I should note that Google’s Web Authoring Statistics from December 2005 suggest that attempts to control the cache from within the page are futile. The document also suggests that most sites should be doing this from the server, a scenario I feel is highly unlikely in the case of the majority of PPC advertisers (many of whom are so technically naive they hire other people to manage their PPC campaigns for them without granting full server access).
I tested the HTTP 1.0 http-equiv meta tag “pragma” with content=“no-cache” and found it has no effect on Internet Explorer. Subsequent research on the Web confirmed my test. IE basically ignores the meta tag because of the way it caches pages. This is a very common meta tag, in my experience. In HTTP 1.1, “Cache-Control” takes pragma’s place. Tried that. Didn’t work. (Note: The cache-control meta tag is only mentioned once in the Google Web Authoring Statistics document, whereas pragma received some discussion.)
In the discussions I read about techniques for using JavaScript to block caching, more than one person ran into problems. In fact, one person even pointed out that by the time a browser has begun to parse your JavaScript, it has already downloaded (and therefore cached) your Web page. That said, a number of people have tried to force page reloads by appending a “cache=” parameter to URLs and using a dynamic value (usually date and time).
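For anyone who hasn't seen the “cache=” trick, here is a minimal sketch of the idea in Python. The function name add_cache_buster, the default parameter name, and the timestamp format are my own assumptions for illustration; the people using this technique typically do it in JavaScript or in server-side templates.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url, now=None, param="cache"):
    """Return url with a time-based query parameter appended (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    parts = urlsplit(url)
    # Keep the existing query parameters, dropping any stale cache-buster value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    # A value that changes on every request makes each URL look new to the cache.
    query.append((param, now.strftime("%Y%m%d%H%M%S%f")))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Because every generated URL is different, the browser has no cached copy to fall back on. Note that this only helps for links the site itself generates; it does nothing for a landing page URL the browser already holds in its history when the user hits BACK.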
Apple tells you how to employ a complex set of headers to force Safari to avoid using its cache when you hit the BACK button. But I haven’t seen these headers in commercial sites I’ve been evaluating for other purposes over the past couple of years. So even if other browsers honor the method, who is using it?
One Web site suggested you can include a second HEAD section at the bottom of your page to force IE to reload. Sorry, ain’t seen much of that, either.
I did find a working example of a site that appears to bypass the cache. If you click here and then click on the “caching test page” link, then hit your BACK button and then hit your FORWARD ARROW button, you can bounce back and forth between the two tests and you’ll see that the reported fetch times change (implying that your browser cache — even in IE — is being bypassed).
That site uses PHP to force the page reload. I don’t have the time right now to fiddle with my server (which supports PHP) to set up the test.
The W3C says you can set your server to send an AGE value that will tell the client to reload the page. Hands up. How many of you PPC advertisers are doing this?
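To make the server-side approach concrete, here is a minimal sketch of a server that asks clients not to cache a page, using only Python's standard library. The handler name, the port, and the exact header values are assumptions of mine, not something taken from Apple's or the W3C's documentation, and whether a given browser honors them on BACK is exactly the open question.

```python
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

# Commonly combined response headers that ask browsers and proxies not to reuse a cached copy.
NO_CACHE_HEADERS = {
    "Cache-Control": "no-store, no-cache, must-revalidate, max-age=0",
    "Pragma": "no-cache",  # for HTTP/1.0 clients and proxies
    "Expires": "0",        # an invalid date, treated as already expired
}


class LandingPageHandler(BaseHTTPRequestHandler):
    """Hypothetical landing page that reports when it was fetched."""

    def do_GET(self):
        stamp = datetime.now(timezone.utc).isoformat()
        body = f"<html><body>Fetched at {stamp}</body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        for name, value in NO_CACHE_HEADERS.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the sketch locally; stop it with Ctrl+C."""
    HTTPServer(("127.0.0.1", port), LandingPageHandler).serve_forever()
```

Calling serve() and bouncing between this page and another one with BACK and FORWARD would show whether the reported fetch time changes in your browser. The point stands either way: this takes server access and deliberate effort.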
Now, we’re entirely into the realm of: I have no data on who uses what to control browser caching behavior on their landing pages.
That’s what it all comes down to. Google should actually be able to tell what people are doing because it fetches just about every page on the Web. But I suspect, from reading Shuman’s blog and the report they issued last August, that Google doesn’t allow one hand to know what the other is doing. It’s a big company and an even bigger World Wide Web. Let’s put a little context together here.
Shuman discussed Andy Beal’s click fraud comments in December 2006. In explaining Google’s position, Shuman wrote (after pointing out there is a difference between “invalid clicks” and “click fraud”):
… the quantity of invalid clicks which we detect as a result of reactive investigations is a “negligible proportion” of the total number of invalid clicks. Andy asked me if that percentage is less than 2%. I told him that I was not able to provide a bound, but yes, “negligible” certainly means less than 2% of invalid clicks. However, more significantly, this is quite a different thing than saying that our “click fraud rate” is less than 2%. When we mark clicks as invalid because of suspected malicious activity, the vast majority of the time we do so proactively, and none of those cases are included in the reactive figure in question.
Now let’s scoot forward to Why Third-Party Click Fraud Estimates Don’t Add Up. There, Shuman wrote:
We did an analysis of Click Forensics and other click fraud consultants back in August 2006 to see why their numbers were so inflated (see “How Fictitious Clicks Occur in Third-Party Click Fraud Audit Reports” on the Google AdWords Blog).
Let’s go back to the original document that caused all this fuss:
Fictitious clicks due to detection of page reloads as ad clicks. This is the counting of page reloads on an advertiser’s site as multiple clicks on the advertiser’s AdWords ad — which did not actually occur. Page reloads can occur for various reasons, including:
- user browses more deeply into the advertiser’s site, then hits back button, causing a potential reload of the original landing page
- user presses browser reload button on the landing page
- user opens a new window in Internet Explorer, causing a reload of the landing page
Fictitious clicks due to conflation across advertisers and ad networks. This is the counting of one advertiser’s traffic in another advertiser’s report, even if the advertisers span different ad networks.
These two problems are serious, and have resulted in significant inflation of click fraud estimates from each of the click fraud auditing firms we examined.
Okay, these are plausible alternatives to help explain what is going on. Furthermore, Google is really only responding to data provided by third-party auditing services in a handful of case study incidents. However, in Appendix A, they tell us that, after a user has clicked on an Ad on Google and been transferred to a dynamically generated landing page (which includes capture code from the third-party auditor), if the user clicks through on the CONTACT page and then uses the BACK button to return to the dynamically generated landing page, the referrer data is sent back to the server.
All neat and cool, Google. I’ve no desire to argue with you about whether referrer data is sent back on an uncached dynamically-generated page. But I have a question. How many of your advertisers actually use this setup, including third-party auditing and dynamically-generated non-caching landing pages that force browsers to send back referrer data?
Now let’s look at Why Third-Party Click Fraud Estimates Don’t Add Up - Part 2. Here Shuman writes:
…The analysis that we see from third-party auditing firms (including ClickForensics) seems to essentially rely on just one factor, which we call IP frequency. IP frequency is the number of times an IP address clicks within a certain time window. If it clicks too many times, it could be click fraud. On our end, this is a very simple rule which runs in an automated fashion, protecting Google advertisers 24/7. Third-party firms sometimes find the same suspicious IP frequency patterns that our systems do, and include them in their click fraud reports - leading advertisers to request refunds for clicks they were never charged for in the first place.
That appears to be a reporting issue on Google’s side. Of course, some people will just be overwhelmed by numbers and won’t look at the warnings and disclaimers. So Google cannot prevent all confusion and I won’t hold them to that standard.
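The IP frequency rule Shuman describes can be sketched in a few lines. This is only an illustration of the general idea: the function name, the 10-minute window, and the 5-click threshold are hypothetical values of mine, and nothing here reflects Google's actual (unpublished) rules.

```python
from collections import defaultdict, deque


def flag_frequent_ips(clicks, window_seconds=600, max_clicks=5):
    """Flag IPs with more than max_clicks clicks inside any sliding time window.

    clicks is an iterable of (timestamp_in_seconds, ip) tuples.
    The window and threshold defaults are made up for illustration.
    """
    recent = defaultdict(deque)
    flagged = set()
    for ts, ip in sorted(clicks):
        q = recent[ip]
        q.append(ts)
        # Drop clicks that have fallen out of the window ending at this click.
        while q and ts - q[0] > window_seconds:
            q.popleft()
        if len(q) > max_clicks:
            flagged.add(ip)
    return flagged
```

A rule this simple is easy for a third party to replicate from its own logs, which is presumably why auditors and Google end up flagging the same IP addresses.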
But we’re still not addressing the core concern. Here’s where I really start to have trouble with the official Google response:
But that is actually not even the most common problem with their analyses. What is far more common is that the reports we receive from them ask for refunds for clicks which do not even exist. This more serious problem comes from the issues we addressed in our August report on fictitious clicks. In that report, we demonstrated the limits of web log based analysis for any analytics purpose (including click fraud analysis) due to the way Internet Explorer, Firefox and other browsers work. Unfortunately, that was a very technical report, which was difficult for many readers to parse. I’ll try to provide a simpler explanation here.
Um, no, you did not demonstrate the limits of web log based analysis.
What the August 2006 Google report demonstrates is that some third-party auditing firms may not be analyzing their data captures very well. That’s hardly an indictment (much less a credible indictment) of “web log based analysis”.
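As a rough illustration of what more careful log analysis could look like, here is a sketch that collapses repeat hits on a landing page by the same visitor. Everything in it is an assumption on my part: the record fields, the “/?adwords” landing path, the 30-minute window, and the choice to identify a visitor by IP address plus user agent. It is not how any auditing firm is known to work.

```python
from datetime import datetime, timedelta


def collapse_reloads(hits, landing_path="/?adwords", window=timedelta(minutes=30)):
    """Keep landing-page hits, dropping repeats by the same visitor within the window.

    hits is an iterable of dicts with 'time' (datetime), 'ip', 'user_agent', 'path'.
    A visitor is approximated by (ip, user_agent), which is itself an assumption.
    """
    last_seen = {}
    kept = []
    for hit in sorted(hits, key=lambda h: h["time"]):
        if hit["path"] != landing_path:
            continue
        visitor = (hit["ip"], hit["user_agent"])
        previous = last_seen.get(visitor)
        # Each hit resets the clock, so a chain of reloads stays collapsed.
        last_seen[visitor] = hit["time"]
        if previous is not None and hit["time"] - previous <= window:
            continue  # treated as a reload or BACK-button revisit, not a new ad click
        kept.append(hit)
    return kept
```

The heuristic is crude. It would also swallow genuine repeat ad clicks made inside the window, so it trades overcounting for undercounting. But it shows that the reload problem is a matter of how the logs are interpreted, not a hard limit on what logs can tell you.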
Shuman also said:
Here’s the problem: web logs, whether generated by an advertiser, or by third-party code on an advertiser’s site, cannot directly track ad clicks….
But wait! Google captures click-throughs in its own server logs (as noted a little further on by Shuman). The message we’re getting here is that only Google can correctly interpret the data. It would be more accurate to say that only Google has access to Google’s raw data captures.
And there is more:
…Instead, they track visits to a special landing page URL on the advertiser’s site (e.g. http://example.com/?adwords ) as a proxy for how many ad clicks occurred. The assumption they’re relying upon is that each visit to that URL corresponds to a unique click, and vice versa. But in practice this is not the case. Once a user visits that page, they often browse through the site, navigating through sub pages, and then return to the original landing page by hitting the back button….
Um, how do you know?
How many advertiser Web server logs does Google analyze? I don’t mean “how many Web server logs from advertisers who ask for refunds”, I mean literally “how many advertiser Web server logs”? What percentage of your advertisers open up their logs to you to show you how their users behave?
…When the landing page is reloaded in the browser, it appears in the web log as though additional ad “clicks” are occurring….
Maybe. Maybe not. Dynamically generated landing pages don’t appear to be a requirement of the program.
…Google can count ad clicks reliably as a click on a Google ad will cause the web browser to contact Google and then we redirect it to the advertiser’s landing page. A reload of the advertiser’s landing does not contact Google again….
Absolutely. I’ll agree with you 100% on that. But how many merchants out there are actually using the design you’re stipulating?
Your explanations are only valid for those sites that meet the criteria of the case studies upon which you’re telling everyone they are only seeing “fictitious clicks” rather than fraudulent clicks.
In other words, your analysis is flawed because you’ve done nothing to address the concerns of the majority of advertisers. Furthermore, the August report says: “ClickFacts sometimes incorrectly identifies perfectly legitimate comparison shopping behavior (where a user visits the advertiser two or three times within a span of 10 to 20 minutes) as fraudulent.”
While it’s good to know that Google is aware of comparison shopping behavior, what interests me (and I realize this may be something Google won’t reveal) is how Google knows when someone is “comparison shopping” and when someone is just clicking on links.
I think it would be more informative, and ultimately more persuasive, for Google to talk openly about the limitations of its system. What would it take to manipulate the advertising, to implement true, legitimate click fraud? You are not going to allay advertiser concerns by attributing the bulk of perceived fraudulent clicks to fictitious clicks.
Let me move on to another comment by Linden:
Before performing the tests, did you make sure your browser is not caching the page? Many people, including myself, have their cache turned off. On Firefox there is also a memory cache that can be turned off (in addition to disk cache).
My browser here at home is caching.
I think what we can take away from this is that there is a lot of raw data that has not been released which, if made available for independent analysis, might help substantiate Google’s case. Google is in the best position to know how much click fraud activity there may be (and how much of that activity they may be stopping). But just because Google is in the best position to know doesn’t mean it actually does know.
PPC click fraud will probably become one of the greatest conspiracy theories of all time. Unlike with the Kennedy assassinations and UFO sightings, a lot of people have a great deal of money and future income riding on the integrity of PPC advertising. There is clearly insufficient third-party accountability to ensure that Google, Yahoo!, Windows Live, and other PPC advertising networks are doing an adequate job of protecting advertisers against malicious clicks.
I hope my comments are working again. This is really all I intend to say for now.
softplus 02.04.07 at 3:06 pm
One giant issue is that we have no way of knowing how Google measures and discounts fraudulent clicks.
Assume it is possible to track clicks exactly the way Google tracks them. It’s easy to recognize high IP frequencies. However, where is the threshold for click fraud? Where is the threshold for “same-user, different IP” click fraud? (Of course we’ll never know: it’s part of the secret protecting advertisers from publishers who do borderline click fraud.)
Assume we register 5000 clicks and recognize that 1000 are certainly fraudulent and a further 1000 might be. Google might bill us for 5000? 4000? 3500? 3000? clicks - where do we start to complain? How can we “prove” that those clicks are fraudulent? What can we do if we could catch the click-frauder in flagrante? Electroshocks would be nice, but probably hard to get through the popup-blocker…