Cre8asiteforums Internet Marketing
and Conversion Web Design


AOL (accidentally) releases search data


38 replies to this topic

#1 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 07 August 2006 - 06:20 PM

I just stumbled upon this posting at Slashdot (I'm sure there are lots of others :)).

Aside from the privacy issues that this causes, what could be learned from this data from a search-engine point of view?

It would be really interesting to take the queries apart, look at the "long tail" and how users use search engines (assuming that AOL users are average :)). I don't think any other engine has released similar data (or have they?).

John

#2 joedolson

joedolson

    Eyes Like Hawk Moderator

  • Technical Administrators
  • 2902 posts

Posted 07 August 2006 - 06:31 PM

Well, the release wasn't exactly an accident - more like a serious error of judgement, I'd say. There's a lot of uproar about the concerns for privacy - SearchEngineWatch has published a good article summarizing the brouhaha.

AOL Releases Search Data & Raises Privacy Concerns

#3 Nadir

Nadir

    Light Speed Member

  • Members
  • 976 posts

Posted 07 August 2006 - 06:55 PM

Aside from the privacy issues that this causes, what could be learned from this data from a search-engine point of view?


Well, even if the data only represents 1.5% of all searches made in May, as the AOL spokesman said, I do think that from a search engine marketer's point of view we can draw some interesting conclusions.

For example, you can see:
- how many sites on average a user visits when he's running a search;
- how he refines his keywords: does he add a verb, an adjective, a price?;
- how far he goes in the SERPs;
- which result he clicked on first, and why: was it the title, the description, the length of the title, etc. that made him click?
- and so on ;-)

Edited by Nadir, 07 August 2006 - 07:02 PM.


#4 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 07 August 2006 - 10:34 PM

I had received an email from AOL also about their wiki site this last week, and had exchanged a number of emails about the data in question and other information that was being made available on their site.

I don't believe that there was any intention at all to harm anyone, or infringe upon users' privacy.

Being able to see actual queries, and query refinements, the way people search for information, and whether they add words to a query, delete words, make spelling corrections, and so on, is some valuable information in that it provides a way to see how people actually search the web.
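As a rough illustration of what "query refinement" analysis could look like, here is a minimal Python sketch that compares two consecutive queries from the same user and reports which words were added and which were dropped. The function name and the sample queries are made up for illustration; spelling corrections would need fuzzier matching than this word-level comparison.

```python
from typing import List, Tuple


def refinement(previous: str, current: str) -> Tuple[List[str], List[str]]:
    """Return (words added, words removed) going from one query to the next."""
    before = previous.lower().split()
    after = current.lower().split()
    added = [word for word in after if word not in before]
    removed = [word for word in before if word not in after]
    return added, removed


# Hypothetical pair of consecutive queries by one user.
added, removed = refinement("digital camera", "cheap digital camera price")
print(added, removed)
```

Run over every pair of consecutive queries per user, the two lists give a first look at whether people tend to narrow a search by adding words or restart it by replacing them.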

I'm pretty sorry to see the trouble that this has stirred up. I wrote a blog post about this a few days ago, and was going to mention the page in my presentation at SES tomorrow, because it's very much on topic with what I was going to present. I think that I still will, but maybe as an example of the potential problems with sharing that type of information.

Nadir does a great job of highlighting some of the value of that type of data. I didn't have a chance to download the information because I was too busy preparing for this trip.

#5 projectphp

projectphp

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3935 posts

Posted 08 August 2006 - 12:11 AM

What scares me the most are the wife searches, and not because of the reasons people will think.

Just imagine that person was an author, researching a crime book he was writing about a husband who killed his wife with a staged accident, who used the internet to research the idea, and you have perfectly legitimate search behaviour made to look bad.

The scary thing about inferring is that you need to know something before you start. Imagine four drunk mates talking about the Zeitgeist of our age, terrorism: one claims you could find plans to make a bomb in 15 minutes on the internet, another says that isn't true, just a media beat-up, and they bet $100 to see if the guy can back it up. Seen that way, the invasion of privacy concerns are, IMHO, very real.

#6 ukdaz

ukdaz

    Light Speed Member

  • Members
  • 738 posts

Posted 08 August 2006 - 01:26 AM

Is there anywhere where this data is still available - I'd be interested in taking a look. Seems to have been taken down....???

Daz

#7 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 01:37 AM

Hi Daz - try here: they have lots of mirrors.

I was with a client a week or so ago, and we showed them how easy it was to "hack" their wifi encryption and get into their network. He was not impressed. When I showed him how I could track his users' web searches in real time, he asked me to fix it. Those search queries could give a competitor a real insight into the kinds of ideas that are being played with: the things so far back in the "brainstorming" phase that someone who knows about them ahead of time can beat them to the market.

This data is going to be fun to play with ;). As Nadir said, if you can watch the user refine his queries, you can learn quite a bit.

Is anyone else here taking it apart?

John

#8 Nadir

Nadir

    Light Speed Member

  • Members
  • 976 posts

Posted 08 August 2006 - 01:54 AM

Is anyone else here taking it apart?


I've been trying to open one of the 200 MB or so text files with my computer at work yesterday but it crashed. I'll try today with a colleague's computer that has more RAM, but if anyone knows how to deal with such big documents, please let me know.
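One way around the RAM problem is to avoid opening the file in an editor at all and read it one line at a time, so memory use stays flat no matter how big the file is. A minimal Python sketch, assuming the files are tab-separated with a header row; the column names and the sample rows below are assumptions for illustration only.

```python
import csv
import io
from typing import Iterator, TextIO


def iter_rows(handle: TextIO) -> Iterator[dict]:
    """Yield one row at a time from a tab-separated file with a header line."""
    # QUOTE_NONE: treat quote characters in queries as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Tiny in-memory stand-in for one of the big files (made-up rows,
# assumed column names).
sample = io.StringIO(
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\thow do i bake bread\t2006-03-01 10:00:00\t\t\n"
    "1\tbread recipe\t2006-03-01 10:02:00\t3\thttp://example.com\n"
)

# Only one row is held in memory at any moment.
row_count = sum(1 for _ in iter_rows(sample))
print(row_count)
```

For a real file, the StringIO would be replaced by something like `open(path, encoding="utf-8", errors="replace", newline="")`, where `path` is wherever the download was saved.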

#9 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 02:29 AM

Hi Nadir
I could transform it into a database ... but which format? (would it be interesting to have it online?)
John

#10 gabs

gabs

    Whirl Wind Member

  • Members
  • 93 posts

Posted 08 August 2006 - 05:30 AM

It's a day of writing some phat sp's for me...

#11 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 08 August 2006 - 05:34 AM

I'm starting to dig into it. It's a huge dataset (>2GB) of text files. Too many ideas right now, but a lot of linkbait can be generated from analyzing this data.

Oh yeah, a lot of insight too :D

Pierre

#12 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 05:55 AM

I've never played with a simple database that big before :).

By the way, if anyone is importing it into a database, there's a typo on line 1171533 in file 08 (clickurl="http://www.breedersclub.net"): the ItemRank should be "6". (really!!) :D

John

#13 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 08 August 2006 - 06:03 AM

As mentioned on another forum I'm looking at it. Have so far imported three files into Access, which means 10853830 lines of data. Have missed some data during the import, but not much.

#14 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 06:06 AM

Hi Mano
You're going to run out of room. What's the limit on Access DBs? (isn't it 1GB? 2?) My MS-SQL db is now around 6GB with lots of indexes on it. I hope I can verify that I have all the data in a little bit, and perhaps I can put it online for people to play with (once I get the fire extinguisher out of the other room, my little server says "ouch" already) :D

John

#15 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 08 August 2006 - 06:38 AM

I think you may be correct, but I just got errors when I tried to export to MS-SQL, so I gave it a shot. Access has now passed 2GB and is still working, although it's complaining a lot.

If I don't have any success with this, it would be interesting to get the MS-SQL db. I would like to see if SPSS is able to do something fun with the data.

Edit: Access will not chew more info now; it stopped at 2097088 KB.

Edited by Mano70, 08 August 2006 - 06:41 AM.


#16 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 08 August 2006 - 07:06 AM

Um... Guys, why don't you use grep to extract the relevant lines first, and then export the much smaller subset into Access or SQL or whatever? Much faster and more robust :D
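For anyone without grep at hand, the same pre-filtering idea can be sketched in a few lines of Python: keep only the lines that contain a search term, then import that much smaller subset. The function name and the sample lines are made up for illustration.

```python
import io
from typing import Iterable, Iterator


def grep_lines(lines: Iterable[str], needle: str) -> Iterator[str]:
    """Yield only the lines containing needle (case-insensitive), like grep -i."""
    needle = needle.lower()
    for line in lines:
        if needle in line.lower():
            yield line


# Made-up sample lines standing in for one of the large text files.
sample = io.StringIO(
    "1\tmyspace\t2006-03-01 10:00:00\t1\thttp://www.myspace.com\n"
    "2\tbread recipe\t2006-03-01 10:01:00\t\t\n"
    "3\twww.myspace.com\t2006-03-01 10:02:00\t\t\n"
)

matches = list(grep_lines(sample, "myspace"))
print(len(matches))
```

Because the lines are streamed, the filter never needs the whole file in memory; the matches could be written straight to a new, smaller text file for import.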

#17 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 08 August 2006 - 07:15 AM

We want it all! :D

To me, which keywords etc. isn't a big deal; I'm more interested in behaviour. Nadir points out a few examples.

Exporting to MS-SQL now seems to run smooth (7 files so far).

Edited by Mano70, 08 August 2006 - 08:06 AM.


#18 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 08:08 AM

I'm fully databased :-)

currently creating sub-tables for words and more ... and coding lots of mistakes with my mind on "non-work" (like this) :D

#19 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 08 August 2006 - 08:23 AM

Fully databased also. Creating a backup now, and parking this for a few hours.

I got 36 389 567 rows.

Edited by Mano70, 08 August 2006 - 08:26 AM.


#20 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 08 August 2006 - 10:02 AM

An interesting sideline. Sometime back, while researching some spam, I noticed that a lot of the links that Google was spitting out were from AOL's search engine cache! I was amazed that Google was doing that. They have been doing it not only with AOL, but also with other big sites that use Google technology for the web. I will look if I can find the link and if I do I will post it here. That was one of the methods the Moldanian Black hat used either willingly or unwillingly to propagate links!

I wonder if the words 't1ps2see' are in that database? :D Can anyone check?

Yannis

#21 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 12:39 PM

I also have 36 389 567 entries :). What would you guys (and gals :)) like to see?

#22 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 08 August 2006 - 06:05 PM

Someone has already made the database searchable here. Pretty quick.

It returns websites for keywords, but it's better than nothing for some curious heads out there.

I wonder if there are any DVDs with text data over there. Might be great if there was a text and MySQL version of that on a single DVD :)

#23 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 06:20 PM

Doesn't work for me :). But it's pretty quick at returning "0 results" :) :)

Mine's almost ready .... but I'm not sure if the server can take being put online, we'll see ... I already killed 2 of these databases, now the "final" one is being re-created, takes a bit of time. Playing with 36 million records makes you wonder how Google manages those billions (ok, their servers are probably a bit larger :D)

John

#24 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 09 August 2006 - 06:09 AM

I finally got my first try at it working :) - it's online for testing at http://oy-oy.eu/temp/aol/ (just the first 10% for the moment, the rest should come online during the day). Lots of cross-linking is possible, I'm adding more ideas as I find them (or hear from you all).

Let's see how the server holds up with the testing, I might have to move it to a higher-bandwidth one (anyone have IIS hosting with multi-GB MS-SQL databases for cheap? :))

John

#25 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 09 August 2006 - 07:58 AM

Good stuff, John!

I tried it a few times, but managed to get it to time out when I searched for a very popular keyword... myspace.

Pierre

#26 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 09 August 2006 - 08:31 AM

Ha ha! Yeah, I'm currently appending the rest of the data, it seems to be blocking most database queries (timing them out), but all the cached queries work ok (it caches just about everything to save db accesses). In about 30 minutes the next data blocks should be pushed through (50% of the data online), I'll make a break there, let you play with it and do the rest tonight.

Playing with it gives you some strange statistics ideas :). What's the most misspelled domain name (used for search)?

#27 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 09 August 2006 - 01:39 PM

I won't compete with you, softplus, with my db; I haven't had time to improve it (and don't think I will). The db is also too big for me to put online, as I don't have a multi-GB database. Btw, how big did your db get?

#28 3rdeye5

3rdeye5

    Gravity Master Member

  • Members
  • 154 posts

Posted 10 August 2006 - 01:10 AM

The New York Times has an interesting article about the AOL search data. They tracked down the person belonging to one of the numbers, and did an interview with her.

Ewald

#29 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 10 August 2006 - 03:55 AM

There are a few https queries in there with the full URL, including session-ids ... :( Some of the items in there are *really* interesting :) - now if only I could get the db running fast enough on my VW-server ;)

#30 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 14 August 2006 - 07:13 PM

Some interesting queries to play with (it's slow, be patient :)):

Search-engine "newbies", searching for "how do I ..." http://oy-oy.eu/temp...aspx?w=how do i (click on the User-ID to follow their other searches)
Similarly: http://oy-oy.eu/temp...spx?w=how can i (etc.)

Since we're tracking AOL ...
http://oy-oy.eu/temp...px?w=cancel aol

Isn't AOL search a branded Google?
http://oy-oy.eu/temp...=www.google.com (people searching for the full URL of Google ;))
or http://oy-oy.eu/temp...g...mp;p=0&s=5d (search for Google's URL, click on Yahoo, MSN, Lycos, Hotbot?)

John

#31 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 14 August 2006 - 09:04 PM

John

Thanks for the links and for the trouble to put all this information on the web. So far what I have noticed both from the database as well as from my own observations from my websites is that keyword searching is almost dead.

Experienced users will create a search phrase enclosing relevant terms and inexperienced users will just type a question the way they would ask it.

This makes on-page optimization in a way easier.

Is it possible for you to query the database and produce some statistics for this? For example:

- Total number of queries
- Queries with one keyword
- Queries with two keywords etc...

Yannis

#32 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 15 August 2006 - 02:14 AM

"etc"? ;)

Come on, out with what you really want :)

I noticed that a lot of "junk" queries have keywords more than once - should I try to give them "scores"? Perhaps categorize them into groups? I'd love to find out how many of those "question" queries are out there and where they usually go to: size / popularity of the sites and rank clicked on (first page or later?).

There are a lot of interesting things in there, but it's hard extracting them. I really respect Google + co's search engines a lot more after playing with this "small" subset of data :)

John

#33 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 15 August 2006 - 02:59 AM

Words	Count	Percent
1	12139510	36.8%
2	8301090	25.2%
3	5701125	17.3%
4	3304800	10.0%
5	1735512	5.3%
6	871473	2.6%
7	427384	1.3%
8	218475	0.7%
9	119087	0.4%
10	65997	0.2%
11	37426	0.1%
12	22413	0.1%
13	14053	0.0%
14	9148	0.0%
15	6467	0.0%
16	4509	0.0%
17	3157	0.0%
18	3429	0.0%
19	2011	0.0%
20	1540	0.0%
21	1847	0.0%
22	1050	0.0%
23	670	0.0%
24	567	0.0%
25	525	0.0%
26	441	0.0%
27	369	0.0%
28	369	0.0%
29	320	0.0%
30	294	0.0%
31	285	0.0%
32	233	0.0%
33	227	0.0%
34	198	0.0%
35	169	0.0%
36	154	0.0%
37	201	0.0%
38	221	0.0%
39	194	0.0%
40	155	0.0%
41	230	0.0%
42	384	0.0%
43	187	0.0%
44	235	0.0%
45	287	0.0%
46	187	0.0%
47	169	0.0%
48	219	0.0%
49	182	0.0%
50	258	0.0%
51	138	0.0%
52	82	0.0%
53	68	0.0%
54	59	0.0%
55	49	0.0%
56	38	0.0%
57	16	0.0%
58	4	0.0%
59	10	0.0%
60	10	0.0%
61	3	0.0%
62	6	0.0%
63	3	0.0%
64	1	0.0%
65	2	0.0%
66	4	0.0%
67	2	0.0%
68	1	0.0%
69	3	0.0%
70	3	0.0%
71	4	0.0%
72	2	0.0%
73	17	0.0%
75	4	0.0%
77	4	0.0%
78	4	0.0%
79	1	0.0%
80	2	0.0%
82	1	0.0%
83	5	0.0%
85	5	0.0%
88	1	0.0%
92	1	0.0%
95	2	0.0%
96	1	0.0%
97	1	0.0%
98	2	0.0%
99	1	0.0%
100	1	0.0%
103	1	0.0%
106	1	0.0%
107	1	0.0%
117	1	0.0%
127	1	0.0%

Total	32999999

The total doesn't appear right (I think I'm missing something :))... I'll see if I can figure the difference out later on. For what it's worth, many 1-word queries are just "-", which I assume means they didn't search (but then what could they find? or were they "censored"?)
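For anyone who wants to reproduce a table like the one above from the raw text, here is a minimal Python sketch. It assumes a "word" is anything separated by whitespace, which is only one possible definition and not necessarily the one used for the numbers above; the sample queries are made up. Note that a query consisting only of "-" counts as one word under this definition.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def word_count_histogram(queries: Iterable[str]) -> Counter:
    """Count how many queries have 1 word, 2 words, and so on."""
    return Counter(len(query.split()) for query in queries)


def with_percent(hist: Counter) -> List[Tuple[int, int, float]]:
    """Return (words, count, percent of all queries) sorted by word count."""
    total = sum(hist.values())
    return [
        (words, count, round(100 * count / total, 1))
        for words, count in sorted(hist.items())
    ]


# Made-up sample queries.
queries = ["google", "myspace", "how do i cancel aol", "cheap flights", "-"]
table = with_percent(word_count_histogram(queries))
for words, count, percent in table:
    print(f"{words}\t{count}\t{percent}%")
```

Summing the count column and comparing it with the number of rows imported is a quick way to spot whether some rows were skipped or counted twice.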

John

#34 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 18 August 2006 - 09:14 AM

I've added:
Top 100 words + unique queries
Top 100 words + unique queries (by month)
Top 200 domains clicked
Top 100 domains clicked by month

It's faster now, but still pretty slow when queries aren't cached.
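A "top domains clicked" list can be sketched in Python along these lines. This is only an illustration, not the code behind the pages above: it assumes the clicked URLs include the scheme (http://...), treats the full hostname as the "domain", and uses made-up sample data.

```python
from collections import Counter
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def top_domains(click_urls: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Return the n most frequently clicked hostnames with their click counts."""
    counts: Counter = Counter()
    for url in click_urls:
        if not url:
            continue  # row without a click
        host = urlsplit(url).hostname  # None if the URL has no scheme/host
        if host:
            counts[host] += 1
    return counts.most_common(n)


# Made-up clicked URLs; empty strings stand for searches without a click.
urls = [
    "http://www.google.com",
    "http://www.google.com",
    "http://www.myspace.com",
    "",
    "http://www.google.com",
    "http://www.myspace.com",
    "http://example.com",
]

top = top_domains(urls, 2)
print(top)
```

Grouping by month would only need the month taken from the timestamp as part of the counter key.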

John

#35 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 18 August 2006 - 09:24 AM

Hmm, domain #168: http://vvvvvv. in. ua (6081) (link goes to AOL db lookup) -> web-spam. I wonder which other spam domains are in the top 200... and if the proportions of clicks are about right for the full search collection (since this is only 1.5% of the total search volume). One would think that if spam sites make it to the top 200 domains then Google/AOL would change something in the algorithm to push them out (or manually remove / penalize them). Interesting...

John

Edited by softplus, 18 August 2006 - 09:26 AM.


#36 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 18 August 2006 - 11:19 AM

John

Thanks for posting those statistics. Is it possible that the one-word stats were included again when you counted the two-word stats, etc.? That could explain the difference in the totals!

The stats you posted are very interesting. For example, the overuse of 'I' in queries. This points to some on-page optimization using FAQs, e.g. 'I want to....' etc.

Also very interesting is the fact that users are too lazy to type in the full URLs and are using the search engines as entry points (i.e. all those typing google!).

These stats need a lot of digesting though!


Yannis

PS Did you fiddle the database for 'John' to come in the top 100? :P

#37 earlpearl

earlpearl

    Hall of Fame

  • 1000 Post Club
  • 1357 posts

Posted 18 August 2006 - 11:43 AM

Softplus that is incredible!

Only the best keyword research information I've ever seen.

thanks. :applause: :applause: :applause:

#38 swainzy

swainzy

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3316 posts

Posted 22 August 2006 - 12:37 AM

AOL fires tech chief after data 'screw-up'

http://msnbc.msn.com/id/14457607/

<Added>
This incredible lack of judgement on AOL's part is starting to have serious repercussions for them.

http://www.forbes.co...facescan01.html

"Not only did the episode hurt AOL's brand image, but there was a possibility that its already-declining subscriber base would be hesitant to share private information, deterring advertisers from using the service."


OUCH!

Edited by swainzy, 22 August 2006 - 08:37 PM.


#39 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 29 September 2006 - 05:47 PM

An interesting listing of queries ... Here's the start of the list of queries that were used often but never had any results clicked. Perhaps a sign of a bad query, or perhaps a sign that there's room in the results for a new page or two :) :)

Query	#-queries	#-clicks
adserver.ign.com	18070	0
mycl.cravelyrics.com	16416	0
recent	6972	0
evo.qksrv.net	4959	0
ads.admonitor.net	3407	0
b.bluetime.com	3135	0
ads.t-kom.com	2806	0
omhttp	1975	0
www.thebestse.com	1613	0
slirsredirect.search.aol.com	1500	0
sp.trafficmarketplace.com	1490	0
web.easysoft.local	1468	0
ads.specificpop.com	1334	0
trust4free.ws	1196	0
sp12.smartpages.com	1191	0
adserver.gorillanation.com	1143	0
www.goo	1135	0
www.trust4free.ws	1134	0
www.mys	1098	0
htpp	1008	0
kostya.reviewcar.org	928	0
ypn-js.overture.com	908	0
jennifre894	903	0
myso	865	0
www.gamesonlyexchange.com	812	0
minirecent	804	0
www.spyk.com	789	0
www.a	785	0
ads.netdok.net	721	0
d.attune.com	689	0
9.0le	641	0
bestcounter.biz	637	0
www.go	634	0
dfloyd	618	0
jdsbanners.com	616	0
onlinenow.myspace.com	615	0
bb68827	580	0
servedby.valuead.com	576	0
videosearch.launch.start	573	0
www.load2load.net	534	0
y.com	524	0
slingoexpress	522	0
fucktheall.com	516	0
es	510	0
remote invocationtype pcsearch.top	510	0
myp	500	0
'	489	0
myfriends.myspace.com	487	0
www.bb	481	0
www.mysp	476	0
screename	471	0
ratingcounter.com	468	0
...oko.org.uk	463	0
bob hoskins	454	0
iframetraff.biz	447	0
go to http	442	0
..	440	0
international chats	421	0
mailer.homescan.com	416	0
e.comhttp	405	0
sales reps for china products	404	0
fhg.videopass.com	398	0
wev	388	0
www.f	385	0
staging.ecom.sears.com	384	0
gool	382	0
traffweb.biz	381	0
www.c	370	0
ys	370	0
chat.youngatheart.com	369	0
ww.my	368	0
craglists new york	366	0
www.gog	365	0
ad.ad-flow.com	364	0
myd	363	0
interracial matches.com	362	0
search.ebay.com	356	0
datingbanner.com	354	0
bft.fpcgalleries.com	350	0
profile.swapfinder.com	349	0
www.good	346	0
charm.fading-petals.net	340	0
banners.1	337	0
mailbo	334	0
yhttp	333	0
b.azjmp.com	331	0
dianne	329	0
blockedreferrer	325	0
aolecards	319	0
www.myso	319	0
l.com	317	0
www.loadcash.biz	317	0
ad	310	0
ghttp	310	0
pictures of sonic and shadow	308	0
nax nychno.com	303	0
gfrancis	301	0
dorki.ya-hoo.biz	300	0
www.myp	298	0
edithcobb	295	0

Who would have thought? I'll add the full list in a clickable / pagable form once I get to it.
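A list like the one above could be built roughly as follows. This is a simplified Python sketch, not the query actually used: it assumes each row is a (query, clicked URL) pair with an empty URL meaning no click, counts every row as one search, and uses made-up sample rows and a made-up threshold.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def never_clicked(rows: Iterable[Tuple[str, str]], min_count: int) -> List[Tuple[str, int]]:
    """Return (query, times searched) for frequent queries with zero clicks."""
    searches: Counter = Counter()
    clicks: Counter = Counter()
    for query, click_url in rows:
        searches[query] += 1
        if click_url:
            clicks[query] += 1
    result = [
        (query, count)
        for query, count in searches.items()
        if count >= min_count and clicks[query] == 0
    ]
    # Most frequent first, ties broken alphabetically.
    result.sort(key=lambda item: (-item[1], item[0]))
    return result


# Made-up rows: (query, clicked URL or empty string).
rows = [
    ("www.goo", ""),
    ("www.goo", ""),
    ("htpp", ""),
    ("bread recipe", "http://example.com"),
    ("bread recipe", ""),
]

unclicked = never_clicked(rows, 2)
print(unclicked)
```

Many of the entries in such a list will be typos and truncated URLs, so some manual filtering is still needed before reading them as gaps in the results.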

John


