Leading Community for Usability, Search Engine Marketing,
Social Networking, Site Planning & Web Site Development, Since 1998


AOL (accidentally) releases search data


38 replies to this topic

#1 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 07 August 2006 - 06:20 PM

I just stumbled upon this posting at Slashdot (I'm sure there are lots of others :)).

Aside from the privacy issues that this causes, what could be learned from this data from a search-engine point of view?

It would be really interesting to take the queries apart, look at the "long tail" and how users use search engines (assuming that AOL users are average :)). I don't think any other engine has released similar data (or have they?).

John

#2 joedolson

joedolson

    Eyes Like Hawk Moderator

  • Technical Administrators
  • 2869 posts
  • Twitter:http://twitter.com/joedolson
  • Facebook:http://facebook.com/joedolson

Posted 07 August 2006 - 06:31 PM

Well, the release wasn't exactly an accident - more like a serious error of judgement, I'd say. There's a lot of uproar about the concerns for privacy - SearchEngineWatch has published a good article summarizing the brouhaha.

AOL Releases Search Data & Raises Privacy Concerns

#3 Nadir

Nadir

    Light Speed Member

  • Members
  • 976 posts

Posted 07 August 2006 - 06:55 PM

Aside from the privacy issues that this causes, what could be learned from this data from a search-engine point of view?


Well, even if the data only represents 1.5% of all searches made in May, as an AOL spokesman said, I do think that from a search engine marketer's point of view we can draw some interesting conclusions.

For example, you can see:
- how many sites on average a user visits when he's running a search;
- how he refines his keywords: does he add a verb, an adjective, a price?;
- how far he's going in the SERPs;
- which result he clicked on first, and why: was it the title, the description, the length of the title, etc. that made him click?
- and so on ;-)
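A minimal sketch of how the first and third points above might be measured. It assumes the log is tab-separated with the columns AnonID, Query, QueryTime, ItemRank and ClickURL (that layout is an assumption here, and the sample rows are made up), and it treats each distinct user/query/time combination as one search.

```python
import csv
import io

# Made-up sample in the assumed layout: AnonID, Query, QueryTime, ItemRank, ClickURL.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\tcheap flights\t2006-05-01 10:00:00\t1\thttp://a.example\n"
    "1\tcheap flights\t2006-05-01 10:00:00\t3\thttp://b.example\n"
    "1\tcheap flights paris\t2006-05-01 10:02:00\t\t\n"
    "2\tdog breeders\t2006-05-02 09:00:00\t6\thttp://c.example\n"
)


def click_stats(lines):
    """Return (average clicks per search, average rank of clicked results)."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    clicks_per_search = {}
    ranks = []
    for row in reader:
        key = (row["AnonID"], row["Query"], row["QueryTime"])
        # Register the search even if the user clicked nothing.
        clicks_per_search.setdefault(key, 0)
        if row["ItemRank"]:
            clicks_per_search[key] += 1
            ranks.append(int(row["ItemRank"]))
    if not clicks_per_search:
        return None, None
    avg_clicks = sum(clicks_per_search.values()) / len(clicks_per_search)
    avg_rank = sum(ranks) / len(ranks) if ranks else None
    return avg_clicks, avg_rank


avg_clicks, avg_rank = click_stats(io.StringIO(SAMPLE))
print(avg_clicks, avg_rank)
```

The "why did he click" question is not covered: titles and descriptions of the results are not part of the assumed columns, so they would have to come from somewhere else.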

Edited by Nadir, 07 August 2006 - 07:02 PM.


#4 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 07 August 2006 - 10:34 PM

I had received an email from AOL also about their wiki site this last week, and had exchanged a number of emails about the data in question and other information that was being made available on their site.

I don't believe that there was any intention at all to harm anyone, or infringe upon users' privacy.

Being able to see actual queries and query refinements, the way people search for information, and whether they add words to a query, delete words, make spelling corrections, and so on, is valuable information in that it provides a way to see how people actually search the web.

I'm pretty sorry to see the trouble that this has stirred up. I wrote a blog post about this a few days ago, and was going to mention the page in my presentation at SES tomorrow, because it's very much on topic with what I was going to present. I think that I still will, but maybe as an example of the potential problems with sharing that type of information.

Nadir does a great job of highlighting some of the value of that type of data. I didn't have a chance to download the information because I was too busy preparing for this trip.
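As a rough illustration of the refinement types mentioned above (adding words, deleting words, corrections), here is a small sketch that compares two consecutive queries from the same user by their word sets. The function name and the labels are hypothetical, and the approach is deliberately crude: it cannot tell a spelling correction from any other word swap.

```python
def classify_refinement(previous, current):
    """Label how `current` differs from `previous`, comparing word sets."""
    prev_words = set(previous.lower().split())
    curr_words = set(current.lower().split())
    if prev_words == curr_words:
        return "repeat"
    if prev_words < curr_words:
        return "added words"
    if curr_words < prev_words:
        return "deleted words"
    if prev_words & curr_words:
        # Some overlap: could be a spelling correction or a swapped term.
        return "changed words"
    return "new query"


print(classify_refinement("digital camera", "cheap digital camera"))
```

Telling real spelling corrections apart would need something like an edit-distance check on the words that changed, which is left out here.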

#5 projectphp

projectphp

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3934 posts
  • Twitter:motherwell
  • Facebook:http://www.facebook.com/mmotherwell

Posted 08 August 2006 - 12:11 AM

What scares me the most are the wife searches, and not because of the reasons people will think.

Just imagine that person was an author, researching a crime book he was writing about a husband who killed his wife with a staged accident, who used the internet to research the idea, and you have perfectly legitimate search behaviour made to look bad.

The scary thing about inferring is that you need to know something before you start. Imagine four drunk mates talking about the Zeitgeist of our age, terrorism: one claims you could find plans to make a bomb in 15 minutes on the internet, another says it isn't true, just a media beat-up, and they bet $100 to see if the guy can back it up. The invasion-of-privacy concerns are, IMHO, very real.

#6 ukdaz

ukdaz

    Light Speed Member

  • Members
  • 738 posts

Posted 08 August 2006 - 01:26 AM

Is there anywhere this data is still available? I'd be interested in taking a look. It seems to have been taken down...

Daz

#7 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 01:37 AM

Hi Daz - try here: they have lots of mirrors.

I was with a client a week or so ago, and we showed them how easy it was to "hack" their wifi encryption and get into their network. He was not impressed. When I showed him how I could track his users' web searches in real time, he asked me to fix it. Those search queries could give a competitor real insight into the kinds of ideas that are being played with: the things so far back in the "brainstorming" phase that someone who knows about them ahead of time can beat them to the market.

This data is going to be fun to play with ;). As Nadir said, if you can watch the user refine his queries, you can learn quite a bit.

Is anyone else here taking it apart?

John

#8 Nadir

Nadir

    Light Speed Member

  • Members
  • 976 posts

Posted 08 August 2006 - 01:54 AM

Is anyone else here taking it apart?


I tried to open one of the 200 MB or so text files on my computer at work yesterday, but it crashed. I'll try today with a colleague's computer that has more RAM, but if anyone knows how to deal with such big documents, please let me know.
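One common way to deal with files this size is to stream them line by line instead of opening the whole thing in an editor. A minimal sketch, assuming a tab-separated file with a header row and the query text in the second column (both assumptions); the tiny temporary file only stands in for one of the real text files.

```python
import os
import tempfile
from collections import Counter


def count_queries(path):
    """Stream a tab-separated log line by line and count each query string."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        next(handle, None)  # skip the assumed header row
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                counts[fields[1]] += 1
    return counts


# Tiny made-up file standing in for one of the large text files.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n")
    tmp.write("1\tweather\t2006-05-01 10:00:00\t\t\n")
    tmp.write("2\tweather\t2006-05-01 11:00:00\t1\thttp://a.example\n")
    tmp.write("2\tmaps\t2006-05-01 11:05:00\t\t\n")
    sample_path = tmp.name

top_queries = count_queries(sample_path).most_common(2)
os.remove(sample_path)
print(top_queries)
```

Only one line is held in memory at a time, although the counter itself still grows with the number of distinct queries.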

#9 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 02:29 AM

Hi Nadir
I could transform it into a database ... but which format? (would it be interesting to have it online?)
John
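One possible answer to the format question, sketched under stated assumptions: SQLite through Python's standard sqlite3 module. SQLite is only picked here for illustration, and the table name, column names and sample rows are hypothetical; the five-column tab-separated layout is assumed as well.

```python
import sqlite3

# Made-up rows in the assumed layout, without a header line.
SAMPLE_LINES = [
    "1\tcheap flights\t2006-05-01 10:00:00\t1\thttp://a.example\n",
    "1\tcheap flights paris\t2006-05-01 10:02:00\t\t\n",
    "2\tdog breeders\t2006-05-02 09:00:00\t6\thttp://c.example\n",
]


def parse(lines):
    """Yield one tuple per well-formed line; empty rank/URL become NULL."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip malformed lines instead of failing the import
        anon_id, query, query_time, item_rank, click_url = fields
        yield (
            int(anon_id),
            query,
            query_time,
            int(item_rank) if item_rank else None,
            click_url or None,
        )


def load_rows(conn, lines):
    """Create the table if needed and insert all rows in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS searches ("
        "anon_id INTEGER, query TEXT, query_time TEXT, "
        "item_rank INTEGER, click_url TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO searches VALUES (?, ?, ?, ?, ?)", parse(lines))
    conn.execute("CREATE INDEX IF NOT EXISTS idx_searches_query ON searches (query)")


conn = sqlite3.connect(":memory:")
load_rows(conn, SAMPLE_LINES)
total = conn.execute("SELECT COUNT(*) FROM searches").fetchone()[0]
clicked = conn.execute(
    "SELECT COUNT(*) FROM searches WHERE item_rank IS NOT NULL"
).fetchone()[0]
print(total, clicked)
```

For a real import, a file path would replace ":memory:" and the open file handle would be passed in place of the sample list.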

#10 gabs

gabs

    Whirl Wind Member

  • Members
  • 69 posts

Posted 08 August 2006 - 05:30 AM

It's a day of writing some phat sp's for me...

#11 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 08 August 2006 - 05:34 AM

I'm starting to dig into it. It's a huge dataset (>2GB) of text files. Too many ideas right now, but a lot of linkbait can be generated from analyzing this data.

Oh yeah, a lot of insight too :D

Pierre

#12 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 05:55 AM

I've never played with a simple database that big before :).

By the way, if anyone is importing it into a database, there's a typo on line 1171533 in file 08 (clickurl="http://www.breedersclub.net"): the ItemRank should be "6". (really!!) :D

John

#13 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 08 August 2006 - 06:03 AM

As mentioned on another forum I'm looking at it. Have so far imported three files into Access, which means 10853830 lines of data. Have missed some data during the import, but not much.

#14 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 06:06 AM

Hi Mano
You're going to run out of room. What's the limit on Access DBs? (Isn't it 1GB? 2?) My MS-SQL db is now around 6GB with lots of indexes on it. I hope I can verify that I have all the data in a little bit, and perhaps I can put it online for people to play with (once I get the fire extinguisher out of the other room, my little server says "ouch" already) :D

John

#15 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 08 August 2006 - 06:38 AM

I think you may be correct, but I just got errors when I tried to export to MS-SQL, so I gave it a shot. Access has now passed 2GB and is still working, although it's complaining a lot.

If I don't have any success with this, it would be interesting to get the MS-SQL db. I would like to see whether SPSS can do something fun with the data.

Edit: Access won't chew any more data now; it stopped at 2097088Kb.

Edited by Mano70, 08 August 2006 - 06:41 AM.


#16 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 08 August 2006 - 07:06 AM

Um... Guys, why don't you use grep to extract the relevant lines first, and then export the much smaller subset into Access or SQL or whatever? Much faster and more robust :D
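The same filter-first idea, sketched in Python rather than with grep itself: keep only the lines that match a pattern, then import that smaller subset. The sample lines and the pattern are made up, and the case-insensitive match mirrors what `grep -i` would do.

```python
import re


def matching_lines(lines, pattern):
    """Yield only the lines that match `pattern`, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    for line in lines:
        if regex.search(line):
            yield line


# Made-up lines in the assumed tab-separated layout.
SAMPLE_LINES = [
    "1\tdog breeders\t2006-05-02 09:00:00\t6\thttp://c.example\n",
    "2\tweather\t2006-05-01 10:00:00\t\t\n",
    "3\tCat Breeders near me\t2006-05-03 12:00:00\t\t\n",
]

subset = list(matching_lines(SAMPLE_LINES, r"breeders"))
print(len(subset))
```

Because it works on any iterable of lines, an open file handle can be passed in directly, so the full file never has to sit in memory.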

#17 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 08 August 2006 - 07:15 AM

We want it all! :D

To me, which keywords etc. isn't a big deal; I'm more interested in behaviour. Nadir points out a few examples.

Exporting to MS-SQL now seems to run smoothly (7 files so far).

Edited by Mano70, 08 August 2006 - 08:06 AM.


#18 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 August 2006 - 08:08 AM

I'm fully databased :-)

currently creating sub-tables for words and more ... and coding lots of mistakes with my mind on "non-work" (like this) :D

#19 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 08 August 2006 - 08:23 AM

Fully databased as well. Creating a backup now, and parking this for a few hours.

I got 36 389 567 rows.

Edited by Mano70, 08 August 2006 - 08:26 AM.


#20 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 08 August 2006 - 10:02 AM

An interesting sideline. Some time back, while researching some spam, I noticed that a lot of the links that Google was spitting out were from AOL's search engine cache! I was amazed that Google was doing that. They have been doing it not only with AOL, but also with other big sites that use Google technology for the web. I will see if I can find the link, and if I do I will post it here. That was one of the methods the Moldanian Black hat used, either willingly or unwillingly, to propagate links!

I wonder if the words 't1ps2see' are in that database? :D Can anyone check?

Yannis



