WarcraftRealms.com
Parsing The Armory (Solution)

 
Ronnove
Joined: 25 Dec 2009
Posts: 4
WR Updates: 0

Posted: Fri Dec 25, 2009 8:02 pm    Post subject: Parsing The Armory (Solution)

Hi,

I'm a developer, and I wrote a perl/bash script that connects to the official WoW armory and downloads XML files with details about each guild and the players in it. A full run of the guild checker script currently takes about 24 hours. This is how it works:

There is one flat file which has definitions of what to search for. I generated a list of guilds on the American servers:

Code:
[ronnie@ronalddove01 advanced]$ cat hitlist | wc -l
24543
[ronnie@ronalddove01 advanced]$


The list looks like this: <realm> <guild>, with one entry per line.
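A minimal sketch of reading such a list, in Python. The function name is made up for illustration, and it assumes the realm is everything before the first space on a line, which would break for realm names that contain spaces. The actual script may handle this differently.

```python
def parse_hitlist(lines):
    """Turn '<realm> <guild>' lines into unique (realm, guild) pairs, keeping order.

    Assumption: the realm is the text before the first whitespace on the line.
    """
    seen = set()
    entries = []
    for raw in lines:
        parts = raw.strip().split(None, 1)  # split once, on the first whitespace run
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        entry = (parts[0], parts[1].strip())
        if entry not in seen:
            seen.add(entry)
            entries.append(entry)
    return entries
```

With a real file this could be called as `parse_hitlist(open("hitlist"))`, since a file object yields its lines one at a time.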

Phase 1: The Python script crawls through the armory looking for every guild and downloads the member list of each guild. It stores a database of players based on what class they play. This process takes 24 hours to download every player from the 24k guild list. For example, this generated a list of about 100k level 80 female night elf warriors for the entire guild list.

Phase 2: The script downloads every player from the newly generated lists into a directory. These files can be used for parsing statistical details such as reputation. This process takes 72 hours or more, depending on how many downloads are required.

These processes could be run once a month for a better idea of the statistical details.

New technology bot: This will download the entire armory by following links from player to player, roaming the whole WoW armory until it has every player's details.
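The player-to-player roaming could be sketched as a breadth-first walk with a "seen" set, in Python. This is only an illustration: `get_linked_players` is a hypothetical callback standing in for whatever fetches a player's page and returns the other players linked from it, and `max_players` is an assumed safety cap.

```python
from collections import deque


def crawl(seeds, get_linked_players, max_players=None):
    """Breadth-first walk over players; returns names in the order visited."""
    queue = deque(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(queue)
    visited = []
    while queue:
        if max_players is not None and len(visited) >= max_players:
            break  # stop early once the cap is reached
        player = queue.popleft()
        visited.append(player)
        for other in get_linked_players(player):
            if other not in seen:  # never queue the same player twice
                seen.add(other)
                queue.append(other)
    return visited
```

The "seen" set is what keeps the bot from looping forever when two players link to each other.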

The only problem I see with the bot is that Blizzard might ban it for downloading this much data.

Code:
[ronnie@ronalddove01 archive]$ ls | wc -l
12258
[ronnie@ronalddove01 archive]$ du -h
2.3G    .
[ronnie@ronalddove01 archive]$


As you can see, it downloaded 2.3 GB of guild data, and that is just a guild parse example from a beta test run.

Ronald Dove
http://www.dovestech.com

Rollie
Site Admin
Joined: 28 Nov 2004
Posts: 5374
Location: Austin, TX
WR Updates: 480,131

Posted: Sun Dec 27, 2009 12:12 pm

Are you offering to share your work?

I had done some armory crawling, but ran into issues with pulling data. If you were able to crawl the entire armory in 24 hours, then this roadblock is no longer in place.

Hybuir
Gear Dependent Squirrel
Joined: 06 Sep 2005
Posts: 1538
Location: Austin, TX
WR Updates: 2,614,012

Posted: Mon Dec 28, 2009 8:29 am

prelude to the montage?

Ronnove

Posted: Tue Dec 29, 2009 7:28 pm

Hi,

I want to share my scripts with you in the near future. I am working on making them more user-friendly and more efficient. First I want to explain the limitations, which I am sure you're aware of. The 24 hour figure was only for a specific group of characters (level 80 Warriors only, within the full warcraftrealms.com guild list). However, I am working towards a bot that crawls through the site as long as it has entry points such as nicknames. This is an outline of the process:

1.) The bot reads a flat file line by line. Each line has a unique nickname, which must be populated by someone or something. It could just be a generic dictionary list to give the bot some "entry points" into the armory crawl. ** Future versions would use a MySQL database **

2.) The bot does a wget on the search result for each nickname, parses out the guild names, and creates a new flat file with just guild names and realm names, one entry per line. It first makes sure there's no duplicate entry already. It then downloads each nickname that had a match in the search results. In its current state this includes a reputation page and a profile page. I guess we could have it pull anything about the character by sorting through the XML/HTML and putting it into a database.

3.) The bot processes the generated guild flat file and fetches each guild's member list to download every member of the guild.

4.) Future versions of the bot will branch out to PvP groups to grab those characters.

5.) There is a flag that tells the bot to do a full "refresh", which will basically check the same chars for updates. There is also a flag to skip chars it has already downloaded. However, it will always refresh guild lists for updates.
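The flag behaviour in step 5 could look roughly like this in Python. It is a sketch under assumptions: the function and parameter names are invented, and the way the two flags interact (a full refresh overriding the skip flag, everything being downloaded when neither is set) is a guess rather than something the steps above spell out.

```python
def select_targets(guilds, characters, downloaded, refresh=False, skip_downloaded=False):
    """Decide what to fetch on this run.

    Assumptions: guild lists are always refreshed; a full refresh overrides
    the skip flag; with neither flag set, every character is fetched.
    """
    targets = [("guild", guild) for guild in guilds]  # guild lists always refresh
    for char in characters:
        if skip_downloaded and not refresh and char in downloaded:
            continue  # already on disk and no refresh was asked for
        targets.append(("character", char))
    return targets
```

Keeping the decision in one small function means the download loop itself does not need to know about the flags.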

Basically, the limitation comes from the entry point file, which has a list of generic names: it's possible the bot could miss a character completely. But it should still be able to collect a very big chunk of the data. The other issue is taking the data from the XML and storing it in a database. It's all in XML, so it's very possible to just filter out the details of a character.
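Filtering character details out of the XML might look like the Python sketch below, using the standard library's `xml.etree.ElementTree`. The sample document, the `character` element, and its attribute names are assumptions made for illustration, not a confirmed description of the armory's format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real armory XML layout may differ.
SAMPLE = """<page>
  <characterInfo>
    <character name="Examplechar" realm="Examplerealm" class="Warrior"
               level="80" guildName="Example Guild"/>
  </characterInfo>
</page>"""

# Assumed attribute names to keep for each character.
FIELDS = ("name", "realm", "class", "level", "guildName")


def extract_characters(xml_text):
    """Return one dict per <character> element, keeping only FIELDS."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("character"):
        # node.get() returns None when an attribute is missing
        rows.append({field: node.get(field) for field in FIELDS})
    return rows
```

Each returned dict maps straight onto one database row, so the insert step can stay separate from the parsing.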

It's a lot of work, but I think it's worth it if Blizzard is going to allow us to piggyback off its armory database. I sent an email to Blizzard asking whether I can run a bot that crawls through the armory site. Google does it to regular websites; I don't see why we can't do it here.


Last edited by Ronnove on Tue Dec 29, 2009 7:40 pm; edited 1 time in total

Ronnove

Posted: Tue Dec 29, 2009 7:35 pm

We could even have other entry points, like the forums. Maybe the bot refreshes the forums and parses nicknames from them? If a name is not a duplicate, the bot adds it to the "entry point" nickname list where it begins its crawl.

Code:

[root@ronalddove test]# wget "http://forums.worldofwarcraft.com/board.html?forumId=11112&sid=1"
[root@ronalddove test]# ls
board.html?forumId=11112&sid=1
[root@ronalddove test]# rm board.html\?forumId\=11112\&sid\=1
rm: remove regular file `board.html?forumId=11112&sid=1'? y
[root@ronalddove test]# wget "http://forums.worldofwarcraft.com/board.html?forumId=11112&sid=1"
--2009-12-29 20:39:26--  http://forums.worldofwarcraft.com/board.html?forumId=11112&sid=1
Resolving forums.worldofwarcraft.com... 12.129.242.24
Connecting to forums.worldofwarcraft.com|12.129.242.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `board.html?forumId=11112&sid=1'

    [  <=>                                                                               ] 110,752      283K/s   in 0.4s   

2009-12-29 20:39:26 (283 KB/s) - `board.html?forumId=11112&sid=1' saved [110752]

[root@ronalddove test]# grep t3 board.html\?forumId\=11112\&sid\=1  -A 2
                        <td class="t3">
                                <span>
                                <span>Syndri</span>
--
                        <td class="t3">
                                <span>
                                <span>Malkorix</span>
--
                        <td class="t3">
                                <span>
                                <span>Auryk</span>
--
                        <td class="t3">
                                <span>
                                <span>Vrakthris</span>
--
                        <td class="t3">
                                <span>
                                <span>Syndri</span>
--
                        <td class="t3">
Lolone
                        </td>
--
                        <td class="t3">
Greybeard
                        </td>
--
                        <td class="t3">
Trunkz
                        </td>
--
                        <td class="t3">
Lalauris
                        </td>
--
                        <td class="t3">
Numrouno
                        </td>
--
                        <td class="t3">
Lavitra
                        </td>
--
                        <td class="t3">
Twinkulater
                        </td>
--
                        <td class="t3">
Seriku
                        </td>
--
                        <td class="t3">
Soraspar
                        </td>
--
                        <td class="t3">
Demoralize
                        </td>
--
                        <td class="t3">
Marzee
                        </td>
--
                        <td class="t3">
Ordonn
                        </td>
--
                        <td class="t3">
Amerthan
                        </td>
--
                        <td class="t3">
Nebelhexa
                        </td>
--
                        <td class="t3">
Tzhaarmejjar
                        </td>
--
                        <td class="t3">
Khalidon
                        </td>
--
                        <td class="t3">
Ruthlessxaza
                        </td>
--
                        <td class="t3">
Gingolx
                        </td>
--
                        <td class="t3">
Mazadi
                        </td>
--
                        <td class="t3">
Eleye
                        </td>
--
                        <td class="t3">
Poolparty
                        </td>
--
                        <td class="t3">
Def?rge
                        </td>
--
                        <td class="t3">
Mercillus
                        </td>
--
                        <td class="t3">
Griimig
                        </td>
--
                        <td class="t3">
Blakken
                        </td>
[root@ronalddove test]#
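
The same extraction could be sketched in Python with the standard library's `html.parser` instead of grep. It assumes, based only on the output above, that each `td` cell with class `t3` contains nothing but the poster's name. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class NicknameParser(HTMLParser):
    """Collect the text of every <td class="t3"> cell, without duplicates."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._seen = set()
        self._in_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "t3":
            self._in_cell = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            # collapse the whitespace and newlines around the name
            name = " ".join("".join(self._buffer).split())
            if name and name not in self._seen:
                self._seen.add(name)
                self.names.append(name)
            self._in_cell = False


def extract_nicknames(html):
    parser = NicknameParser()
    parser.feed(html)
    parser.close()
    return parser.names
```

Because duplicates are dropped as they are seen, a name like Syndri that shows up twice on the board would only be added to the entry point list once.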

FuxieDK
Joined: 22 May 2008
Posts: 455
Location: Copenhagen, DK
WR Updates: 2,596,413

Posted: Wed Dec 30, 2009 4:07 am

Remember that the European forums use http://forums.wow-europe.com/
_________________
Doing census on various servers ;)

Rollie

Posted: Tue Jan 05, 2010 3:19 pm

Getting entry points is not a problem. My problem was with crawling the data in a timely fashion.

Ronnove

Posted: Wed Jan 06, 2010 12:24 am

Yeah, it's pretty slow. I got about 500,000 players with metadata stats; it's been running for almost two weeks. I got about 2,237,327 player/server listings without metadata, and about 88,000 guilds. Do you want my results? Are you interested in seeing the source code? I'm going to go ahead and release some of the stuff I got.

**link removed** -- 2 million+ valid player and server names. Careful, it's 44 MB; it might crash your browser if you view it in Firefox. Right-click and save as.

http://www.dovestech.com/wow/ searchable MySQL DB (it's a little slow with results and does not have every player in it yet; I was working on populating it). In fact, I don't even recommend using it yet.

Anyway, it was worth a shot and I learned a lot.

NOTE: USA armory only right now
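
For a rough sense of the crawl speed behind those numbers, here is a back-of-the-envelope calculation in Python. It assumes "almost two weeks" means 14 full days and that the per-player cost stays constant, neither of which is exact.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_rate(players_done, days_elapsed):
    """Average players fetched per second over the run."""
    return players_done / (days_elapsed * SECONDS_PER_DAY)


def days_to_crawl(total_players, rate_per_second):
    """Days needed to fetch total_players at a steady rate."""
    return total_players / rate_per_second / SECONDS_PER_DAY


rate = crawl_rate(500_000, 14)  # assumed: 14 days for 500,000 players
print(round(rate, 3))                             # 0.413 players per second
print(round(days_to_crawl(2_237_327, rate), 1))   # 62.6 days for the full listing
```

At that pace, pulling metadata for all 2,237,327 listings would take roughly two months on a single sequential crawler.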

Rollie

Posted: Wed Jan 06, 2010 8:57 am

I took down your DB link; a list like that can make it easier for in-game spammers to start spamming people.

I would be interested in taking a peek at your crawler. Feel free to email me, rollie (at) warcraftrealms.com
All times are GMT - 6 Hours