Parsing The Armory (Solution)

Ronnove

Post by Ronnove »

Hi,

I'm a developer and I wrote a perl/bash script that connects to the official WoW armory and downloads XML files with details about each guild and the players in it. The guild checker script currently takes about 24 hours to run. This is how it works:

There is one flat file which has definitions of what to search for. I generated a list of guilds on the American servers:

Code:

[ronnie@ronalddove01 advanced]$ cat hitlist | wc -l
24543
[ronnie@ronalddove01 advanced]$ 
The list looks like this: <realm> <guild>, one entry per line.

Phase 1: The Python script crawls through the armory looking for every guild and downloads the member list of each guild. It stores a database of players based on what class they play. This process takes 24 hours to download every player from the 24k guild list. For example, this produced about 100k level 80 female night elf warriors across the entire list.
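
A minimal sketch of how the Phase 1 bucketing step could look in Python. This is not the actual script: the roster XML layout, the attribute names (`name`, `class`, `level`) and the function `bucket_by_class` are all assumptions made up for illustration, and the real armory XML may be structured differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical roster layout; the real armory XML may use different
# element and attribute names.
SAMPLE_ROSTER = """
<guild name="Example Guild">
  <character name="Aela" class="Warrior" level="80"/>
  <character name="Borin" class="Mage" level="80"/>
  <character name="Cyra" class="Warrior" level="72"/>
</guild>
"""


def bucket_by_class(roster_xml, realm, min_level=80):
    """Group (realm, name) pairs by class, keeping characters at or above min_level."""
    buckets = defaultdict(list)
    for ch in ET.fromstring(roster_xml.strip()).iter("character"):
        if int(ch.get("level", "0")) >= min_level:
            buckets[ch.get("class", "unknown")].append((realm, ch.get("name")))
    return dict(buckets)


print(bucket_by_class(SAMPLE_ROSTER, "Draenor"))
```

Each bucket could then be written out as its own per-class list for Phase 2 to work through.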

Phase 2: Downloads every player from the newly generated lists into a directory. This can be used for parsing statistical details such as reputation. This process takes 72 hours or more, depending on how many downloads are required.

These processes could be run once a month to get a better picture of the statistics.

New technology bot: Will download the entire armory by following links from player to player and roaming the whole WoW armory until it has every player's details.

The only problem I see with the bot is if Blizzard bans it for downloading this much data.

Code:

[ronnie@ronalddove01 archive]$ ls | wc -l
12258
[ronnie@ronalddove01 archive]$ du -h 
2.3G    .
[ronnie@ronalddove01 archive]$ 
As you can see... it downloaded 2.3GB of guild data and that is just a guild parse example from a BETA test run.

Ronald Dove
http://www.dovestech.com

Rollie
Site Admin
Posts: 4783
Joined: Sun Nov 28, 2004 11:52 am
Location: Austin, TX

Post by Rollie »

Are you offering to share your work?

I had done some armory crawling, but ran into issues with pulling data. If you were able to crawl the entire armory in 24 hours, then this roadblock is no longer in place.
phpbb:phpinfo()

Hybuir
Gear Dependent Squirrel
Posts: 1471
Joined: Tue Sep 06, 2005 6:22 am
Location: Austin, TX

Post by Hybuir »

prelude to the montage?

Ronnove

Post by Ronnove »

Hi,

I want to share my scripts with you in the near future. I am working on making them more user-friendly and more efficient. First I want to explain the limitations, which I'm sure you're aware of. The 24 hour figure was only for a specific group of characters (level 80 Warriors only, within the full warcraftrealms.com guild list). However, I am working towards a bot that crawls through the site as long as it has entry points such as nicknames. This is the process:

1.) Bot reads a flatfile line by line. Each line has a unique nickname, which must be populated by someone or something. It could just be a generic dictionary list to give the bot some "entry points" into the armory crawl. ** Future versions would use a MySQL database **

2.) The bot does a wget on the search result, parses out guild names, and creates a new flatfile with just guild names and realm names, one entry per line, first making sure there is no duplicate entry already. It then downloads each nickname that had a match in the search results. In its current state this includes a reputation page and a profile page. I guess we could have it pull anything about the character by sorting through the XML/HTML and putting it into a database.

3.) The bot processes the generated guild flatfile and fetches each guild's member list to download every member of the guild.

4.) Future versions of the bot will branch out to pvp groups to grab those characters.

5.) There is a flag that tells the bot to do a full "refresh", which basically checks the same chars for updates. There is also a flag to skip chars it already downloaded. However, it will always refresh guild lists for updates.
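
Steps 1 to 3 and the refresh/skip behaviour from step 5 can be sketched roughly as below. This is only an illustration under stated assumptions, not the bot itself: `crawl`, `search`, `roster` and `download` are hypothetical names, the network calls are passed in as plain functions so the logic can be tried without touching the armory, and the two flags are collapsed into a single `refresh` boolean.

```python
def crawl(entry_names, search, roster, download, already=(), refresh=False):
    """Walk entry nicknames -> search hits -> guild rosters.

    search(nick)         -> iterable of (realm, name, guild_or_None)  (hypothetical)
    roster(realm, guild) -> iterable of member names                  (hypothetical)
    download(realm, name) fetches one character                       (hypothetical)
    already: (realm, name) pairs fetched by earlier runs
    refresh: if True, re-download characters listed in `already`
    """
    previously = set(already)
    this_run = set()
    guilds = {}  # dict used as an insertion-ordered, duplicate-free set

    def grab(realm, name):
        key = (realm, name)
        if key in this_run:
            return  # never fetch the same character twice in one run
        if key in previously and not refresh:
            return  # skip characters from earlier runs unless refreshing
        download(realm, name)
        this_run.add(key)

    # Steps 1 and 2: search each entry nickname, note guilds, fetch the hits.
    for nick in entry_names:
        for realm, name, guild in search(nick):
            if guild:
                guilds.setdefault((realm, guild), None)
            grab(realm, name)

    # Step 3: guild rosters are always re-read, then each member is fetched.
    for realm, guild in guilds:
        for name in roster(realm, guild):
            grab(realm, name)

    return list(guilds), this_run
```

Keeping the fetch functions as parameters means the same loop works whether the pages come from wget output on disk or from a live request.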

Basically the limitation comes from the entry point file, which has a list of generic names: it's possible the bot could miss a character completely. But it should be able to collect a very big chunk of the data. The other issue is taking the data from the XML and storing it in a database. It's all in XML, so it's very possible to just filter out the details of a character.
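
For the XML-to-database part, here is a small sketch using Python's built-in `sqlite3` in place of MySQL so it runs anywhere. The reputation XML layout, the `reputation` table and the function `store_reputation` are assumptions for illustration only; the real armory pages may look quite different.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical reputation page; real armory XML may look different.
SAMPLE_SHEET = """
<reputation>
  <faction name="Argent Crusade" value="42000"/>
  <faction name="Kirin Tor" value="21000"/>
</reputation>
"""


def store_reputation(conn, realm, name, sheet_xml):
    """Parse one character's reputation XML and upsert a row per faction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reputation ("
        "realm TEXT, name TEXT, faction TEXT, value INTEGER, "
        "PRIMARY KEY (realm, name, faction))"
    )
    rows = [
        (realm, name, f.get("name"), int(f.get("value", "0")))
        for f in ET.fromstring(sheet_xml.strip()).iter("faction")
    ]
    # INSERT OR REPLACE lets a later refresh overwrite the old value.
    conn.executemany("INSERT OR REPLACE INTO reputation VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
store_reputation(conn, "Draenor", "Aela", SAMPLE_SHEET)
print(conn.execute("SELECT faction, value FROM reputation ORDER BY faction").fetchall())
```

The composite primary key is what keeps a refresh from piling up duplicate rows for the same character and faction.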

It's a lot of work, but I think it's worth it if Blizzard is going to allow us to piggyback off its armory database. I sent an email to Blizzard asking whether I can have a bot that crawls through the armory site. Google does it to regular websites; I don't see why we can't do it here.
Last edited by Ronnove on Tue Dec 29, 2009 7:40 pm, edited 1 time in total.

Ronnove

Post by Ronnove »

We could even have other entry points like the forums. Maybe the bot refreshes the forums and parses nicknames from them? If a name is not a duplicate, the bot adds it to the "entry point" nickname list where it begins its crawl.

Code:

[root@ronalddove test]# wget "http://forums.worldofwarcraft.com/board.html?forumId=11112&sid=1"
[root@ronalddove test]# ls
board.html?forumId=11112&sid=1
[root@ronalddove test]# rm board.html\?forumId\=11112\&sid\=1 
rm: remove regular file `board.html?forumId=11112&sid=1'? y
[root@ronalddove test]# wget "http://forums.worldofwarcraft.com/board.html?forumId=11112&sid=1"
--2009-12-29 20:39:26--  http://forums.worldofwarcraft.com/board.html?forumId=11112&sid=1
Resolving forums.worldofwarcraft.com... 12.129.242.24
Connecting to forums.worldofwarcraft.com|12.129.242.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `board.html?forumId=11112&sid=1'

    [  <=>                                                                               ] 110,752      283K/s   in 0.4s    

2009-12-29 20:39:26 (283 KB/s) - `board.html?forumId=11112&sid=1' saved [110752]

[root@ronalddove test]# grep t3 board.html\?forumId\=11112\&sid\=1  -A 2
                        <td class="t3">
                                <span>
                                <span>Syndri</span>
--
                        <td class="t3">
                                <span>
                                <span>Malkorix</span>
--
                        <td class="t3">
                                <span>
                                <span>Auryk</span>
--
                        <td class="t3">
                                <span>
                                <span>Vrakthris</span>
--
                        <td class="t3">
                                <span>
                                <span>Syndri</span>
--
                        <td class="t3">
Lolone
                        </td>
--
                        <td class="t3">
Greybeard
                        </td>
--
                        <td class="t3">
Trunkz
                        </td>
--
                        <td class="t3">
Lalauris
                        </td>
--
                        <td class="t3">
Numrouno
                        </td>
--
                        <td class="t3">
Lavitra
                        </td>
--
                        <td class="t3">
Twinkulater
                        </td>
--
                        <td class="t3">
Seriku
                        </td>
--
                        <td class="t3">
Soraspar
                        </td>
--
                        <td class="t3">
Demoralize
                        </td>
--
                        <td class="t3">
Marzee
                        </td>
--
                        <td class="t3">
Ordonn
                        </td>
--
                        <td class="t3">
Amerthan
                        </td>
--
                        <td class="t3">
Nebelhexa
                        </td>
--
                        <td class="t3">
Tzhaarmejjar
                        </td>
--
                        <td class="t3">
Khalidon
                        </td>
--
                        <td class="t3">
Ruthlessxaza
                        </td>
--
                        <td class="t3">
Gingolx
                        </td>
--
                        <td class="t3">
Mazadi
                        </td>
--
                        <td class="t3">
Eleye
                        </td>
--
                        <td class="t3">
Poolparty
                        </td>
--
                        <td class="t3">
Def?rge
                        </td>
--
                        <td class="t3">
Mercillus
                        </td>
--
                        <td class="t3">
Griimig
                        </td>
--
                        <td class="t3">
Blakken
                        </td>
[root@ronalddove test]# 
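
The grep output above shows names sitting inside `<td class="t3">` cells, sometimes wrapped in nested `<span>` tags and sometimes as bare text. A rough Python sketch of pulling those names into a duplicate-free entry point list follows. The class `T3Names` and the function `extract_names` are made-up names, and the sketch assumes the cell holds nothing but the name, which the truncated grep output cannot confirm.

```python
from html.parser import HTMLParser


class T3Names(HTMLParser):
    """Collect the text of every <td class="t3"> cell, without duplicates."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.parts = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "t3":
            self.in_cell = True
            self.parts = []

    def handle_data(self, data):
        if self.in_cell:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.in_cell = False
            # Collapse the whitespace left over from nested tags and newlines.
            name = " ".join("".join(self.parts).split())
            if name and name not in self.names:
                self.names.append(name)


def extract_names(html):
    """Return poster names in page order, each listed once."""
    parser = T3Names()
    parser.feed(html)
    parser.close()
    return parser.names
```

Calling `extract_names(open("board.html?forumId=11112&sid=1").read())` on the saved page would give the list to append to the entry point file.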

FuxieDK
Census Taker
Posts: 659
Joined: Thu May 22, 2008 11:36 am
Location: Copenhagen, DK

Post by FuxieDK »

Remember that the European forums use http://forums.wow-europe.com/
Doing census mainly on Draenor; Raluf - Nimsay - Lusmo - Quixx - Sosyan - Garthog - Trubin - Zalistra - Zesmi and Djaang

Rollie

Post by Rollie »

Getting entry points is not a problem. My problem was with crawling the data in a timely fashion.

Ronnove

Post by Ronnove »

Yeah, it's pretty slow. I got about 500,000 players with metadata stats, and it has been running for almost two weeks. I also got 2,237,327 player/server listings without metadata and about 88,000 guilds. Do you want my results? Are you interested in seeing the source code? I'm going to go ahead and release some of what I have.

**link removed** -- 2 million+ valid player and server names. Careful, it's 44MB and might crash your browser if you view it in Firefox. Right-click and save as.

http://www.dovestech.com/wow/ searchable MySQL DB (it's a little slow with results and does not have every player in it yet; I was still working on populating it). In fact, I don't even recommend using it yet.

Anyway, it was worth a shot and I learned a lot.

NOTE: USA armory only right now

Rollie

Post by Rollie »

I took down your db link; that list could make it easier for in-game spammers to start spamming people.

I would be interested in taking a peek at your crawler. Feel free to email me, rollie (at) warcraftrealms.com
