Parsing a Wikipedia dump

Question:

For example, using this Wikipedia dump:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm

Is there an existing library for Python that I can use to create an array with the mapping of subjects and values?

For example:

{height_ft,6},{nationality, American}
Asked By: tomwu

Answer #1:

It looks like you really want to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python’s built-in XML packages to extract the page content from the API’s response, then pass that content to mwlib’s parser to produce an object representation that you can browse and analyse in code to extract the information you want. mwlib is BSD-licensed.
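
As a rough illustration of the first step only (pulling the page content out of the API response with the standard library), here is a minimal sketch. It assumes the response was requested with format=xml rather than the format=xmlfm used in the question, since xmlfm is the HTML-wrapped, human-readable variant. The SAMPLE_RESPONSE string, its infobox name, and the extract_wikitext helper are hand-written stand-ins for illustration, not real API output or part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written stand-in for an API response (format=xml).
# The real response for a page is much longer, but nests the wikitext
# the same way: api > query > pages > page > revisions > rev.
SAMPLE_RESPONSE = """<api>
  <query>
    <pages>
      <page pageid="1" ns="0" title="LeBron James">
        <revisions>
          <rev xml:space="preserve">{{Infobox basketball biography
| height_ft = 6
| nationality = American
}}
'''LeBron James''' is a basketball player.</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>"""


def extract_wikitext(xml_text):
    """Return {page title: raw wikitext of the first <rev> element}."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("page"):
        rev = page.find("./revisions/rev")
        if rev is not None and rev.text:
            pages[page.get("title")] = rev.text
    return pages


pages = extract_wikitext(SAMPLE_RESPONSE)
print(pages["LeBron James"].splitlines()[0])
```

The resulting string is raw MediaWiki markup, which is what a markup parser such as mwlib would then take as input.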

Answered By: chaos95

Answer #2:

I just stumbled over a library on PyPI, wikidump, that claims to provide

Tools to manipulate and extract data from wikipedia dumps

I haven’t used it yet, so you are on your own to try it…

Answered By: PhilS

Answer #3:

I described how to do this using a combination of pywikibot and mwparserfromhell in this post (I don’t have enough reputation yet to flag this as a duplicate).

In [1]: import mwparserfromhell

In [2]: import pywikibot

In [3]: enwp = pywikibot.Site('en','wikipedia')

In [4]: page = pywikibot.Page(enwp, 'Waking Life')            

In [5]: wikitext = page.get()               

In [6]: wikicode = mwparserfromhell.parse(wikitext)

In [7]: templates = wikicode.filter_templates()

In [8]: templates?
Type:       list
String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name           = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
Length:     31
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items

In [10]: templates[:2]
Out[10]: 
[u'{{Use mdy dates|date=September 2012}}',
 u"{{Infobox film\n| name           = Waking Life\n| image          = Waking-Life-Poster.jpg\n| image_size     = 220px\n| alt            =\n| caption        = Theatrical release poster\n| director       = [[Richard Linklater]]\n| producer       = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer         = Richard Linklater\n| starring       = [[Wiley Wiggins]]\n| music          = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing        = Sandra Adair\n| studio         = [[Thousand Words]]\n| distributor    = [[Fox Searchlight Pictures]]\n| released       = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime        = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country        = United States\n| language       = English\n| budget         =\n| gross          = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]

In [11]: infobox_film = templates[1]

In [12]: for param in infobox_film.params:
             print param.name, param.value

 name             Waking Life

 image            Waking-Life-Poster.jpg

 image_size       220px

 alt             

 caption          Theatrical release poster

 director         [[Richard Linklater]]

 producer         [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West

 writer           Richard Linklater

 starring         [[Wiley Wiggins]]

 music            Glover Gill

 cinematography   Richard Linklater<br />[[Tommy Pallotta]]

 editing          Sandra Adair

 studio           [[Thousand Words]]

 distributor      [[Fox Searchlight Pictures]]

 released         {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}

 runtime          101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>

 country          United States

 language         English

 budget          

 gross            $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>

Don’t forget that params are mwparserfromhell objects too!

Answered By: notconfusing

Answer #4:

I know the question is old, but I was searching for a library that parses Wikipedia XML dumps. However, the suggested libraries, wikidump and mwlib, don’t offer much code documentation. Then I found Mediawiki-utilities, which has some code documentation at: http://pythonhosted.org/mediawiki-utilities/.

Answered By: Evelin Amorim

Answer #5:

WikiExtractor appears to be a clean, simple, and efficient way to do this in Python today: https://github.com/attardi/wikiextractor

It provides an easy way to parse a Wikipedia dump into a simple file structure like so:

<doc>...</doc>
<doc>...</doc>
...
<doc>...</doc>

…where each doc looks like:

<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
...
</doc>
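
Once the dump is in that layout, the documents can be read back with a few lines of standard-library Python. This is a sketch under assumptions: it relies only on the doc layout shown above (id and url attributes, title on the first line), uses a regular expression instead of an XML parser because the extracted files are not guaranteed to be well-formed XML, and the SAMPLE text (including its second document) and the iter_docs helper are made up for illustration.

```python
import re

# Hypothetical sample in the <doc> layout shown above.
SAMPLE = '''<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
</doc>
<doc id="3" url="http://it.wikipedia.org/wiki/Esempio">
Esempio.
Second document body.
</doc>
'''

# [^>]* tolerates extra attributes after url, should the tool emit any.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)"[^>]*>\n'
    r'(?P<body>.*?)\n?</doc>',
    re.DOTALL,
)


def iter_docs(text):
    """Yield one dict per <doc> block: id, url, title (first line), text."""
    for match in DOC_RE.finditer(text):
        title, _, body = match.group("body").partition("\n")
        yield {
            "id": match.group("id"),
            "url": match.group("url"),
            "title": title.strip(),
            "text": body.strip(),
        }


docs = list(iter_docs(SAMPLE))
print(docs[0]["title"])
```

Note that this output is plain article text, so infobox fields such as height_ft are no longer available at this stage.
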
Answered By: legel

Answer #6:

There’s some information on Python and XML libraries here.

If you’re asking whether there is an existing library designed to parse Wiki(pedia) XML specifically and match your requirements, that is doubtful. However, you can use one of the existing libraries to traverse the DOM and pull out the data you need.

Another option is to write an XSLT stylesheet that does something similar and call it using lxml. This also lets you make calls to Python functions from inside the XSLT, so you get the best of both worlds.
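
For the first approach (walking the XML with an existing library), a minimal standard-library sketch could look like the following. It streams the file with iterparse instead of building a full DOM, which matters for multi-gigabyte dumps. SAMPLE_DUMP is a hypothetical miniature dump that only mimics the page > revision > text nesting of the export format, and local_name and iter_pages are illustrative helper names.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; the namespace URI varies with the
# export schema version, so the code below ignores it.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Harmonium</title>
    <revision>
      <text>{{Infobox instrument|name=Harmonium}}</text>
    </revision>
  </page>
  <page>
    <title>Waking Life</title>
    <revision>
      <text>{{Infobox film|name=Waking Life}}</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs, one page at a time.

    If a page holds several revisions, the text of the last one wins.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = text = None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop this page's children to limit memory use


for title, text in iter_pages(io.BytesIO(SAMPLE_DUMP)):
    print(title, "->", text)
```

For a real dump, pass an open file object (for example from bz2.open on the compressed file) in place of the BytesIO wrapper.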

Answered By: imoatama

Answer #7:

I know this is an old question, but here is a great script that reads the wiki dump XML and outputs a very nice CSV:

PyPI: https://pypi.org/project/wiki-dump-parser/

GitHub: https://github.com/Grasia/wiki-scripts/tree/master/wiki_dump_parser

Answered By: Stian

Answer #8:

You’re probably looking for Pywikipediabot, which is meant for working with the Wikipedia API.

Answered By: Eugene
