Querying Wikipedia Data
Recently I wrote a blog about the stub article length of Wikipedia articles. I mentioned the difference in actual number of characters and the number of bytes used to define stub articles between English and Indian language Wikipedias. One can open any language Wikipedia, type Special:ShortPages in the search box to get the list of articles which have less than 2048 bytes.
Originally published blog can be accessed here.
However, as already mentioned in that blog, the number of bytes for Indian languages to be considered as stub should be actually 2048*3 = 6144 bytes employing the same criteria. How to find the list of articles fulfilling this condition?
This brings us to the topic of querying Wikipedia data. Wikimedia Foundation Labs has put up a website wherein one can run SQL queries on Wikimedia data. The URL of the website is quarry.wmflabs.org. When we open the website, we get a textbox wherein one can type the SQL query which will run on Wikimedia data. In this example I will consider Wikipedia only. But the queries can be run on the data of other Wikimedia projects like Wikisource, Wikidata, Wiktionary, etc.
One has to login with his/her Wikimedia login. After logging the SQL query can be typed in the textbox and the Submit Query button has to be clicked. The result of execution of the query on Wikimedia data will be displayed. In this blog I will throw more light on Kannada Wikipedia. The database for Kannada Wikipedia is called knwiki_p. Complete list of databases can be obtained by running the SQL query “show databases”.
To get the list of tables in Kannada Wikipedia, the following SQL queries have to be executed:
use knwiki_p;
show tables;
To know the scheme of any table, run the query desc <tablename>;. For example, to know the details of the table by name, issue the query desc page;. The fields which are of importance in the current case is page_title and page_len. The following query will list all articles in Kannada Wikipedia which are having less than 6144 bytes.
use knwiki_p;
select page_title, page_len
from page where page_len < ‘6144’ and page_namespace = 0 and page_is_redirect = 0 order by page_len ;
Some other useful queries are listed below:
Query | What it does |
---|---|
Select Count(*) from page where page_namespace = 0 and page_is_redirect =0; | Number of articles without redirect |
Select Count(*) from page where page_namespace = 0 and page_is_redirect =0 and page_len < 6144; | Number of articles which are having bytes less than 6144 |
select * from user where user_name Like “P%”; | List all users whose username starts with letter “P” |
select user_id, user_name, user_editcount from user where user_editcount >3000 order by user_editcount desc; | List all users with editcount more than 3000 |
SELECT
page_namespace, page_title, page_len FROM page WHERE page_len > 175000 AND page_title NOT LIKE “%/%” ORDER BY page_namespace ASC; |
List of long articles (articles having bytes more than 175000) |
SELECT rc_title as title, rc_comment as comments, count(*) as Edits
FROM recentchanges WHERE rc_namespace = 0 GROUP BY 1 ORDER BY 3 DESC LIMIT 100; |
Most edited 100 pages during past one month |
SELECT log_title, COUNT(*) FROM logging WHERE log_type=”thanks” GROUP BY log_title ORDER BY COUNT(*) DESC LIMIT 100; | Who have been thanked most |
Useful links -