Functions Overview
This page contains the list of functions and their implementation details.
Extraction Methods
The following functions download, crawl, and extract data from collaborative knowledge building portals. Currently, only the mining of Wikipedia and the Stack Exchange network is supported. A future release will support the extraction and analysis of portals such as GitHub, Reddit, and Quora.
-
kdap.analysis.knol.
download_dataset
(self, sitename, **kwargs)¶ Download dataset from site
Parameters: - sitename (basestring) – Name of the portal to download from, e.g. wikipedia, stackexchange
- **article_list (str or list[str]) – Wikipedia article(s) to download in knol-ML format
- **category_list (str or list[str]) – Single/list of wikipedia categories. When this parameter is provided, all the articles under these lists will be extracted
- **template_list (str or list[str]) – Single/list of wikipedia templates. When this parameter is provided, all the articles under these lists will be extracted
- **destdir (str) – Path to destination folder where the dataset will be downloaded
- **wikipedia_dump (str) – Path to the full Wikipedia dump. When provided, the articles will be extracted directly from the dump
- **download (bool) – If true, the articles will be downloaded
- **portal (str) – stackexchange portal name. Provided when sitename=’stackexchange’
Returns: - **final_category_list (list[str]) – A list of wikipedia article names. Returned only when category_list is provided as an argument
- **final_template_list (list[str]) – A list of wikipedia article names. Returned only when template_list is provided as an argument
-
kdap.analysis.knol.
get_wiki_article
(self, article_name, **kwargs)¶ Downloads the full revision history of an article in knol-ML format
Parameters: - article_name (str) – Name of the article to download revision history for
- **output_dir (str, optional) – Output directory for generated knol-ML file
-
kdap.analysis.knol.
get_wiki_article_by_class
(self, **kwargs)¶ Query database to extract articles based on category or project name
Parameters: - **wikiproject (str) – Name of the wikiproject for which you want to extract the articles. Should not be specified with wiki_class
- **wiki_class (str) – The Wikipedia quality class for which the articles have to be extracted. Should not be specified with wikiproject
Returns: **articles – A list of wikipedia article names.
Return type: list[str]
-
kdap.analysis.knol.
get_instance_date
(self, *args, **kwargs)¶ Retrieve the instance dates for a list of articles. Includes multiprocessing
Parameters: - **file_list (list[str] or str) – String or List of Knol-Ml article(s)’ path
- **dir_path (str) – Path of the directory which contains desired knol-Ml files
- **c_num (int) – Number of parallel threads you want
Returns: **instance_date – A dictionary with keys as articles and values as dates
Return type: dictionary
-
kdap.analysis.knol.
get_pageviews
(self, site_name, *args, **kwargs)¶ Get pageviews for a particular article
Parameters: - site_name (str) – Site to get pageviews from
- **article_name (str) – Article to get pageviews for
- **granularity (str) – Granularity of pageviews data e.g. monthly
- **start (str) – Date to start counting pageviews from
- **end (str) – Date to count pageviews till
-
kdap.analysis.knol.
get_num_instances
(self, **kwargs)¶ Extract number of instances based on start and end dates
Parameters: - **file_list (list[str]) – List of Knol-Ml articles’ path
- **dir_path (str) – Path of the directory which contains desired knol-Ml files
- **c_num (int) – Number of parallel threads you want
- **granularity (str) – Retrieve the instances monthly or yearly
- **start (str) – Start date in YYYY-MM-DD format
- **end (str) – End date in YYYY-MM-DD format
Returns: revisionLength – A dictionary with keys as articles and values as instances
Return type: dict
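The bucketing that get_num_instances describes can be sketched without KDAP. This is a minimal illustration and not the library's implementation: count_instances and its timestamps argument are hypothetical names, and ISO-formatted timestamp strings are assumed as input.

```python
from collections import Counter
from datetime import date, datetime


def count_instances(timestamps, granularity="monthly", start=None, end=None):
    """Count timestamps per month or year, keeping only those within [start, end]."""
    if granularity not in ("monthly", "yearly"):
        raise ValueError("granularity must be 'monthly' or 'yearly'")
    # start and end use the YYYY-MM-DD format; a missing bound means "unbounded"
    lo = date.fromisoformat(start) if start else date.min
    hi = date.fromisoformat(end) if end else date.max
    fmt = "%Y-%m" if granularity == "monthly" else "%Y"
    counts = Counter()
    for ts in timestamps:
        day = datetime.fromisoformat(ts).date()
        if lo <= day <= hi:
            counts[day.strftime(fmt)] += 1
    return dict(counts)


# Hypothetical revision timestamps of a single article
revisions = [
    "2019-01-05T10:00:00",
    "2019-01-20T12:30:00",
    "2019-03-02T08:15:00",
    "2020-07-11T09:00:00",
]
monthly = count_instances(revisions, "monthly", start="2019-01-01", end="2019-12-31")
yearly = count_instances(revisions, "yearly")
print(monthly)
print(yearly)
```

The same grouping key (month or year) could be reused to collect editor ids instead of counts, which is the shape that get_editors describes.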
-
kdap.analysis.knol.
get_editors
(self, **kwargs)¶ Extract editors based on granularity. Includes parallel processing
Parameters: - **file_list (list[str]) – List of Knol-Ml articles’ path
- **dir_path (str) – Path of the directory which contains desired knol-Ml files
- **c_num (int) – Number of parallel threads you want
- **granularity (str) – Retrieve the instances monthly or yearly
- **start (str) – Start date in YYYY-MM-DD format
- **end (str) – End date in YYYY-MM-DD format
Returns: **userList – A dictionary with keys as articles and values as editor ids
Return type: dictionary
get_author_edits(site_name, [article_list, dir_path, editor_list, all_wiki=False]) Gets the edits of particular users for the given articles
Parameters: - **all_wiki (bool) – When site_name is wikipedia, setting this to True retrieves all the edits of the users of the article
- **article_list (list[str]) – list of file names (in knolml format)
- **dir_path (str) – path of the directory where all the files are present (in knolml format)
- **editor_list (list[str]) – list of editor usernames for which edits are required
- **type (str) – type of edit to be measured e.g. bytes, edits, sentences. bytes by default
- **ordered_by (str) – means of ordering e.g. editor, questions, answers or article
Returns: author_contrib – A dictionary with keys as articles and values as author’s contribution
Return type: dict
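As a rough illustration of what a per-author contribution dictionary can look like, the sketch below aggregates one article's revision history by editor. It does not call KDAP: author_contributions, the (editor, page size) input format, and the choice to measure bytes as the absolute size change of each revision are all assumptions made for this example.

```python
from collections import defaultdict


def author_contributions(revisions, measure="bytes", editor_list=None):
    """Aggregate per-editor contributions for one article.

    revisions: chronological list of (editor, page_size_in_bytes) tuples.
    measure: 'edits' counts revisions; 'bytes' sums the absolute size change
             introduced by each revision (an assumption of this sketch).
    editor_list: optional list of editors to report on; None means everyone.
    """
    if measure not in ("bytes", "edits"):
        raise ValueError("measure must be 'bytes' or 'edits'")
    totals = defaultdict(int)
    previous_size = 0
    for editor, size in revisions:
        delta = abs(size - previous_size)
        previous_size = size  # track size even for editors that are filtered out
        if editor_list is not None and editor not in editor_list:
            continue
        totals[editor] += delta if measure == "bytes" else 1
    return dict(totals)


# Hypothetical revision history: (editor, page size after the edit)
history = [("alice", 100), ("bob", 160), ("alice", 150), ("carol", 400)]
by_bytes = author_contributions(history, measure="bytes")
by_edits = author_contributions(history, measure="edits", editor_list=["alice", "bob"])
print(by_bytes)
print(by_edits)
```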
Frame Methods
The following methods extract knol-ML articles as frames, which can be used to analyze each instance/revision/thread separately.
-
kdap.analysis.knol.
frame
(self, **kwargs)¶ This method takes file names as an argument and returns the list of frame objects
Parameters: - **file_name (str, optional) – The name of the article for which the frame objects have to be created.
- **dir_path (str, optional) – The path of the directory containing the knolml files
- **get_bulk (bool, optional) – If this is true, all the frames are returned as a list instead of an iterator. This will require extra memory
frame returns a generator that sequentially yields instances class objects, which are used to analyse each revision/thread separately
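The trade-off behind get_bulk (lazy iteration versus holding everything in memory) can be shown with a plain Python generator. This is only a sketch of the pattern: frames and _generate are hypothetical helpers, and the dictionaries they yield stand in for the instances objects that KDAP returns.

```python
def _generate(revisions):
    """Yield one record per revision, building each record only when requested."""
    for index, text in enumerate(revisions):
        yield {"index": index, "text": text, "bytes": len(text.encode("utf-8"))}


def frames(revisions, get_bulk=False):
    """Return a lazy generator, or a fully built list when get_bulk is True."""
    gen = _generate(revisions)
    return list(gen) if get_bulk else gen


revisions = ["first draft", "first draft, expanded", "final"]

lazy = frames(revisions)       # nothing is built yet
first = next(lazy)             # builds only the first record
remaining = list(lazy)         # consumes the rest of the generator

bulk = frames(revisions, get_bulk=True)  # every record is in memory at once
print(first["bytes"], len(remaining), len(bulk))
```

Iterating lazily keeps memory use flat for articles with long revision histories, while the bulk list allows random access and repeated passes.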
-
class
kdap.analysis.
instances
(instance, title)¶ Creates the instance of each object. The init function stores each instance’s attributes, which can be analyzed separately
-
get_bytes
()¶ Returns the bytes detail
Returns: **bytes – Number of bytes in the text
Return type: int
-
get_editor
()¶ Returns the editor details
Returns: **editor – Details related to the editor of this instance
Return type: dictionary
-
get_score
()¶ Returns the score details
Returns: **score – A dictionary of score values, if available
Return type: dictionary
Returns the tag details. Works for QnA dataset
Returns: **tags – List of tags, if available
Return type: list
-
get_text
(*args, **kwargs)¶ Returns the text data
Parameters: **clean (bool, optional) –
Returns: **text – Actual text of the instance
Return type: str
-
get_text_stats
(*args, **kwargs)¶ Returns the email ids in the text
Parameters: - title (bool, optional) –
- count_words (str, optional) –
- url (str, optional) –
-
get_timestamp
()¶ Returns the timestamp details
Returns: **timestamp – Timestamp details of this instance
Return type: dictionary
-
get_title
()¶ Returns the title
Returns: **title – Title of the Knowledge Data
Return type: str
-
is_answer
()¶ Returns True if the instance is an answer. Works with QnA based knolml dataset
Returns: **closed – Returns true if the post is an answer, if applicable
Return type: bool
-
is_closed
()¶ Returns True if the QnA thread is closed. Works with QnA based knolml dataset
Returns: **closed – Returns true if the post is closed, if applicable
Return type: bool
-
is_comment
()¶ Returns True if the instance is a comment. Works with QnA based knolml dataset
Returns: **closed – Returns true if the post is a comment, if applicable
Return type: bool
-
is_question
()¶ Returns True if the instance is a question. Works with QnA based knolml dataset
Returns: **closed – Returns true if the post is a question, if applicable
Return type: bool
Analysis Methods
The following functions analyse the knol-ML dataset. For most of these methods, the dataset has to be provided as an argument.
- This method finds the similarity between the sets of editors for a set of articles.
- Works on the dictionary returned by the get_editors() method
Parameters: - **editors (dictionary) – A dictionary of editors returned by the get_editors() method, granularity=’daily’
- **similarity (str) – Similarity measure to be computed, e.g. Jaccard
Returns: **similarity – A dictionary with keys as years, months, and days and values as the similarity measures
Return type: dictionary
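For reference, the Jaccard measure named above is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to the editor sets of two articles, period by period. It is an illustration under assumptions: editor_similarity is a hypothetical helper, and the {period: [editor ids]} input shape is assumed rather than taken from the actual output of get_editors().

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def editor_similarity(editors_a, editors_b):
    """Jaccard similarity of two articles' editor sets for every period seen in either."""
    periods = sorted(set(editors_a) | set(editors_b))
    return {p: jaccard(editors_a.get(p, ()), editors_b.get(p, ())) for p in periods}


# Hypothetical editors per period for two articles
article_a = {"2019-01": ["u1", "u2", "u3"], "2019-02": ["u1"]}
article_b = {"2019-01": ["u2", "u3", "u4"], "2019-03": ["u5"]}
similarity = editor_similarity(article_a, article_b)
print(similarity)
```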
-
kdap.analysis.knol.
get_local_gini_coefficient
(*args, **kwargs)¶ This method finds the gini coefficient for each article/file provided in the argument.
Parameters: - **file_list (list[str]) – List of Knol-Ml articles’ path
- **dir_path (str) – Path of the directory which contains desired knol-Ml files
- **c_num (int) – Number of parallel threads you want
Returns: **local_gini – A dictionary with keys as articles and values as the gini coefficients
Return type: dictionary
-
kdap.analysis.knol.
get_global_gini_coefficient
(self, *args, **kwargs)¶ This method finds the global gini coefficient for a set of articles/files provided in the argument.
Parameters: - **file_list (list[str]) – List of Knol-Ml articles’ path
- **dir_path (str) – Path of the directory which contains desired knol-Ml files
- **c_num (int) – Number of parallel threads you want
Returns: **global_gini – A gini value for the given articles
Return type: int
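The Gini coefficient measures how unevenly contributions are distributed: 0 means every editor contributed equally, and values close to 1 mean a few editors account for almost everything. The sketch below uses the standard formula on sorted values. It is not KDAP's implementation: the gini helper, the per-editor contribution data, and the reading of "global" as pooling each editor's contributions across all articles are assumptions made for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Formula on ascending data: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical contributions (e.g. bytes added) per editor, per article
contributions = {
    "Article A": {"u1": 10, "u2": 10, "u3": 10, "u4": 10},
    "Article B": {"u1": 0, "u2": 0, "u3": 0, "u5": 40},
}

# Local: one coefficient per article
local_gini = {title: gini(per_editor.values()) for title, per_editor in contributions.items()}

# Global (assumed meaning): pool every editor's total across all articles first
pooled = {}
for per_editor in contributions.values():
    for editor, amount in per_editor.items():
        pooled[editor] = pooled.get(editor, 0) + amount
global_gini = gini(pooled.values())

print(local_gini)
print(global_gini)
```

Note that the coefficient computed this way is a fractional value between 0 and 1.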
Graph Methods
The following methods create the wiki graph using the wikilinks of the articles. Users can choose one of these methods to create the wiki graph according to their requirements.
-
kdap.analysis.knol.
get_induced_graph_by_articles
(self, article_names)¶ Given a list of Wikipedia article names, the function returns the adjacency list of inter-wiki links
Parameters: **article_names (list[str]) – List of Wikipedia article names
Returns: **adj_list – An adjacency list of inter-wiki graph
Return type: list
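An induced graph keeps only the links whose source and target are both in the given set of articles. The sketch below shows that filtering step on hand-written link data. It is an illustration, not the library's code: induced_graph and outgoing_links are hypothetical names, the wikilinks are assumed to be already extracted, and the adjacency list is represented as a dict for readability.

```python
def induced_graph(article_names, outgoing_links):
    """Adjacency list restricted to the given articles.

    outgoing_links maps an article title to the titles it links to.
    Links to articles outside article_names, and self-links, are dropped.
    """
    selected = set(article_names)
    return {
        name: sorted(
            target
            for target in set(outgoing_links.get(name, ()))
            if target in selected and target != name
        )
        for name in article_names
    }


# Hypothetical wikilinks found in each article
links = {
    "India": ["Asia", "Delhi", "Cricket"],
    "Delhi": ["India", "Yamuna"],
    "Asia": ["India", "China"],
}
adj_list = induced_graph(["India", "Delhi", "Asia"], links)
print(adj_list)
```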
-
kdap.analysis.knol.
get_induced_graph_by_article
(self, article_name)¶ Given a Wikipedia article name, the function returns the adjacency list of inter-wiki links present in that article
Parameters: **article_name (str) – Wikipedia article name
Returns: **adj_list – An adjacency list of inter-wiki graph
Return type: list
-
kdap.analysis.knol.
get_city_graph_by_country
(self, country_name)¶ Given a country name, the function returns the adjacency list of inter-wiki links for the cities in that country
Parameters: **country_name (str) – Country name for which cities graph has to be created
Returns: **adj_list – An adjacency list of inter-wiki graph
Return type: list