Wikipedia

Author
j001
User
  • Total Posts : 0
  • Reward points: 0
  • Joined: 2019/07/31 03:32:38
  • Status: offline
2019/08/01 02:42:52 (permalink)

Hi,
1a.
I'd like to ask how to properly extract readable content from Wikipedia. I tried to follow the example here (https://hyperionics.com/forum2/tm.aspx?m=11939), but it doesn't seem to work.
I inserted the line
"url": "https?://en\\.wikipedia\\.org/",
and in nodeRemove
{ "tag": "div", "attrib": [{ "name": "id", "val": "toc" }]},
but the table of contents is still extracted. Am I doing something wrong?
 
1b. It might also be possible to transform the URL. For example, when sharing the URL https://en.wikipedia.org/wiki/Scotland to @Voice, I'd like to transform it to https://en.wikipedia.org/w/api.php?format=xmlfm&action=query&prop=extracts&redirects=true&explaintext&exsectionformat=plain&exlimit=1&titles=Scotland and import the "extract" tag,
or possibly transform it to https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&redirects=true&exlimit=1&titles=Scotland and import the HTML page in the "extract" tag.
Could this be done?
 
The first option would be better, as I could keep the images in the article and remove other content, although I would need to learn your method, since I'm not familiar with this language. I know how to use XPath (in Visual Web Ripper); it would be great if a user could specify XPath filters in @Voice for content that should be removed on specific websites.
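For illustration only, the URL rewrite described in 1b could be sketched outside @Voice roughly as follows. This is a minimal Python sketch, not something the app is known to support: the function name `article_to_extract_url` and the `plain_text` switch are hypothetical, and the query parameters are simply the ones from the two URLs above (with format=xml in both cases, and in a different order).

```python
from urllib.parse import urlsplit, unquote, urlencode

API_PATH = "/w/api.php"


def article_to_extract_url(article_url: str, plain_text: bool = True) -> str:
    """Map https://<lang>.wikipedia.org/wiki/<Title> to a prop=extracts query URL."""
    parts = urlsplit(article_url)
    host = parts.netloc.lower()
    prefix = "/wiki/"
    is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
    if not is_wikipedia or not parts.path.startswith(prefix):
        raise ValueError(f"not a Wikipedia article URL: {article_url}")

    # Decode the title from the path; urlencode() re-encodes it for the query.
    title = unquote(parts.path[len(prefix):])
    params = {
        "format": "xml",
        "action": "query",
        "prop": "extracts",
        "redirects": "true",
        "exlimit": "1",
        "titles": title,
    }
    if plain_text:
        # Sent as "explaintext=" (empty value); the post's URL has a bare
        # "explaintext" flag, and this sketch assumes the two are equivalent.
        params["explaintext"] = ""
        params["exsectionformat"] = "plain"
    return f"{parts.scheme}://{parts.netloc}{API_PATH}?{urlencode(params)}"


if __name__ == "__main__":
    print(article_to_extract_url("https://en.wikipedia.org/wiki/Scotland"))
    print(article_to_extract_url("https://en.wikipedia.org/wiki/Scotland", plain_text=False))
```

With plain_text=False the sketch reproduces the second URL above; the plain-text variant carries the same information as the first one apart from format=xml instead of xmlfm.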
 
1c. I think Wikipedia is such an important site that it would be worth giving it its own settings in @Voice (e.g. remove images; remove image captions (keep images); remove table of contents; remove tables; remove content below the headings See also|References|External links...; use the API with selected parameters, etc.). Just an idea.
 
2.
In Edit speech there is a TAGS option which I don't know how to use. Is there some kind of manual somewhere? I couldn't find one. Does it have something to do with SSML tags?
 
Thank you.


    Admin
    Administrator
    • Total Posts : 275
    • Reward points: 0
    • Joined: 2010/11/22 00:00:00
    • Location: USA
    • Status: offline
    Re: Wikipedia 2019/08/02 05:11:33 (permalink)
    If you email me the exact filter file you created, I could look for errors in it and send back corrections. What is wrong with the default text extraction from Wikipedia that the @Voice app does?
     
    Tags in speech replacements are only for management purposes. You could e.g. have one set of replacements created for reading fiction, another for scientific or technical articles etc. - then enable or disable them quickly with a tag (select all replacements with say "science" tag and disable them, or enable them etc.)
    post edited by Admin - 2019/08/02 05:13:33
    j001
    Re: Wikipedia 2019/08/03 07:41:35 (permalink)
    Well, there are plenty of issues. For example, I'd like to remove tables, image captions, and certain sections (like References, See Also, Further Reading, etc.). Also, headings with an [edit] link on the right side get fused together (History [edit] becomes Historyedit). I'm sure more examples could be found.
    I've just sent you an email with the filter, thank you.
    Admin
    Re: Wikipedia 2019/08/03 07:45:47 (permalink)
    The filter file you've sent me works fine. The problem is probably the folder in which you placed this .json filter file. User-created filters should go into the Filters sub-folder of the main @Voice data folder (usually something like .../Android/data/com.hyperionics.avar/files, although it may be moved elsewhere with Settings). The .config/filters folder is only for filters downloaded from my web site, and files you place there will be ignored.
     
    Greg
    j001
    Re: Wikipedia 2019/08/03 08:59:24 (permalink)
    Yes, I had it in the other folder. Now it works, great. What kind of language is this? Can I find a list of commands/parameters/attributes somewhere? Or where can I download your filters (I can see some in webcfg.txt)? Maybe I can learn something from them.
    Thank you.
    Admin
    Re: Wikipedia 2019/08/03 10:08:07 (permalink)
    The app downloads filter files as needed, for example if I had a filter file for wikipedia.org site, it would get downloaded when the user first opens a link to wikipedia.org in @Voice app. I don't have any documentation for this filter format, it is my creation and I extend it as needed. I have a "notes to myself" file about it. I'll paste it here:
    {
    "extractor": "default|full|Readability",

    "file": "(.*/vesper/.+)|(:pasted:)", OR
    "url": "https?://fee\\.org/articles/",
    "sampleLink": "https://www.something.com/article_link...",
    "siteType": "mobile|desktop", // mobile, desktop, or system default if not set
    "userAgent": "@|literal string|absent-use code default", // @ means use WebView user agent, or string like "Mozilla/5.0 (Linux; Android 6.0.1; vivo 1603 Build/MMB29M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.83 Mobile Safari/537.36"
    "title": "Some title...",
    ...

    "waitFor": { // NOT IMPLEMENTED YET!
    "waitBefore": 100, // ms to wait before using selector, default 0
    "selector": "div[class^=\\"starts_with--\\"]",
    "index": 1, // 0 is default, use document.querySelector() else .querySelectorAll()
    "condition": "present|absent|vis|invis|hi0|hi9", // present is default, hi9 means height is > 0
    "maxWait": 5000, // ms max time to wait for condition, default 5000
    "waitAfter": 200 // ms to wait after condition satisfied, default 0
    },

    "nextPgLink": { // works only with Readability extractor!
    "selector": "a[class=\\"fl-rt\\"]",
    "index": 0, // 0 is default, use document.querySelector() else .querySelectorAll()
    "maxPgs": 0 // max number of pages to process, default 0 - unlimited
    },

    "readMoreBtn": { // all optional, but must identify a button uniquely
    "waitTime": 1000, // wait for XX milliseconds, before doing anything else
    "btnTag": "button", // usually button, may be something else
    "btnText": "read more",
    "btnId": "some_id",
    "btnClass": "some_class",
    "parentTag": "div",
    "parentId": "some_id",
    "parentClass": "some_class",
    "scrollTimes": 8,
    "numToPress": 1, // 0: press all, 1 only first, 2 - only 1st and 2nd etc., default 1
    "fgOnly": true // if true, won't call moveTaskToBack(true); in ExtractBrowserActivity.java, when called with inBackground: true attribute
    },

    "nodeAdd": [
    { "tag": "div", "times": 1, "attrib": [{ "name": "class", "val": "article__wrapper" }]} // times optional, default 0 - adds all elements found
    ],

    "nodeRemove": [
    { "tag": "div", "times": 2, "attrib": [{ "name": "class", "val": "wrapper__mobile-wide-ad-container.*" }]} // times optional as above
    ],

    "appendHtml": [
    { "text": "<p avar_='stop'><i>To load more answers from Quora, press the &quot;Reload or clear&quot; button on top (circular arrows), then press &quot;Load from browser...&quot;. Next scroll the page as much as you want, and finally press the loudspeaker button at bottom-right.</i></p>" }
    ],
     
     
     
    "edit": [ // text edit - replace or remove sentences
    {
    "repeat": true,
    "from": "^\\s*oder\\s*$",
    "until": ":PAR2",
    "replace": ""
    }
    ]
    }

    "file" or "url"
    Can also contain ":pasted:" to process directly pasted text, as in the "file" RegEx above.

    "readMoreBtn"
    Must provide at least a unique btnId, or btnTag plus btnText or btnClass (or both).
    parentTag and parentClass or parentId may be provided to narrow the button search.
     
     
     
    "edit"
      "until"

    :PAR2 - delete 2 paragraphs, the current one + one more after it etc.
    :PAR-2 - delete the current paragraph + one before it etc.
    :END: - delete everything from the current sentence until the end.
    or RegEx for the last sentence to be deleted/replaced

    "from"
    "" - empty means from the very top of the text
    or RegEx for the first sentence to be deleted/replaced
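
To make the "from"/"until" notes above concrete, here is a toy Python model of a single non-repeating "edit" rule. It is only a reading of the notes, not the app's code: the function `apply_edit`, the choice to represent the text as a list of sentences, and the decision to leave the text unchanged when nothing matches are all assumptions, and the :PAR forms are left out because they need paragraph boundaries.

```python
import re
from typing import List, Optional


def _first_match(pattern: str, sentences: List[str], start: int = 0) -> Optional[int]:
    """Index of the first sentence at or after `start` that matches `pattern`."""
    for i in range(start, len(sentences)):
        if re.search(pattern, sentences[i]):
            return i
    return None


def apply_edit(sentences: List[str], from_re: str, until: str, replace: str = "") -> List[str]:
    """Toy model: replace the span from the "from" sentence to the "until" sentence."""
    # Empty "from" means "from the very top of the text".
    start = 0 if from_re == "" else _first_match(from_re, sentences)
    if start is None:
        return sentences  # assumption: no match leaves the text unchanged

    if until == ":END:":
        end = len(sentences) - 1  # everything up to the end of the text
    else:
        # Otherwise "until" is a RegEx for the last sentence to replace.
        end = _first_match(until, sentences, start)
        if end is None:
            return sentences

    replacement = [replace] if replace else []
    return sentences[:start] + replacement + sentences[end + 1:]


if __name__ == "__main__":
    text = ["Scotland is a country.", "See also", "List of things.", "References", "Smith 2001."]
    print(apply_edit(text, r"^(See also|References)$", ":END:"))
```

In this model ^ and $ only match when a heading is a sentence on its own with nothing else attached, which is one reason a pattern can fail on real extracted text.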

    post edited by Admin - 2019/08/03 10:14:36
    j001
    Re: Wikipedia 2019/08/07 11:51:12 (permalink)
    So I made this filter for Wikipedia:

    {
    "extractor": "full",
    "disabled": false,
    "url": "https?://en\\.wikipedia\\.org/",
    "edit": [
    { "repeat": false, "from": "^(See also|Notes|Footnotes|Images|Gallery|References|Bibliography|External links|Further reading)$", "until": ":END:", "replace": "" }
    ],
    "nodeAdd": [
    { "tag": "div", "attrib": [{ "name": "id", "val": "mw-content-text" }]}
    ],
    "nodeRemove": [
    { "tag": "div", "attrib": [{ "name": "id", "val": "toc" }]},
    { "tag": "div", "attrib": [{ "name": "class", "val": "gallerytext" }]},
    { "tag": "div", "attrib": [{ "name": "class", "val": "thumbcaption" }]},
    { "tag": "div", "attrib": [{ "name": "role", "val": "note" }]},
    { "tag": "span", "attrib": [{ "name": "class", "val": "mw-editsection" }]},
    { "tag": "sup", "attrib": [{ "name": "class" }]},
    { "tag": "table" }
    ]
    }

     
    I have a few problems here:
    1. The regex in the "edit" parameter doesn't work, specifically ^ and $ (beginning and end of sentence). If these are removed, it works, but then it is useless, as it could remove any content, not just everything below those headings.
    2. The "edit" parameter doesn't work if it is put at the bottom (after "nodeRemove"): try removing ^ and $ and putting it there. Nothing changes.
     
    3. Also, in the Edit speech options, spaces in my RegEx formulas are ignored, even if I use \s.
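
When a section of a filter seems to be ignored, one thing worth ruling out is a plain syntax slip such as a trailing comma or a stray closing bracket. The following Python sketch runs a strict JSON check on short made-up snippets; the helper name `json_syntax_error` is hypothetical. Whether @Voice's own parser tolerates such slips is not stated in this thread, and a strict parser would also reject the // comments used in the notes above, so a failure here is only a hint.

```python
import json
from typing import Optional


def json_syntax_error(text: str) -> Optional[str]:
    """Return None if `text` is strict JSON, else a short description of the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Made-up snippets, not taken from any real filter file.
ok = '{"nodeRemove": [{"tag": "table"}]}'
trailing_comma = '{"nodeAdd": [{"tag": "div"},]}'
stray_brackets = '{"nodeRemove": [{"tag": "table"}]}]}'

if __name__ == "__main__":
    for name, snippet in [("ok", ok), ("trailing_comma", trailing_comma),
                          ("stray_brackets", stray_brackets)]:
        print(name, "->", json_syntax_error(snippet))
```

A stray closing bracket is reported as extra data after the end of the document, which means a lenient parser could plausibly stop there and never see anything placed after it.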
     
    Is there a way to remove everything below the (See also|Notes|Footnotes|Images|Gallery|References|Bibliography|External links|Further reading) headings on Wikipedia pages?
    Thank you.
    Admin
    Re: Wikipedia 2019/08/08 04:40:52 (permalink)
    Your "edit" regex probably does not match the actual contents of the text after the DOM filters (nodeAdd, nodeRemove) are done. There may be some HTML code that is not visible but is present within the text. You could save the HTML text after extraction and see exactly what is there. For example, instead of a space there may be an &nbsp; code, and a lot more.
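
As a rough illustration of that point, the Python sketch below shows how an invisible non-breaking space or a fused "edit" suffix defeats a strict ^...$ pattern, and how a more tolerant pattern copes. The sample strings and the names `STRICT`, `TOLERANT` and `reveal` are made up for this example; the regex engine inside @Voice may treat whitespace differently from Python's, so any pattern should be checked against the saved HTML.

```python
import html
import re

# Strict pattern: the heading must be exactly "References".
STRICT = re.compile(r"^References$")

# Tolerant variant: allows surrounding whitespace or non-breaking spaces,
# plus an optional "edit" / "[edit]" suffix fused onto the heading.
TOLERANT = re.compile(r"^[\s\u00a0]*References(?:\s*\[?edit\]?)?[\s\u00a0]*$")


def reveal(raw: str) -> str:
    """Decode HTML entities and show otherwise invisible characters as escapes."""
    return ascii(html.unescape(raw))


if __name__ == "__main__":
    for raw in ["References", "References&nbsp;", "Referencesedit", "References [edit]"]:
        text = html.unescape(raw)
        print(reveal(raw), bool(STRICT.match(text)), bool(TOLERANT.match(text)))
```

Printing the text with ascii() is the useful part: it makes characters like \xa0 visible, so the pattern can be adjusted to what is really in the extracted text.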