Wikipedia | Hyperionics Support Forum

Wikipedia

#1 j001, 2019/08/01 02:42:52

Hi,

1a. I'd like to ask how to properly extract readable content from Wikipedia. I tried to follow the example here (https://hyperionics.com/forum2/tm.aspx?m=11939), but it doesn't seem to work. I inserted the line

    "url": "https?://en\\.wikipedia\\.org/",

and, in nodeRemove,

    { "tag": "div", "attrib": [{ "name": "id", "val": "toc" }]},

but the table of contents is still extracted. Am I doing something wrong?

1b. There is also the possibility of transforming the URL. For example, when sharing the URL https://en.wikipedia.org/wiki/Scotland to @Voice, I'd like to transform it to

    https://en.wikipedia.org/w/api.php?format=xmlfm&action=query&prop=extracts&redirects=true&explaintext&exsectionformat=plain&exlimit=1&titles=Scotland

and import the "extract" tag, or possibly transform it to

    https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&redirects=true&exlimit=1&titles=Scotland

and import the HTML page in the "extract" tag. Could this be done? The first option would be better, as I could keep images in the article and remove other content, although I need to learn how to use your method. I'm not familiar with this language. I know how to use XPath (in Visual Web Ripper); it would be great if a user could specify XPath filters in @Voice for content that should be removed from specific websites.

1c. I think Wikipedia is such an important site that it would be worth giving it its own settings in @Voice (e.g. remove images; remove image captions but keep images; remove the table of contents; remove tables; remove content below the headings See also|References|External links...; use the API with selected parameters, etc.). Just an idea.

2. In Edit speech there is a TAGS option which I don't know how to use. Is there some kind of manual somewhere? I couldn't find any. Does it have something to do with SSML tags?

Thank you.
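The URL mapping asked about in 1b can be sketched outside the app. The thread does not say whether @Voice can rewrite shared URLs, so this is only an illustration of the mapping itself, in Python; the function name `wikipedia_extract_url` is hypothetical, and the query parameters are copied from the two URLs in the post.

```python
from urllib.parse import unquote, urlencode, urlsplit


def wikipedia_extract_url(page_url, plain_text=True):
    """Map a Wikipedia article URL to an API "extracts" URL (illustrative helper)."""
    parts = urlsplit(page_url)
    prefix = "/wiki/"
    host_ok = parts.netloc == "wikipedia.org" or parts.netloc.endswith(".wikipedia.org")
    if not host_ok or not parts.path.startswith(prefix):
        raise ValueError(f"not a Wikipedia article URL: {page_url}")

    # The article title is everything after /wiki/ in the path.
    title = unquote(parts.path[len(prefix):])

    # Parameter sets taken from the two URLs in the post above.
    params = {
        "format": "xmlfm" if plain_text else "xml",
        "action": "query",
        "prop": "extracts",
        "redirects": "true",
    }
    if plain_text:
        # Emitted as "explaintext=". The post writes the flag without "=";
        # the MediaWiki API treats a boolean parameter as set when present.
        params["explaintext"] = ""
        params["exsectionformat"] = "plain"
    params["exlimit"] = "1"
    params["titles"] = title
    return f"{parts.scheme}://{parts.netloc}/w/api.php?{urlencode(params)}"


if __name__ == "__main__":
    print(wikipedia_extract_url("https://en.wikipedia.org/wiki/Scotland", plain_text=False))
```

With `plain_text=False` the result is the second URL from the post; with `plain_text=True` it is the first one, except that the flag is written as `explaintext=`.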
#2 Admin (Administrator), 2019/08/02 05:11:33 (edited 2019/08/02 05:13:33)

If you send me by email the exact filter file you created, I can look for errors in it and send back corrections. What is wrong with the default text extraction from Wikipedia that the @Voice app does?

Tags in speech replacements are only for management purposes. You could, for example, have one set of replacements for reading fiction and another for scientific or technical articles, then enable or disable them quickly by tag (select all replacements with, say, the "science" tag and disable or enable them).
#3 j001, 2019/08/03 07:41:35

Well, there are plenty of issues. For example, I'd like to remove tables, image captions and certain sections (References, See also, Further reading, etc.). Also, headings with an [edit] link on the right side are fused together (History [edit] becomes Historyedit). More examples could surely be found. I've just sent you an email with the filter, thank you.
#4 Admin (Administrator), 2019/08/03 07:45:47

The filter file you've sent me works fine. The problem is probably the folder in which you place this .json filter file. User-created filters should go into the Filters sub-folder of the main @Voice data folder (usually something like .../Android/data/com.hyperionics.avar/files, although it may be moved elsewhere in Settings). The .config/filters folder is only for filters downloaded from my web site, and files you place there will be ignored.

Greg
#5 j001, 2019/08/03 08:59:24

Yes, I had it in the other folder. Now it works, great. And what kind of language is this? Can I find a list of commands/parameters/attributes somewhere? Or where can I download your filters (I can see some in webcfg.txt)? Maybe I can learn something from them. Thank you.
#6 Admin (Administrator), 2019/08/03 10:08:07 (edited 2019/08/03 10:14:36)

The app downloads filter files as needed. For example, if I had a filter file for the wikipedia.org site, it would be downloaded when the user first opens a link to wikipedia.org in the @Voice app. I don't have any documentation for this filter format; it is my creation and I extend it as needed. I have a "notes to myself" file about it. I'll paste it here:

    {
      "extractor": "default|full|Readability",
      "file": "(./vesper/.+)|(:pasted:)",
      OR
      "url": "https?://fee\\.org/articles/",
      "sampleLink": "https://www.something.com/article_link...",
      "siteType": "mobile|desktop", // mobile, desktop, or system default if not set
      "userAgent": "@|literal string|absent-use code default", // @ means use WebView user agent, or string like "Mozilla/5.0 (Linux; Android 6.0.1; vivo 1603 Build/MMB29M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.83 Mobile Safari/537.36"
      "title": "Some title...",
      ...
      "waitFor": { // NOT IMPLEMENTED YET!
        "waitBefore": 100, // ms to wait before using selector, default 0
        "selector": "div[class^=\\"starts_with--\\"]",
        "index": 1, // 0 is default, use document.querySelector() else .querySelectorAll()
        "condition": "present|absent|vis|invis|hi0|hi9", // present is default, hi9 means height is > 0
        "maxWait": 5000, // ms max time to wait for condition, default 5000
        "waitAfter": 200 // ms to wait after condition satisfied, default 0
      },
      "nextPgLink": { // works only with Readability extractor!
        "selector": "a[class=\\"fl-rt\\"]",
        "index": 0, // 0 is default, use document.querySelector() else .querySelectorAll()
        "maxPgs": 0 // max number of pages to process, default 0 - unlimited
      },
      "readMoreBtn": { // all optional, but must identify a button uniquely
        "waitTime": 1000, // wait for XX milliseconds, before doing anything else
        "btnTag": "button", // usually button, may be something else
        "btnText": "read more",
        "btnId": "some_id",
        "btnClass": "some_class",
        "parentTag": "div",
        "parentId": "some_id",
        "parentClass": "some_class",
        "scrollTimes": 8,
        "numToPress": 1, // 0: press all, 1 only first, 2 - only 1st and 2nd etc., default 1
        "fgOnly": true // if true, won't call moveTaskToBack(true); in ExtractBrowserActivity.java, when called with inBackground: true attribute
      },
      "nodeAdd": [
        { "tag": "div", "times": 1, "attrib": [{ "name": "class", "val": "article__wrapper" }]} // times optional, default 0 - adds all elements found
      ],
      "nodeRemove": [
        { "tag": "div", "times": 2, "attrib": [{ "name": "class", "val": "wrapper__mobile-wide-ad-container." }]} // times optional as above
      ],
      "appendHtml": [
        { "text": "<p avar_='stop'><i>To load more answers from Quora, press the \"Reload or clear\" button on top (circular arrows), then press \"Load from browser...\". Next scroll the page as much as you want, and finally press the loudspeaker button at bottom-right.</i></p>" }
      ],
      "edit": [ // text edit - replace or remove sentences
        { "repeat": true, "from": "^\\soder\\s$", "until": ":PAR2", "replace": "" }
      ]
    }

    "file" or "url"
        Can also contain ":pasted:" to process directly pasted text, as in the RegEx above.

    "readMoreBtn"
        Must provide at least a unique btnId, or btnTag + btnText or btnClass (or both).
        parentTag and parentClass or parentId may be provided to narrow the button search.

    "edit"
        "until"
            :PAR2 - delete 2 paragraphs, the current one + one more after it, etc.
            :PAR-2 - delete the current paragraph + one before it, etc.
            :END: - delete everything from the current sentence until the end.
            or a RegEx for the last sentence to be deleted/replaced
        "from"
            "" - empty means from the very top of the text
            or a RegEx for the first sentence to be deleted/replaced
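The "until" markers in the notes above are easier to follow with a toy model. The Python sketch below is one reading of those notes, not the app's code: it works on whole paragraphs rather than sentences, handles only the :PARn, :PAR-n and :END: forms (a RegEx "until" is not modeled), and the function name `apply_edit` is hypothetical.

```python
import re


def apply_edit(paragraphs, rule):
    """Apply one "edit" rule to a list of paragraphs (simplified, assumed semantics)."""
    # An empty "from" means "start at the very top of the text".
    start_re = re.compile(rule["from"]) if rule["from"] else None
    until = rule["until"]
    replacement = [rule["replace"]] if rule["replace"] else []
    out = list(paragraphs)
    i = 0
    while i < len(out):
        if start_re is not None and not start_re.search(out[i]):
            i += 1
            continue
        if until == ":END:":
            lo, hi = i, len(out)              # current paragraph through the end
        else:
            m = re.fullmatch(r":PAR(-?\d+)", until)
            if m is None or int(m.group(1)) == 0:
                raise ValueError(f"unsupported 'until' value in this sketch: {until!r}")
            n = int(m.group(1))
            if n > 0:
                lo, hi = i, min(len(out), i + n)   # current + (n - 1) after it
            else:
                lo, hi = max(0, i + n + 1), i + 1  # current + (|n| - 1) before it
        out[lo:hi] = replacement
        if not rule.get("repeat", False):
            break
        i = lo + len(replacement)             # continue after the edited span
    return out


if __name__ == "__main__":
    text = ["Intro", "oder", "ad text", "Body"]
    rule = {"repeat": True, "from": r"^\s*oder\s*$", "until": ":PAR2", "replace": ""}
    print(apply_edit(text, rule))
```

Under this reading, ":PAR2" removes the matching paragraph plus the one after it, ":PAR-2" removes the matching paragraph plus the one before it, and ":END:" removes everything from the match onward.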
#7 j001, 2019/08/07 11:51:12

So I made this filter for Wikipedia:

    {
      "extractor": "full",
      "disabled": false,
      "url": "https?://en\\.wikipedia\\.org/",
      "edit": [
        { "repeat": false, "from": "^(See also|Notes|Footnotes|Images|Gallery|References|Bibliography|External links|Further reading)$", "until": ":END:", "replace": "" }
      ],
      "nodeAdd": [
        { "tag": "div", "attrib": [{ "name": "id", "val": "mw-content-text" }]}
      ],
      "nodeRemove": [
        { "tag": "div", "attrib": [{ "name": "id", "val": "toc" }]},
        { "tag": "div", "attrib": [{ "name": "class", "val": "gallerytext" }]},
        { "tag": "div", "attrib": [{ "name": "class", "val": "thumbcaption" }]},
        { "tag": "div", "attrib": [{ "name": "role", "val": "note" }]},
        { "tag": "span", "attrib": [{ "name": "class", "val": "mw-editsection" }]},
        { "tag": "sup", "attrib": [{ "name": "class" }]},
        { "tag": "table" }
      ]
    }

I have a few problems here:

1. The regex in the "edit" parameter doesn't work, specifically ^ and $ (beginning and end of sentence). If these are removed it works, but then it is useless, as it could remove any content, not just everything below those headings.

2. The "edit" parameter doesn't work if it is put at the bottom (after "nodeRemove"). Try removing ^ and $ and putting it there: nothing will change.

3. Also, in the Edit speech options, a space in my RegEx formulas is ignored, even if I use \s.

Is there a way to remove everything below the (See also|Notes|Footnotes|Images|Gallery|References|Bibliography|External links|Further reading) headings on Wikipedia pages?

Thank you.
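Hand-edited filter files are easy to get slightly wrong (a trailing comma, a stray bracket, a "url" pattern that does not match the link being shared), so a quick check on a desktop machine before copying the file to the phone can save a round trip. The sketch below uses Python's strict `json` parser; whether @Voice's own parser is stricter or more lenient is not stated in this thread, and `check_filter` and the shortened sample filter are illustrative only.

```python
import json
import re

# A shortened filter in the format discussed in this thread (illustrative).
FILTER_TEXT = r'''
{
  "extractor": "full",
  "url": "https?://en\\.wikipedia\\.org/",
  "nodeRemove": [
    { "tag": "div", "attrib": [{ "name": "id", "val": "toc" }] },
    { "tag": "table" }
  ]
}
'''


def check_filter(text, sample_link):
    """Parse filter text as strict JSON, then try its "url" pattern on a link."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as err:
        return f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
    pattern = cfg.get("url")
    if pattern is None:
        return "no url pattern in filter"
    # After JSON decoding, "\\." becomes the regex "\." (a literal dot).
    # Assumption: the pattern may match anywhere in the link.
    if not re.search(pattern, sample_link):
        return "url pattern does not match the sample link"
    return "ok"


if __name__ == "__main__":
    print(check_filter(FILTER_TEXT, "https://en.wikipedia.org/wiki/Scotland"))
    # A trailing comma after the last array element is rejected by strict JSON.
    broken = FILTER_TEXT.replace('{ "tag": "table" }', '{ "tag": "table" },')
    print(check_filter(broken, "https://en.wikipedia.org/wiki/Scotland"))
```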
#8 Admin (Administrator), 2019/08/08 04:40:52

Your Edit regex probably does not match the actual contents of the text after the DOM filters (nodeAdd, nodeRemove) are done. There may be some HTML code, not visible but present within the text. You could save the HTML text after extraction and look at exactly what is there. For example, instead of a space there may be a &nbsp; code, and a lot more.
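The point about invisible characters can be reproduced in a few lines. This Python sketch only illustrates the general problem (an anchored pattern with a literal space fails when the text contains a non-breaking space); it makes no claim about how @Voice stores or matches the extracted text, and `normalize` is a hypothetical helper.

```python
import html
import re

# Anchored heading pattern in the style of the filter above, with a literal space.
HEADING = re.compile(r"^(See also|References|External links)$")


def normalize(text):
    """Decode HTML entities and turn non-breaking spaces into plain spaces."""
    return html.unescape(text).replace("\u00a0", " ").strip()


if __name__ == "__main__":
    raw = "See&nbsp;also"          # what the saved HTML might contain
    decoded = html.unescape(raw)   # looks like "See also" but holds U+00A0

    print(repr(decoded))
    print(HEADING.search(raw) is not None)
    print(HEADING.search(decoded) is not None)
    print(HEADING.search(normalize(raw)) is not None)
```

Printing the text with `repr()`, as above, is a simple way to see such characters in a saved extraction.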
