Title: 【OCR Tutorial】 OCR Tutorial Series, Part 4: Punctuation Recognition 【reposted from 文心阁】

Author: 天涯凝望    Time: 2011-12-15 19:12
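  This tutorial requires one program, EmEditor, and two scripts: layout script 1.4 (排版脚本1.4) and the punctuation replacement script (标点符号替换脚本). The punctuation replacement script prepares the text before proofreading; the layout script processes the text after proofreading.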

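  When recognition is run on images at their original size, most punctuation marks are not recognized. Preprocessing the images and enlarging them to 2x is what makes the punctuation recognizable. Take care when proofreading punctuation in 汉王文本王 as well: if you intend to build a character library (字库), do not correct every mark. The end result is shown in the figure below; leaving some marks uncorrected simply makes the later batch processing easier.
[attach]50596[/attach]
  Typically the comma (,) is recognized as (1) or as (.). Correct only the (.) cases and leave the (1) cases alone. Do not correct (“), (”) or (……). All other marks can be corrected.
  All of the above applies when you do not yet have a character library project. Once you have a library, export the recognized text straight to TXT and batch-process the punctuation in EmEditor with the punctuation replacement script.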
  

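  The software is simple to use and is also covered in the installment on using character library projects, so I will not repeat the details here. The full punctuation replacement script follows. Besides punctuation, I have added some frequently occurring misrecognitions; if you notice other regular errors while using it, feel free to add your own rules: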


// Ellipses first: longer runs of dots must be handled before the single "." rule below.
document.selection.Replace("......","……",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace(".....","……",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("1…….","“……",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
// Quotation marks misread as pairs of "1", "." and ",".
document.selection.Replace("1.","“",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("..","“",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace(".1","“",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("1,","”",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace(",1","”",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace(",.","”",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace(".,","”",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace(",,","”",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
// Any remaining "1" or "." is treated as a comma.
document.selection.Replace("1",",",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace(".",",",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
// Frequently misrecognized characters and misplaced punctuation.
document.selection.Replace(",卜","小",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("i","训",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("训,练","训练",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("”卜",",小",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("“卜",",小",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("小、","小",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("咋、","个",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("咋,","个",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("介、","个",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("介,","个",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("斤、\\n","个",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("斤,\\n","个",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace(",”","”,",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("!,“","!”,",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("。,“","。”,",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("?,“","?”,",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("…,“","…”,",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("……“","“……",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("“,",",“",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("“““","……",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("““","“",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("。亨","哼",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("口亨","哼",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("包囊","包裹",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("囊住","裹住",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("六卜","“小",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("止,\\n","山",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("犬……","大……",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("”比如",",恍如",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("”比若",",恍若",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace(",比若","恍若",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace(",比惚","恍惚",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("录削","剥削",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("录夺","剥夺",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("录开","剥开",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("刚,\\n","刚",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("刚,刚","刚",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("刚,才","刚",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("剑,\\n","剑",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("光,\\n","光",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
document.selection.Replace("划,\\n","划",eeFindNext | eeFindReplaceEscSeq | eeReplaceAll);
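
The punctuation script is order-sensitive: longer patterns such as `......` have to be replaced before the catch-all `.` rule, otherwise the shorter rule consumes them first. Below is a minimal sketch of that idea in plain JavaScript, meant to be run outside EmEditor. `applyRules` and the `rules` table are hypothetical names used only for illustration (they are not part of the EmEditor API), and only a handful of the rules are reproduced.

```javascript
// Illustrative only: a plain-string stand-in for the EmEditor Replace calls.
// applyRules and rules are hypothetical names, not EmEditor API.
function applyRules(text, rules) {
  // Apply each [find, replacement] pair in order, replacing every literal occurrence.
  return rules.reduce(function (current, rule) {
    return current.split(rule[0]).join(rule[1]);
  }, text);
}

// A small subset of the rules, in the same order as the script.
var rules = [
  ["......", "……"],
  [".....", "……"],
  ["1.", "“"],
  [",,", "”"],
  ["1", ","],
  [".", ","]
];

var raw = "1.走吧.....,,他说1好";
var fixed = applyRules(raw, rules);
console.log(fixed); // “走吧……”他说,好

// Running the same rules in reverse order gives a different, wrong result,
// because the single-character rules destroy the longer patterns first.
var reversed = applyRules(raw, rules.slice().reverse());
console.log(reversed);
```

When adding your own rules, the same principle suggests placing more specific patterns above the more general ones they overlap with.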
Author: 天涯凝望    Time: 2011-12-15 19:14
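Qidian (起点) image source: script for removing garbage text and re-flowing paragraphs

Every Qidian account has a different numeric ID, so adjust the code to match your own.

The script is as follows: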


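// Replace throughout the document when nothing is selected; otherwise only inside the selection.
var nFlags;
if (document.selection.IsEmpty)
    nFlags = eeFindNext | eeReplaceAll | eeFindReplaceRegExp;
else
    nFlags = eeFindNext | eeReplaceSelOnly | eeReplaceAll | eeFindReplaceRegExp;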
document.selection.Replace("^","囧",nFlags);
document.selection.Replace(" ","——",nFlags);
// Each pass halves a run of "——", so nine passes collapse runs of up to 512 of them into one.
for (var i = 0; i < 9; i++)
    document.selection.Replace("————","——",nFlags);
document.selection.Replace("囧——","  ",nFlags);
document.selection.Replace("囧","",nFlags);
document.selection.Replace("留,6呕\\n","",nFlags);
document.selection.Replace("——酣.+\\n","\\n",nFlags);
document.selection.Replace("——留.+\\n","\\n",nFlags);
document.selection.Replace("——毖.+\\n","\\n",nFlags);
document.selection.Replace("——鳃.+\\n","\\n",nFlags);
document.selection.Replace("——蚓.+\\n","\\n",nFlags);
document.selection.Replace("^加x.+\\n","\\n\\n",nFlags);
document.selection.Replace("^奶x.+\\n","\\n\\n",nFlags);
document.selection.Replace("\\n","",nFlags);
// nFlags enables regular expressions, so ?, ( and ) are escaped in the search patterns.
document.selection.Replace("。  ","。\\n  ",nFlags);
document.selection.Replace("\\?  ","?\\n  ",nFlags);
document.selection.Replace("…  ","…\\n  ",nFlags);
document.selection.Replace("。  ","。\\n  ",nFlags);
document.selection.Replace(":  ",":\\n  ",nFlags);
document.selection.Replace("!  ","!\\n  ",nFlags);
document.selection.Replace("\\)  ",")\\n  ",nFlags);
document.selection.Replace("”  ","”\\n  ",nFlags);
document.selection.Replace("※  ","※\\n  ",nFlags);
document.selection.Replace("」  ","」\\n  ",nFlags);
document.selection.Replace("\\(未完待续,如欲知后事如何.+\\n","\\n\\n\\n",nFlags);
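
To make the intent of the script above easier to follow, here is a simplified sketch of the same pipeline in plain JavaScript, meant to be run outside EmEditor. It assumes that paragraphs start with leading spaces, that continuation lines do not, and that watermark lines can be recognized by a prefix supplied by the caller. `reflowParagraphs` and `watermarkPrefixes` are hypothetical names, and the sketch does not reproduce the script's intermediate `囧` and `——` markers.

```javascript
// Illustrative only: not EmEditor API. reflowParagraphs and watermarkPrefixes
// are hypothetical names chosen for this sketch.
function reflowParagraphs(text, watermarkPrefixes) {
  var lines = text.split("\n");

  // Drop lines that start with one of the account-specific watermark prefixes.
  var kept = lines.filter(function (line) {
    return !watermarkPrefixes.some(function (prefix) {
      return line.indexOf(prefix) === 0;
    });
  });

  // Normalize any leading run of spaces to a two-space paragraph indent,
  // then join everything into a single line.
  var joined = kept
    .map(function (line) { return line.replace(/^ +/, "  "); })
    .join("");

  // Start a new paragraph wherever closing punctuation is followed by the indent.
  return joined.replace(/([。?…:!)”※」])  /g, "$1\n  ");
}

var sample = [
  "    他走了。",
  "然后回来",
  "了。",
  "加x12345",
  "  她笑了。"
].join("\n");

var result = reflowParagraphs(sample, ["加x"]);
console.log(result);
```

Passing the prefixes as a parameter keeps the account-specific part in one place, which is the part each reader has to change.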



