
OCR with MODI: Extracting Text from Image Files in VB.NET

Posted by Hirendra Sisodiya Articles | Office and VB.NET February 17, 2010
This article shows how to extract text and layout information from image files in formats such as MDI and TIFF.
 

OCR stands for Optical Character Recognition. It lets us extract text and layout information from image files in formats such as MDI and TIFF. When you scan a paper page into a computer, the result is just an image file, a picture of the page. The computer cannot interpret the letters in that picture; OCR converts the image into text or a word-processor file, so that the text itself can be read and processed.

OCR can be performed with the Microsoft Office Document Imaging (MODI) object model. To use it, add a reference to the MODI library in your development project. The MODI object model consists of the following objects:
 

               Document object:     Represents an ordered collection of pages (images).

               Image object:           Represents a single page of a document.

               Layout object:          Represents the results of optical character recognition (OCR) on a page.

               MiDocSearch object:  Exposes document search functionality.

               Viewer control:          An ActiveX control that displays the pages of a document.
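The relationship between these objects can be sketched as a simple traversal: a Document holds Images (pages), and each Image exposes a Layout after OCR has run. This is a minimal sketch, not a definitive implementation. It assumes a COM reference to the MODI type library; DumpWords and the file path are hypothetical names used for illustration, and the Words collection, the Word object, and the NumWords property are taken from the MODI type library rather than from the list above, so verify them against your installed version.

```vbnet
Imports System

' Sketch only: requires a COM reference to the MODI type library.
Module ModiObjectModelSketch

    ' Hypothetical helper: prints every recognized word, page by page.
    Sub DumpWords(ByVal path As String)
        Dim doc As New MODI.Document            ' Document: ordered collection of pages
        doc.Create(path)
        Try
            ' Run OCR on all pages; this throws if recognition fails.
            doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, True, True)

            For i As Integer = 0 To doc.Images.Count - 1
                ' Image: a single page of the document
                Dim page As MODI.Image = CType(doc.Images(i), MODI.Image)
                ' Layout: the OCR result for that page
                Dim layout As MODI.Layout = page.Layout
                Console.WriteLine("Page {0}: {1} words", i + 1, layout.NumWords)

                For k As Integer = 0 To layout.Words.Count - 1
                    Dim w As MODI.Word = CType(layout.Words(k), MODI.Word)
                    Console.WriteLine("  " & w.Text)
                Next
            Next
        Finally
            ' Release the file without saving any changes.
            doc.Close(False)
        End Try
    End Sub

    Sub Main()
        DumpWords("C:\test.tif")   ' hypothetical path
    End Sub

End Module
```

Walking the Words collection instead of reading Layout.Text is useful when each word needs to be inspected individually rather than taken as one block of text.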

  Example: extracting text from a TIFF file:
 

        Module OcrExample

            Sub Main()
                Dim strWordInfo As String = String.Empty
                Dim docs As New MODI.Document

                ' Open the image file (TIFF or MDI) as a MODI document.
                docs.Create("C:\test.tif")

                If Analyse(docs) Then
                    ' Append the recognized text of every page (index j, not 0).
                    For j As Integer = 0 To docs.Images.Count - 1
                        Dim page As MODI.Image = CType(docs.Images(j), MODI.Image)
                        strWordInfo = strWordInfo & " " & page.Layout.Text
                    Next

                    ' Escape single quotes by doubling them.
                    strWordInfo = strWordInfo.Replace("'", "''")
                End If

                ' Release the file without saving changes.
                docs.Close(False)
            End Sub

            ' Runs OCR on the document; returns True on success.
            Function Analyse(ByVal Doc As MODI.Document) As Boolean
                If Doc Is Nothing Then
                    Return False
                End If

                Try
                    ' Document.OCR(language, auto-rotate image, straighten image)
                    Doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, True, True)
                    Return True
                Catch ex As Exception
                    ' OCR failed, or no text was recognized.
                    Return False
                End Try
            End Function

        End Module
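Recognized text can contain stray characters, a problem readers raise in the comments on this article. The following is a minimal sketch of a cleanup helper, not part of MODI. It assumes that "junk" means any character that is not a letter, digit, punctuation mark, or whitespace; under that assumption currency and math symbols are dropped as well, so adjust the rule to your data. CleanOcrText is a hypothetical name.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module OcrTextCleanup

    ' Hypothetical helper: replaces every character that is not a letter,
    ' digit or punctuation mark with a single space, collapsing runs of
    ' such characters (including whitespace) into one space.
    Function CleanOcrText(ByVal raw As String) As String
        If String.IsNullOrEmpty(raw) Then Return String.Empty

        Dim sb As New StringBuilder(raw.Length)
        Dim lastWasSpace As Boolean = True   ' True so leading separators are dropped

        For Each c As Char In raw
            If Char.IsLetterOrDigit(c) OrElse Char.IsPunctuation(c) Then
                sb.Append(c)
                lastWasSpace = False
            ElseIf Not lastWasSpace Then
                sb.Append(" "c)
                lastWasSpace = True
            End If
        Next

        ' Drop the trailing space left by a final separator, if any.
        Return sb.ToString().TrimEnd()
    End Function

    Sub Main()
        ' A control character, doubled spaces and a line break are all
        ' reduced to single spaces.
        Console.WriteLine(CleanOcrText("Invoice" & ChrW(1) & "  No. 42" & vbCrLf & "Total"))
        ' prints: Invoice No. 42 Total
    End Sub

End Module
```

A helper like this could be applied to each page's Layout.Text before the pages are concatenated.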

Note: For any of this to work, the project must reference the "Microsoft Office Document Imaging Type Library". The version depends on the installed Office release:

 Microsoft Office 2003: add "Microsoft Office Document Imaging 11.0 Type Library"
 Microsoft Office 2007: add "Microsoft Office Document Imaging 12.0 Type Library"

 

I have sent my email ID.
Please check your message box on this site.

thanks

Posted by Hirendra Sisodiya Aug 06, 2010

Hi Hirendra,

Thanks for the reply, but how can I attach my source code and sample TIFF file here?

Regards
Asheesh Panwar

Posted by asheesh panwar Aug 06, 2010

Can you send me your code and that TIFF file?

thanks

Posted by Hirendra Sisodiya Aug 05, 2010

Hi Hirendra,

Thanks for the reply. I also did a lot of R&D but did not succeed in removing all the junk characters. I think I am not going about this the right way; could you please suggest the correct approach?

I would be very thankful to you.

Regards
Asheesh Panwar

Posted by asheesh panwar Aug 05, 2010

Hello Asheesh

I don't think we can do anything about that, but you can write your own function to replace these kinds of junk characters with blank spaces where possible.

thanks

Posted by Hirendra Sisodiya Aug 04, 2010