OCR stands for Optical Character Recognition. It lets us extract text and layout information from image files such as MDI and TIFF. When you scan a paper page into a computer, the result is just an image file, a photo of the page. The computer cannot interpret the letters in that image; OCR converts the image into a text or word-processor file, so that the text becomes readable as text.
OCR can be performed with the Microsoft Office Document Imaging (MODI) object model. To use it, add the MODI library to your development project. The MODI object model consists of the following objects:
Document object: Represents an ordered collection of pages (images).
Image object: Represents a single page of a document.
Layout object: Represents the results of optical character recognition (OCR) on a page.
MiDocSearch object: Exposes document search functionality.
Viewer control: An ActiveX control that displays the pages of a document.
Example of extracting text from a TIFF file:
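Dim strWordInfo As String = ""
Dim docs As New MODI.Document
docs.Create("C:\test.tif")
Dim Success As Integer = Analyse(docs)
If Success = 1 Then
    ' Append the recognized text of every page, not just the first one
    For j As Integer = 0 To docs.Images.Count - 1
        strWordInfo = strWordInfo & " " & CType(docs.Images(j), MODI.Image).Layout.Text
    Next
    ' Double any single quotes (e.g. before embedding the text in a SQL string literal)
    strWordInfo = strWordInfo.Replace("'", "''")
End If
' Close the document without saving so the image file is released
docs.Close(False)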
Function Analyse(ByVal Doc As MODI.Document) As Integer
    ' Returns 1 if OCR succeeded, 0 otherwise
    If Doc Is Nothing Then
        Return 0
    End If
    Try
        ' MODI call for OCR: OCR(language, auto-rotate image, straighten image)
        Doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, True, True)
        Return 1
    Catch ex As Exception
        ' OCR failed, or no text was recognized
        Return 0
    End Try
End Function
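As a variation on the example above, the following is a minimal sketch that collects the recognized text of each page separately instead of concatenating everything into one string. The module name OcrSketch and the function name ExtractPages are hypothetical and chosen only for illustration. The sketch assumes the MODI type library is referenced and that the OCR language is English.

```vb
Imports System.Collections.Generic

Module OcrSketch
    ' Hypothetical helper (not part of MODI): returns one string per page.
    Function ExtractPages(ByVal path As String) As List(Of String)
        Dim pages As New List(Of String)
        Dim doc As New MODI.Document
        doc.Create(path)
        Try
            ' Same OCR call as in the example: English, auto-rotate, straighten
            doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, True, True)
            For i As Integer = 0 To doc.Images.Count - 1
                ' Each Image is one page; its Layout holds the OCR result
                Dim page As MODI.Image = CType(doc.Images(i), MODI.Image)
                pages.Add(page.Layout.Text)
            Next
        Finally
            ' Close without saving so the image file is released
            doc.Close(False)
        End Try
        Return pages
    End Function
End Module
```

A caller could then use ExtractPages("C:\test.tif") and process the pages one at a time. Unlike Analyse above, this sketch lets an OCR exception propagate to the caller.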
Note: The most important prerequisite for all of these tasks is to add a reference to the "Microsoft Office Document Imaging Type Library" to your project:
Microsoft Office 2003: add "Microsoft Office Document Imaging 11.0 Type Library"
Microsoft Office 2007: add "Microsoft Office Document Imaging 12.0 Type Library"