Tuesday, March 27, 2012

Customized filter for content searching

Hi!
I have some Office documents which I am storing as image
type (as blobs) in a table. The blob holds some additional
header data besides the content of the Office documents.
Is there a way to integrate just the content of the Office
document with SQL Server search?
I know that one way to do it is by implementing IFilter.
Can someone explain how that would work, or send me
appropriate links?
Another question is about HTML files that have images
in them. How do they get stored in the database and yet
qualify for SQL Server search? How can one store the HTML
file and the folder with images as blobs and yet enable
search on the document?
Any help would be appreciated. I am using SQL Server 2000
with all the service packs applied.
Thanks.
|||SQL FTS can only index document contents, not properties. Headers and
footers of Word docs are indexed as part of the document body.
So if by header data you mean the document summary or custom Office properties,
these are not indexed by SQL FTS. If you mean the header or the footer, this
will work.
Images in HTML files referenced by tags, i.e. img src, will not be indexed,
as only the document contents are indexed, not the contents of meta tags or
src attributes.
The indexing that is done of image documents is rudimentary. Some of the
custom image iFilters do expose interfaces to index these properties, but
not in SQL FTS. In other search services like SharePoint Portal Server,
Indexing Service, and Exchange content indexing it is possible to index
some properties and OCR'd content of TIFFs. But these are normally not
indexed as attachments or embedded objects of documents, and are never
indexed when they are parts of HTML docs. Again, SQL FTS does not index them as
they are, for the most part, properties.
Your best approach would be to extract the textual data from these documents
and store the metadata/properties in a column in the table you are
full-text indexing.
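One way to follow that advice for a blob that carries application header data in front of the Office document is to split the two before storage, so the document bytes can go into the full-text indexed image column by themselves and the header into a separate column. The sketch below is in Python and assumes a hypothetical layout (a 4-byte big-endian length, then the header, then the document); the function names and the layout are illustrative and do not come from the thread.

```python
import struct
from typing import Tuple

# Hypothetical blob layout (an assumption, not a SQL Server format):
#   [4-byte big-endian header length][header bytes][office document bytes]
_LENGTH_PREFIX = struct.Struct(">I")


def pack_blob(header: bytes, document: bytes) -> bytes:
    """Combine application header data and the document into one blob."""
    return _LENGTH_PREFIX.pack(len(header)) + header + document


def split_blob(blob: bytes) -> Tuple[bytes, bytes]:
    """Return (header, document) from a blob produced by pack_blob."""
    if len(blob) < _LENGTH_PREFIX.size:
        raise ValueError("blob is too short to contain a length prefix")
    (header_len,) = _LENGTH_PREFIX.unpack_from(blob, 0)
    start = _LENGTH_PREFIX.size
    end = start + header_len
    if end > len(blob):
        raise ValueError("header length exceeds blob size")
    # The document part is what would be stored in the indexed column.
    return blob[start:end], blob[end:]
```

With the parts separated, only the second element of the returned pair needs to reach the column that the full-text catalog is built on.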
"kay" <anonymous@.discussions.microsoft.com> wrote in message
news:f6fa01c43de7$0591afb0$a501280a@.phx.gbl...
> Hi !
> I have some office documents which I am storing as image
> type (as blobs) in a table. I have some additional header
> data in the blob other than the content of the office
> documents. Is there a way to integrate just the content
> of teh office document with the SQL server search?
> I know that one way to do it is by implementing IFilter.
> Can someone explain how that will work or send me
> appropriate links for that.
> Another question is about the html files that have images
> in it. How does that get stored in the database and yet
> qualify for SQL server search? How can one store the html
> file and the folder with images as blobs and yet enable
> the search on the document?
> Any help would be appreciated. I am using SQL server 2000
> with all the service packs applied.
> Thanks.
|||Kay,
Could you provide more info regarding your table structures, i.e.,
CREATE TABLE statements? This may be possible, if I understand your
requirement correctly. There is a way to integrate the content of the Office
documents with SQL Server Full-Text Search (FTS) in SQL Server 2000.
Check out the SQL Server 2000 Books Online (BOL) title "Filtering Supported
File Types".
As for your images (jpg files, etc.), you will need to store them separately
in a column defined with the IMAGE datatype. See the following KB articles
on importing & extracting binary files (images) into and out of SQL Server:
258038 (Q258038) HOWTO: Access and Modify SQL Server BLOB Data by Using the ADO Stream Object
http://support.microsoft.com/?kbid=258038
309158 (Q309158) HOW TO: Read and Write BLOB Data by Using ADO.NET with C#
http://support.microsoft.com/default.aspx?scid=kb;EN-US;309158
308042 (Q308042) HOW TO: Read and Write BLOB Data by Using ADO.NET with VB.NET
http://support.microsoft.com/default.aspx?scid=kb;EN-US;308042
326502 (Q326502) HOW TO: Read and Write BLOB Data by Using ADO.NET Through ASP.NET
http://support.microsoft.com/?id=326502
Depending upon what you want to search on, you can implement a JPEG IFilter
or use the anchor text in the HTML file as the search string for the image.
If you have further questions, please post your table structures as well as
your SQL FTS queries.
Regards,
John
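The anchor-text idea can be sketched outside the database: pull the alt text and any enclosing link text for each image out of the HTML, and store that text in a column that is full-text indexed. The Python below is only an illustration using the standard library's html.parser; the class name, function name, and dictionary keys are hypothetical.

```python
from html.parser import HTMLParser


class ImageTextCollector(HTMLParser):
    """Collect, for each <img>, its src, alt text, and enclosing <a> text."""

    def __init__(self):
        super().__init__()
        self.images = []            # one dict per <img> found
        self._anchor_depth = 0      # > 0 while inside an <a> element
        self._anchor_text = []      # text pieces seen inside the current <a>
        self._anchor_images = []    # images seen inside the current <a>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self._anchor_depth += 1
        elif tag == "img":
            entry = {
                "src": attributes.get("src") or "",
                "alt": attributes.get("alt") or "",
                "anchor_text": "",
            }
            self.images.append(entry)
            if self._anchor_depth:
                self._anchor_images.append(entry)

    def handle_data(self, data):
        if self._anchor_depth:
            self._anchor_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1
            if self._anchor_depth == 0:
                # Normalise whitespace and attach the text to each image.
                text = " ".join("".join(self._anchor_text).split())
                for entry in self._anchor_images:
                    entry["anchor_text"] = text
                self._anchor_text = []
                self._anchor_images = []


def image_search_text(html: str) -> list:
    """Return the collected image entries for an HTML string."""
    collector = ImageTextCollector()
    collector.feed(html)
    collector.close()
    return collector.images
```

Each entry's alt and anchor_text values could then be written to a text column next to the image blob, so a full-text query on that column finds the image row.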
"kay" <anonymous@.discussions.microsoft.com> wrote in message
news:f6fa01c43de7$0591afb0$a501280a@.phx.gbl...
> Hi !
> I have some office documents which I am storing as image
> type (as blobs) in a table. I have some additional header
> data in the blob other than the content of the office
> documents. Is there a way to integrate just the content
> of teh office document with the SQL server search?
> I know that one way to do it is by implementing IFilter.
> Can someone explain how that will work or send me
> appropriate links for that.
> Another question is about the html files that have images
> in it. How does that get stored in the database and yet
> qualify for SQL server search? How can one store the html
> file and the folder with images as blobs and yet enable
> the search on the document?
> Any help would be appreciated. I am using SQL server 2000
> with all the service packs applied.
> Thanks.
|||Thanks a lot Hilary, for your prompt reply.
I could get the Office document blobs working with FTS;
that is not where I faced the problems. I have a couple
of questions regarding issues pertaining to these.
1) I want to add some of my own customized data, which our
programming system uses, alongside the Office
document blob. Say I first add my own serialized data to
the blob, then add the Office document data to the
blob, and upload it to the image field. If I do that, is
it still possible to use FTS on the part
of the blob which is the actual Office document data? I
mean, is there some way that I could write some code that
gives FTS only the relevant data to be used for
indexing?
2) If I have a Word document which has embedded pictures
and I save it as filtered HTML, I get some images
in a folder, and the images are linked in the HTML file.
How can I upload this filtered HTML document as a blob?
Does the folder containing the images have to be
stored separately from the HTML blob? Or is there
some way in which both the HTML and the folder with the
images are added to the same blob and yet work with FTS?
I hope my questions are clear. Any help from you would be
appreciated. I can always split up the blob and store it as
separate fields, but if there is a way to store them all
in the same blob, it would be great.
Thanks a lot!
Regards,
kay

|||Thanks a lot John, for your prompt reply.
I have posted the same question content to Hilary too. I
would appreciate it if you could also send me your
thoughts and expertise on this.
I could get the Office document blobs working with FTS,
and the search results are satisfactory; that is not where
I faced the problems. My two questions are the ones I
posted in my reply to Hilary.
Thanks a lot !
Regards,
kay

|||1) No. Unless the iFilter associated with the document's extension knows how to handle attachments or embedded documents, this is not possible. The iFilter interface is able to handle streams and storages, so the question is whether this iFilter has implemented it.
2) Save the HTML doc as a web archive page (mht). This will contain all the HTML, images, etc. You will only be able to index the document body, and not the "attachments", i.e. only pure text is exposed.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
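A web archive (.mht) file is a MIME multipart/related message, which is normally produced by "Save as Web Archive" in Word or Internet Explorer. For illustration only, the Python sketch below builds such a single-file archive from an HTML string and its images with the standard email package. The function name, the file:/// locations, and the input shape are assumptions, and whether the filter registered for .mht on a given server accepts this output would need to be tested.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(html: str, images: dict) -> bytes:
    """Bundle an HTML page and its images into one multipart/related blob.

    `images` maps the relative path used in the HTML
    (for example "doc_files/image001.gif") to a (bytes, subtype) pair.
    """
    root = MIMEMultipart("related", type="text/html")

    # The first part is the page itself; its location is a made-up base
    # against which the relative image paths resolve.
    page = MIMEText(html, "html", "utf-8")
    page.add_header("Content-Location", "file:///doc.htm")
    root.attach(page)

    # Each image becomes one more part, located where the HTML expects it.
    for path, (data, subtype) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part.add_header("Content-Location", "file:///" + path)
        root.attach(part)

    return root.as_bytes()
```

The returned bytes are a single blob, so the page and its image folder travel together; as noted above, only the text of the HTML body would be exposed to indexing.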
|||Hi!
Thanks again for your reply. Have you come across any
working examples of customized IFilter code? If so, can
you send me the link?
Thanks for your help.
Regards,
kay

|||MSDN is your best source for this. I'd also check out iFilterShop.com for custom iFilters.
|||Would you know if a customized IFilter could call the
Office filters?

|||Would you know if a customized IFilter could call the
Office filters, and whether these Office filters can then be
used for building the catalog?
