Saturday, February 25, 2012

Custom Connection Manager or SSIS-Provided One?

Hi,

I have been working with a client on a legacy application migration to SSIS, and I have found SSIS to be very helpful. As this is my first project using Integration Services, I find myself in a situation where multiple solutions could work, but I cannot assess the risk of choosing one over the other, nor the effort each would require to implement.

The situation is as follows: we cannot touch the input files that come from the other legacy systems, so I need to read information that is packaged according to this multipart schema:

&Header with context information from the source
&BatchHeader with context information from the batch (including format N)
- Data in format N
&BatchHeader with context information from the batch (including format K)
+Data in format K
&BatchHeader with context information from the batch (including format P)
=Data in format P

Etc. Needless to say, the batches can come in any order, and each needs different processing depending on its format. I have been tempted to use the technique shown in "Handling Different Row Types In The Same File" from http://www.sqlis.com/default.aspx?54.

However, there are multiple issues I have to resolve to do that. The first is how to feed the pipeline with the data in format N and then reparse it in an easy way, because the data that comes from the other system is very wide (approximately 120 columns), so a hand-written parsing script is by no means a viable solution (I have multiple formats to handle). So I do not see a way to treat the data without using a custom connection manager. Here comes my first question: is there any way to use the already provided connection managers to handle a scenario like this one?

Next comes the separation. Which is the wiser approach: write the data back to disk in an easier format, or pass it along to the next task in memory? The volume of data is on the order of hundreds of megabytes (100 to 200), maybe half a gigabyte in extreme cases. How about uploading it to SQL Server and working from there?

BTW, the production servers will have 6 to 8 GB of memory and 4 to 8 processors (more or less).

Thanks in advance.

Federico


Hi, I am not sure I understand your questions.

>>>> The first is how to feed the pipeline with the data in format N and then reparse it

The concept shown in the article loads the whole row as one column to get the data into the pipeline, and then you make decisions and parse from there. The article shows using script components to parse, but you could also just use a derived column transform and the expression syntax; it all depends on the complexity of splitting your file. My guess is that the trick in your case is identification of the rows. Is there something that can identify each row?

File header (multiple rows?)
K_BatchHeader (multiple rows?)
K_rows
P_BatchHeader (multiple rows?)
P_rows
...
...

Or is the only way you know the K rows for sure that they follow the K_BatchHeader and precede some other header?
It could be that a multi-step approach is best: one package to split the source file into individual per-format files (K, P, N), and another to read those files and do the actual data processing. This way, the second process can likely use the built-in parsing features. A rough sketch of the split step follows.
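
To make that split step concrete, here is a minimal sketch of such a routing component, assuming the file is read through a flat file source that puts each raw line into a single column (called Line here) and that the leading marker character really does identify each row. The three outputs and the generated DirectRowTo... methods take their names from whatever you configure in the script component editor, so treat NRows, KRows and PRows as placeholders.

// Minimal SSIS script transformation sketch (C#).
// Assumes one input column "Line" holding the raw row, and three synchronous
// outputs ("NRows", "KRows", "PRows") sharing an exclusion group so that each
// row is sent to exactly one of them. Header rows ('&') are simply dropped here.
public class ScriptMain : UserComponent
{
    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (Row.Line_IsNull || Row.Line.Length == 0)
            return;                                    // nothing to route

        switch (Row.Line[0])
        {
            case '-': Row.DirectRowToNRows(); break;   // data in format N
            case '+': Row.DirectRowToKRows(); break;   // data in format K
            case '=': Row.DirectRowToPRows(); break;   // data in format P
            default:  break;                           // '&' headers and unknown rows are ignored
        }
    }
}

Each output can then feed its own flat file destination, and the second package reads those per-format files with ordinary flat file connection managers and their built-in column parsing. A Conditional Split with expressions like SUBSTRING([Line],1,1) == "+" would achieve the same routing without any script.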

|||

What I can suggest is an approach.

Creating a custom connection manager won't do any good here because connection managers are agnostic to the Data Flow task. What you need is a custom source adapter...

Or, in certain cases you can use the flat file adapter to read in the file row by row and use a script transform to do the parsing.

The best thing to do is to create a custom source adapter. You'll have much more control and the solution will ultimately be more robust.
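
For anyone curious what that looks like, the skeleton of a custom source adapter is roughly the following. This is only a sketch: the component, property and column names are invented, the interface suffix (100) depends on the SSIS version, and deployment details (strong naming, copying to the PipelineComponents folder) are left out.

using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[DtsPipelineComponent(DisplayName = "Multipart File Source",
                      ComponentType = ComponentType.SourceAdapter)]
public class MultipartFileSource : PipelineComponent
{
    public override void ProvideComponentProperties()
    {
        base.RemoveAllInputsOutputsAndCustomProperties();

        // Custom property holding the path of the multipart file (illustrative name).
        IDTSCustomProperty100 path = ComponentMetaData.CustomPropertyCollection.New();
        path.Name = "FilePath";

        // One output with a single wide column; a real adapter would expose
        // one output (or set of parsed columns) per record format.
        IDTSOutput100 output = ComponentMetaData.OutputCollection.New();
        output.Name = "RawRows";
        IDTSOutputColumn100 col = output.OutputColumnCollection.New();
        col.Name = "Line";
        col.SetDataTypeProperties(DataType.DT_WSTR, 4000, 0, 0, 0);
    }

    public override void PrimeOutput(int outputs, int[] outputIDs, PipelineBuffer[] buffers)
    {
        string filePath = (string)ComponentMetaData.CustomPropertyCollection["FilePath"].Value;
        PipelineBuffer buffer = buffers[0];

        // Simplification: column 0 is assumed to be "Line"; a robust adapter maps
        // buffer column indexes in PreExecute via BufferManager.FindColumnByLineageID.
        using (var reader = new System.IO.StreamReader(filePath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                buffer.AddRow();
                buffer.SetString(0, line);
            }
        }
        buffer.SetEndOfRowset();
    }
}

The trade-off against the script-transform route is exactly as described above: more code and deployment overhead, but full control over how each record format is parsed and exposed to the pipeline.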

K

|||

Although a custom source component will do the trick, a script component may suffice.

|||

Thanks Craig, you understood correctly :)

In fact, the solution you propose is what I am going to implement, because I was finally able to get a sample of the format (what I had until this point was the functional extraction from the legacy COBOL programs), and the row format is not even normalized. So the problem now reduces to normalizing the file format (where I can also split it with a custom task), and then staging on the database and doing the further processing there.

So in fact I am moving toward your solution ;)

Again, thanks for the help.

Greetings
Federico Andrés Lois

|||

Yeah I used the wrong name, I meant a source adapter :)

As I wrote to Craig, I am moving toward pre-normalization and splitting in a task, simply because I found the data wasn't as clean as I thought (regarding the format), so a normalization step will be required before anything else.
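
For later readers: that pre-normalization/split step does not even need the pipeline. It can live in an SSIS Script Task (or a small console tool) that streams the multipart file once and writes each data row to a per-format file. The marker characters and output file names below are placeholders for whatever the real formats use.

// Rough sketch of the split/normalization step, e.g. called from an SSIS Script Task's Main().
// Assumes the leading character identifies the row type; paths and markers are placeholders.
using System.Collections.Generic;
using System.IO;

public static class MultipartSplitter
{
    public static void Split(string inputPath, string outputFolder)
    {
        // One writer per data format; '&' headers are skipped here, but this is the
        // place to capture batch-level context and carry it into each output file.
        var writers = new Dictionary<char, StreamWriter>
        {
            { '-', new StreamWriter(Path.Combine(outputFolder, "format_N.txt")) },
            { '+', new StreamWriter(Path.Combine(outputFolder, "format_K.txt")) },
            { '=', new StreamWriter(Path.Combine(outputFolder, "format_P.txt")) }
        };

        try
        {
            using (var reader = new StreamReader(inputPath))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    if (line.Length == 0) continue;

                    StreamWriter writer;
                    if (writers.TryGetValue(line[0], out writer))
                        writer.WriteLine(line.Substring(1));   // drop the marker, keep the record
                }
            }
        }
        finally
        {
            foreach (var w in writers.Values) w.Dispose();
        }
    }
}

Streaming like this keeps memory usage flat even at the 100-500 MB volumes mentioned above, and the resulting per-format files can then be loaded with plain flat file connection managers and staged in SQL Server for further processing.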

Thanks for the response as well.

Greetings
Federico Andrés Lois
