Monday, March 26, 2012

import hierarchical data

hi folks,

I have to import hierarchical text files like:
32;country;city;postalcode;street
21;name;firstname;salutation;title;age;nickname
21;name;firstname;salutation;title;age;nickname
...

additionally I have to eleminate doubles. what is the best way for this problem ?
I have set up a flatfilesource with two columns and a conditional split on the first column
so now I have an output with [country;city;postalcode;street] and one with [name;firstname;salutation;title;age;nickname]. How do I split this in columns, put it in a dataset keeping the relations and remove doubles ?

Iam looking forward for any helping idea.

rgrds,
matze

1. Use the Sort transform to remove dupes.

2. This is just a string manipulation problem right? There is a library of string manipulation functions available in the Derived Column transform.

-Jamie

|||

Thanks for reply Jamie,

First I cannot find any functionality to split the input column into multiple output columns in Derived Colmn Transform. Second how to get this in a pair of related tables ?

please help out

|||

Have you looked at the FINDSTRING and SUBSTRING functions?

When you say the tables are related do you mean there is a foreign key between them? if so you can use 1 f the methods defined here: http://blogs.conchango.com/jamiethomson/archive/2006/02/17/2877.aspx

-Jamie

|||Hi Jamie,

Iam back on this and its driving me nuts.
You wont believe it, but Iam a absolute novice on SSIS ;)
So please let me explain more in detail and exercise paitience with me.

thats the schema:
32;country;city;postalcode;street
21;name;firstname;salutation;title;age;nickname

32 identifies a city line, 21 identifies a person line

example:
32;england;london;2445;poolstreet 45
21;smith;joe;mr;;23;jo
21;henderson;paul;mr;doctor;54;paul
32;sweden;stockholm;4356;javanstreet 2
21;smith;joe;mr;;23;jo
32;england;london;2445;poolstreet 45
21;jhonson;pieter;mr;;33;pieter

let there be a table 'city' and a table 'person' with an m:n relatation table cityperson
I want to have a result as follows:
table city
1 england london 2445 poolstreet 45
2 sweden stockholm 4356 javanstreet 2

tabel person
1 smith joe mr 23 jo
2 henderson paul mr doctor 54 paul
3 jhonson pieter mr 33 pieter

table cityperson
1 1 1
2 1 2
3 1 3
4 2 1

whereas the first column idicates an identity value

What i did so far is reading my source file with flat file source into two columns (id, line) make a conditional split on id and make two derivied column transform. One for city lines and one for person lines. In these derived column transforms I split up the line column (eg. england;london;2445;poolstreet 45) with Findstring amd Substring into separate columns for each field. Now I have the two outputs a) all columns for a city and b) all columns for a person. What to do now ? How to get this into the tables three ?
By the way isnt it quite slow to extract the columns with Findstring and Substring form a line string ? I have to deal with files of nearly 30 megabytes. Is there a better way to accomplish this ?

|||

FINDSTRING() and SUBSTRING() in the Derived Column component are the best options and fastest available to you. I don't think you'll find a bottleneck there.

I understand that you now need a way of populating CityPerson. What I don't understand is how a relationship is defined in the source file between a city and a person.

In other words....how do you know by looking at the source file which city a person is in? Is it the City line that is immediately above it?

-Jamie

|||

Jamie,

>FINDSTRING() and SUBSTRING() in the Derived Column component are the best options and fastest available to you. I don't think you'll find a bottleneck there.

ok, that pacified me

>In other words....how do you know by looking at the source file which city a person is in? Is it the City line that is immediately above it?

Exactly. You got it, its defined by the city line above.
Not realy nice, I know. Therefore my problems keeping the relation.

again thanks for your support,
Matze

|||

OK, so there are probably a million and one ways to do this - but here's one method.

First, before the split, give each row an ID using the technique here: http://www.sqlis.com/default.aspx?37

Later in the pipeline, join the Person path back with the City path using a MERGE JOIN (you'll need a MULTICAST on the Person path). But only join where the City ID is less than the Person ID field. The output fields from the MERGE JOIN should be everything from the Person input and the ID from the City input.

The output from the MERGE JOIN should then become the input to a AGGREGATE transform. Inside there group by all the Person fields and select the maximum City ID. You will then have a record for each person along with the ID of the City field. It should then be pretty easy to constuct the CityPerson table.

Another way to describe the above is that you're building the following SQL statement:

SELECT p.*, MAX(c.ID) as CityID FROM Person p, INNER JOIN City c ON p.ID > c.ID GROUP BY p.*

where p & c are the inputs to the MERGE JOIN.

Hope that helps!!

-Jamie

|||

Jamie,

it's me the greenhorn again ;)
I have set up the ID for each row, but now become desperate on Merge Join.
>But only join where the City ID is less than the Person ID field.
I do not see a option to set up this in the Merge Join component. Seems to join only on ID equals ID.
Another issue is the requirement for piplines to be sorted. Do to the fact that the merge column is the ID and the ID is sorted, could I define this circumstance anywhere ?

thanks for your help again,
Matze

|||

Easy one first. On the output from the component where you add the IDs set IsSorted=TRUE and set SortKey=1 on the ID column. The data paths will then be sorted.

You're right about the MERGE JOIN tho. I admit I thought that you could do this and I didn't actually check it. My apologies for that. That's really bothered me actually - that's a real limitation of the component.

So, we need another solution. You can probably achieve this all in a single script component actually thinking about it. It will need to be an asynchronous component positioned before your conditional split but after your script component that adds the ID.

Loop over the rows. Each time you get a new City row store its ID in a local variable and add it to each Person row until you encounter the next City record.

It will need 2 outputs - 1 for City and 1 for Person

Here's a link that might help:

Difference between synch and asynch components: http://blogs.conchango.com/jamiethomson/archive/2005/07/25/1841.aspx

Sorry I led you the wrong way on this. Hopefully this method should work.

-Jamie

|||

Hi Matze,

I've re-read this thread and realised I've led you on a bit of a merry dance. Sorry about that.

To try and make up for it I had a go at building this myself and have come up with something pretty simple that works fine. If you want to drop me a mail via the contact link on my blog homepage (address below) then I'll be able to send it to you.

-Jamie

|||

Hi,

I replied to the address from which your mail came (mail@.mhess.com) but got a mail delivery failure.

Do you wanna email me direct at jamie.thomson@.removethisbit.conchango.com

-Jamie

sql

No comments:

Post a Comment