
Posted: Mon Oct 20, 2014 11:24 am
by Malcolm
At work, we currently have a couple of workflows. They both (see the sketch after this list):

1) grab zipped XML files from an FTP server through a proxy
2) check the names against a list of files already processed
3) unzip the unprocessed files
4) parse the XML
5) put the XML values into a flat file
6) move the flat file to some remote network storage space (so Oracle can use it for its external tables)
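Rough Python sketch of steps 2 through 6; the ledger file, directory paths, and XML element names are all placeholders, not our real layout:

Code:
import csv
import os
import shutil
import zipfile
import xml.etree.cElementTree as ET

PROCESSED_LOG = 'processed.txt'  # assumption: one already-done filename per line
FLAT_DIR = '/mnt/oracle_ext'     # assumption: the external-table directory

def already_processed():
    # step 2: names we've seen before
    if not os.path.exists(PROCESSED_LOG):
        return set()
    with open(PROCESSED_LOG) as f:
        return set(line.strip() for line in f)

def shred(zip_path, flat_path):
    # steps 3-5: unzip, parse the XML, write pipe-delimited rows
    with zipfile.ZipFile(zip_path) as zf:
        with open(flat_path, 'wb') as out:
            writer = csv.writer(out, delimiter='|')
            for name in zf.namelist():
                tree = ET.parse(zf.open(name))
                for rec in tree.iterfind('.//record'):  # assumed element names
                    writer.writerow([rec.findtext('id'), rec.findtext('value')])

def publish(flat_path):
    # step 6: hand the flat file over for the external tables
    shutil.move(flat_path, FLAT_DIR)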

We use a mixture of Linux, Java, the Oracle DB scheduler, bash scripts, and this unwieldy thing I want to kill. Back when we used to beam info directly to the DB from the Linux boxes, I might've bought the argument for it. As it sits now, all it does is shred XML and flatten the values. I don't need or want it.

We process, at most, tens of gigs at once. Any Oracle folk out there (like catt) have a standard tool for this kind of thing? I'm leaning towards getting rid of Kettle/Kitchen and its shitty design tool and converting everything to Python, using ProxyHandler in urllib2 for (S)FTP and cElementTree for XML parsing.
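The fetch piece would look something like this. One caveat: urllib2's ProxyHandler only gets me plain FTP through the proxy; the SFTP side would need something like paramiko. Proxy and host names are made up:

Code:
import urllib2

# placeholder proxy; the real one would come from config
opener = urllib2.build_opener(
    urllib2.ProxyHandler({'ftp': 'http://proxy.example.com:3128'}))

def fetch(url, dest):
    # stream in chunks so a multi-gig zip never sits in memory
    resp = opener.open(url)
    try:
        with open(dest, 'wb') as out:
            while True:
                chunk = resp.read(64 * 1024)
                if not chunk:
                    break
                out.write(chunk)
    finally:
        resp.close()

fetch('ftp://ftp.example.com/feed/batch1.zip', '/data/staging/batch1.zip')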


Posted: Mon Oct 20, 2014 12:12 pm
by TheCatt
Funnily enough, we're evaluating Pentaho right now.

But everything you describe there should be really easy in Pentaho. I wrote a similar script in about 60 minutes that does everything you describe except shredding the XML.

Maybe pay for the real version of Pentaho instead of the community edition.

Posted: Mon Oct 20, 2014 12:22 pm
by Malcolm
I'm sure it's easy if I can figure out the annoying visual designer.
TheCatt wrote:
I wrote a similar script in about 60 minutes that does everything you describe except shredding the XML.

Hah. I've got nothing except the XML parsing.

Posted: Mon Oct 20, 2014 1:08 pm
by TheCatt
I would say something about external table use being dumb... but I assume that the data set is small, and therefore not a performance problem?

Posted: Mon Oct 20, 2014 1:29 pm
by Malcolm
We used to pipe the values into the DB with SQL*Loader. It would occasionally crash during data loads with no explanation, probably due to the shitty JDBC lib it uses. Our local Oracle guru said to use files and external tables. The data set is maybe 20M entities in the largest set, 10-20ish gigs across all 3 data sets.

EDIT: Worst thing about Pentaho: we run the Kettle/Kitchen jobs on our Linux box but design the config files on a Windows box. Pentaho swaps in non-Linux-friendly escape values during design, so when the config files go back up to Linux, they explode, and I have to un-convert those values back to Linux-friendly form.
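The un-convert step is just a dumb filter. I'm hand-waving the specifics here; assume the breakage is CRLF line endings and backslash path separators written in on the Windows side:

Code:
import sys

def unconvert(path):
    # assumption: the "explosions" come from CRLF line endings and
    # backslash separators baked in by the Windows-side designer
    with open(path, 'rb') as f:
        data = f.read()
    data = data.replace('\r\n', '\n')  # CRLF -> LF
    data = data.replace('\\', '/')     # crude: nukes every backslash
    with open(path, 'wb') as f:
        f.write(data)

for name in sys.argv[1:]:
    unconvert(name)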


Posted: Mon Oct 20, 2014 3:38 pm
by TheCatt
Your dev and prod environments should be the same. You should not develop and deploy on separate OSes.

I'm not aware that sqlldr uses JDBC, and it's exceptionally reliable. I question your Oracle person. I've used it for more than a decade and never had problems with it.

Posted: Mon Oct 20, 2014 3:42 pm
by Malcolm
TheCatt wrote:
Your dev and prod environments should be the same. You should not develop and deploy on separate OSes.

HAHAHAHAHA. You obviously work in a very different place.

TheCatt wrote:
I'm not aware that sqlldr uses JDBC, and it's exceptionally reliable.

It's Kitchen that uses sqlldr to bulk-load data into Oracle. From a few things I've read, sqlldr + Kitchen + JDBC = sockets occasionally closing at random during the upload to the DB.

Posted: Mon Oct 20, 2014 4:19 pm
by TheCatt
I've not used sqlldr via Pentaho so far, but why not just call it as a script with parameters?
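Something like this would cut JDBC out of the load entirely. The connect string, control file, and paths are made up:

Code:
import subprocess

# all of these values are placeholders
cmd = [
    'sqlldr',
    'userid=loader/secret@ORCL',
    'control=/etc/loads/feed.ctl',
    'log=/var/log/loads/feed.log',
    'data=/data/flat/feed.dat',
    'errors=0',  # abort on the first rejected row
]
rc = subprocess.call(cmd)
if rc != 0:
    raise RuntimeError('sqlldr exited with %d' % rc)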

Posted: Mon Oct 20, 2014 4:20 pm
by TheCatt
Malcolm wrote:
TheCatt wrote:
Your dev and prod environments should be the same. You should not develop and deploy on separate OSes.
HAHAHAHAHA. You obviously work in a very different place.
VMs are cheap and easy.

Posted: Mon Oct 20, 2014 4:29 pm
by GORDON
[image]

Posted: Mon Oct 20, 2014 5:04 pm
by TheCatt
I was expecting you to post "Your MOM is cheap and easy"

Posted: Mon Oct 20, 2014 5:12 pm
by GORDON
I fell asleep before I could think of it.

Posted: Mon Oct 20, 2014 8:19 pm
by Malcolm
TheCatt wrote:
Malcolm wrote:
Your dev and prod environments should be the same. You should not develop and deploy on separate OSes.

HAHAHAHAHA. You obviously work in a very different place.

VMs are cheap and easy.

This isn't a dev vs. prod difference. Every environment, from dev to prod, has Linux and Windows components talking to the Oracle DB.

I'm leaning towards Python if it can handle the retroactive work. We have a metric fuck-ton of memory on those servers that sits idle 99-100% of the time... because that's how it's been "optimized."


Posted: Mon Oct 20, 2014 8:22 pm
by Malcolm
GORDON wrote: [image]
No, that's the image for AS/400 discussions.

Posted: Mon Oct 20, 2014 8:29 pm
by GORDON
Malcolm wrote:
GORDON wrote: [image]
No, that's the image for AS/400 discussions.
Which, out of good taste, I do not initiate.

Posted: Mon Oct 20, 2014 8:34 pm
by TheCatt
Malcolm wrote: This isn't a dev vs. prod difference. Every environment, from dev to prod, has Linux and Windows components talking to the Oracle DB.

That statement does not square with this one:
We run the Kettle/Kitchen jobs on our Linux box but design the config files on a Windows box.

Pentaho is a very good tool for converting XML to flat files. Perhaps you should consider different developers?


Posted: Mon Oct 20, 2014 8:47 pm
by Malcolm
Every environment has a Linux component and a Windows component. Due to the shitty GUI designer required for config file work (I'm sure as shit not doing it by hand) and the fact that our Linux boxes at work are strictly command line, I'm forced into that crap. Our DB workflows are a goddamn mess.

Posted: Mon Oct 20, 2014 8:58 pm
by TheCatt
Every environment has Windows + Linux, yet you design on Windows and run on Linux.

Posted: Mon Oct 20, 2014 9:19 pm
by GORDON
Every time you guys say "pentaho" I think of this:

[image]

Posted: Mon Oct 20, 2014 9:49 pm
by TheCatt
That would be PEN-ta-ho(es). The software is pen-TAH-ho. For reasons unknown to me.