Posted: Fri Oct 19, 2007 10:58 am
by Malcolm
Alright, I need to duplicate what I think is the real DB of info for our hosted sites. Unfortunately, I can't just write a convenient little script to grab that info due to various things. I just need a small app that can grab a list of webpages (following a regular pattern) & save the HTML source to a .txt file (for later parsing). Anyone got any favourite macro programs they use?

Posted: Mon Oct 22, 2007 1:18 pm
by Malcolm
Fuck it, finally broke down & got the latest version of EZ Macros.

Posted: Mon Oct 22, 2007 1:40 pm
by TheCatt
Curl. Possibly curl inside of a VBScript/perl/whatever.
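
For what it's worth, the "whatever" could also be plain Python, which needs nothing but its standard library for the basic job of grabbing a patterned list of pages and saving the source as .txt. This is only a minimal sketch under assumptions: the URL pattern, the `{n}` placeholder, the output file naming and the helper names (`page_urls`, `fetch_source`, `save_sources`) are all made up for illustration, and it does not handle logins.

```python
import pathlib
import urllib.request


def page_urls(pattern, start, stop):
    """Expand a pattern like 'http://example.com/p?id={n}' over start..stop inclusive."""
    return [pattern.format(n=n) for n in range(start, stop + 1)]


def fetch_source(url):
    """Return the raw bytes the server sends: the page source, not the rendered text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def save_sources(urls, out_dir, fetch=fetch_source):
    """Fetch each URL and write it to out_dir/page_0001.txt, page_0002.txt, and so on."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls, start=1):
        target = out / f"page_{i:04d}.txt"
        target.write_bytes(fetch(url))  # bytes in, bytes out: no re-encoding of the HTML
        saved.append(target)
    return saved


# Hypothetical usage (pattern and range are placeholders):
# save_sources(page_urls("http://example.com/item.asp?id={n}", 1, 1000), "pages")
```

The `fetch` parameter is there so the download step can be swapped out later, for example for something that carries a login session.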

Posted: Mon Oct 22, 2007 1:41 pm
by GORDON
Malcolm wrote: Fuck it, finally broke down & got the latest version of EZ Macros.
I know where you can get a version of EZ Macros from 1998 where you get the full version just by installing and changing the executable from "ezeval" to "ezmacros."

Posted: Mon Oct 22, 2007 1:45 pm
by TPRJones
There's probably some way to do this with visual basic, but I don't know how to get it to hit web addresses.

If EZMacros doesn't work, let me know and I'll hack at it one night with VB.

EDIT: Okay, the Open method hits web addresses just fine, so yeah, VB could take a list of web addresses and cycle through them opening them and saving to the local drive. Not sure just how to convert to source text on the way but I'm sure it can probably be done.




Edited By TPRJones on 1193075170

Posted: Mon Oct 22, 2007 2:20 pm
by TPRJones
This doesn't work. It will pull up and save files just fine, but it doesn't get at the source code. I can't seem to figure out how to make that part work. Yet. But to yank the visible file contents, this will work fine.

Make a sheet in Excel with column A being complete web addresses (without http:// at the front, although if you want to add that you can, just take it out below), and column B being where to save the files (full drive and path, too, although as with the web addresses you can add the path into the script below and not have to add it to every line of the sheet). Make a copy of your sheet (the macro below will churn through it and delete the contents). Execute the following code on the copy:

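    ' Keep a reference to the list sheet: opening a page makes the new workbook active.
    Dim listSheet As Worksheet, page As Workbook
    Set listSheet = ActiveSheet
    Application.DisplayAlerts = False   ' suppress overwrite and file-format prompts
    Do While listSheet.Range("A1").Value <> ""
        Set page = Workbooks.Open(Filename:="http://" & listSheet.Range("A1").Value)
        page.SaveAs Filename:=listSheet.Range("B1").Value, FileFormat:=xlUnicodeText, CreateBackup:=False
        page.Close SaveChanges:=False
        listSheet.Rows(1).Delete Shift:=xlUp   ' drop the finished row so A1 holds the next address
    Loop
    Application.DisplayAlerts = True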
The source code eludes me so far. I will find a way, though.

Posted: Mon Oct 22, 2007 2:24 pm
by Malcolm
You take way too much pleasure in these technical intricacies.

Posted: Mon Oct 22, 2007 2:57 pm
by TheCatt
Anyone with BALLS would have used curl.

Posted: Mon Oct 22, 2007 2:57 pm
by TPRJones
Shush, you. This is what I do for fun.

I don't have FrontPage, and that's probably where the library with the command I'd need to make this work is hiding. If there's a way to view HTML source code in Word or Excel or Publisher, I sure can't find it.

I like VB because it's on almost everyone's computer these days, making it the most widely installed compiler I'm aware of. But it's pretty heavily tied into the Office suite, and that can be a bit annoying when you are trying to do something the Office apps weren't really meant to do.

Posted: Mon Oct 22, 2007 3:28 pm
by Malcolm
Well, hell, the other technical issue you might take orgasmic pleasure in is this ...

I log into the admin panel for a site going thru some weird login page using something that's obviously not a standard HTML form, but when you look at the source, you can clearly make out where the username & password go. All that said, I can't figure out how to make it submit the form w\ my desired credentials, so I can then go in & browse the admin panel (where all the info I want is).

Normally I just pick thru webpages by grabbing them in .txt format w\ Python & parsing them later at my leisure. However, I can't figure out a way to log in to the site from the Python code. Which means I can't get whatever info is used to verify that the connection is secure, which itself means I never get to see the HTML for the admin panel.

So, I did some digging & found ways to feed cookies into Python code (I guessed they were using a cookie -- stupid me). It appears to be some kind of mutant ASP session id that I cannot, in any way, shape, or form, even find in the browser's cache in any usable format.

Then I thought I could hack my way in w\ Javascript & just steal things using getElementById. I do not have permission to grab elements by id. Fucking unreal. If they went to as much trouble designing their interfaces as they do their security measures (keeping me from MY OWN DATA nonetheless), the end result might not suck ass so hard. Fuck you, Monster Commerce.

Posted: Mon Oct 22, 2007 3:57 pm
by TPRJones
Ah, well in that case even if I could fix the view-source problem, that bit of code wouldn't help you any.

That does sound ugly. If you can just get logged in, EZMacros sounds like the way to go.




Edited By TPRJones on 1193083114

Posted: Mon Oct 22, 2007 5:17 pm
by TheCatt
CURL does logins.

Posted: Mon Oct 22, 2007 5:27 pm
by Malcolm
& it can save the page source for each of the ~1000 pages I need?

Posted: Mon Oct 22, 2007 5:30 pm
by TheCatt
Yes. IT CAN DO EVERYTHING.

I know we've covered this before.

Posted: Mon Oct 22, 2007 5:34 pm
by TheCatt
Yeah, we did.

It does HTTP and FTP automation like nothing else. You may need to have a scripting language around it to variablize things, but it can do everything you've mentioned.
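
A rough sketch of what "a scripting language around it" might look like, here with Python building the curl command lines. The curl options used (`--cookie-jar` to write cookies to a file, `--cookie` to read them back, `--data-urlencode` to POST a form field, `--location`, `--silent`, `--output`) are real curl flags; the URLs, the form field names and the file names are hypothetical and would have to be read off the actual login page source.

```python
import subprocess


def curl_login_cmd(login_url, form_fields, cookie_file):
    """Build a curl command that POSTs the login form and stores cookies in cookie_file."""
    cmd = ["curl", "--silent", "--location", "--cookie-jar", cookie_file]
    for name, value in form_fields.items():
        # name=value form: curl URL-encodes the value part before sending it.
        cmd += ["--data-urlencode", f"{name}={value}"]
    cmd.append(login_url)
    return cmd


def curl_fetch_cmd(page_url, cookie_file, out_path):
    """Build a curl command that replays the stored cookies and saves the page source."""
    return ["curl", "--silent", "--location",
            "--cookie", cookie_file, "--output", out_path, page_url]


def run(cmd):
    """Run one of the commands above; raises CalledProcessError if curl exits non-zero."""
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)


# Hypothetical usage: log in once, then loop over the pages with the same cookie file.
# run(curl_login_cmd("http://example.com/login.asp",
#                    {"username": "me", "password": "secret"}, "cookies.txt"))
# for n in range(1, 1001):
#     run(curl_fetch_cmd(f"http://example.com/admin/item.asp?id={n}",
#                        "cookies.txt", f"page_{n:04d}.txt"))
```

Passing the command as a list avoids shell quoting problems with odd characters in passwords or URLs.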

Posted: Mon Oct 22, 2007 6:57 pm
by Malcolm
Alright, for Christ's sake. Is this something that I can hit the command line w\ if need be? Like if things really suck & I need to break out the fucking DOS script?

Posted: Mon Oct 22, 2007 7:16 pm
by TheCatt
Curl lives for the command line.

Posted: Mon Oct 22, 2007 7:26 pm
by Malcolm
Damn, they have made that app into something of a sledgehammer, haven't they?

Posted: Fri Oct 26, 2007 3:32 pm
by Malcolm
Alright, it's time to apply the sledgehammer since all other less brutal options have seemingly vanished. One thing worries me: in attempting to raid the info w\ Python, there's apparently a way to log in, but for one request only. In other words, Python doesn't seem to remember whatever validation the other site wants. Normally, I'd just chalk this up to something fucked up w\ Python, but these fuckers have proven unusually resilient w\ their defenses.
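
A login that holds for only one request is what you would typically see if the session cookie set by the login response is never sent back. That is a guess about this site, not a confirmed diagnosis. A minimal Python sketch of keeping cookies across requests follows, using the Python 3 module names (`http.cookiejar` and `urllib.request`; in Python 2 these were `cookielib` and `urllib2`). The login URL and the form field names are hypothetical.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar); every request made through opener shares the jar's cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(login_url, fields):
    """Build a POST request carrying the form fields, URL-encoded."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(login_url, data=body, method="POST")


# Hypothetical usage: the same opener must be used for the login and for every later page.
# opener, jar = make_session()
# opener.open(login_request("http://example.com/login.asp",
#                           {"username": "me", "password": "secret"}))
# html = opener.open("http://example.com/admin/item.asp?id=1").read()
```

Calling `urllib.request.urlopen` directly for the later pages would bypass the jar, which would produce the "logged in for one request only" behaviour.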

Posted: Fri Oct 26, 2007 4:32 pm
by TheCatt
Something like... a cookie? Trap the cookies with curl.