Regex spotting urls


  • Regex spotting urls

    Hello, dear javascripters.
    I'm in need of some help saving some time creating a links-from-text parser.
    Is there a good method any of you want to share with me? I lack any knowledge of the subject, and Google returns a vast ocean to navigate; I want to hop a ride on a speedboat. I don't need code, but I realize it's easier to post an example than it is to explain. I've seen a little bit about the bits and pieces, and I could spend a week trying to figure it out and get blisters on my fingertips from the googling, but I'm hoping someone will show me a wheel before I set out trying to make squares roll.
    My plan is to make a function that takes a user's text input (for a forum chat window) and senses the URLs. I just need to figure out a method to spot blocks of text with URLs and separate img URLs from hrefs and swfs and... wow, this could get tricky. LOL

    I can do it, I just need a nudge in the right direction. Is there a good script anyone can reference? Any help will be greatly appreciated.


    Thank you, InterNets, your patience with me is admirable.

  • #2
    Well, the real problem comes in when users omit both the http:// protocol and the "www." prefix. And, of course, some sites don't *use* "www.", so then it's not even the user's fault: msdn.microsoft.com

    Now, it's probably not too hard to find all URLs that end in specific file types. That is, even something like
    msdn.microsoft.com/some/directory/someimage.jpg
    should be findable.

    But when it's just the bare site name or a bare directory in the site...I don't think there is a full and general solution.

    You might *guess* based on the site type: .com, .edu, .org, etc. But what about things like rtfm.atrax.co.uk ?? Yeah, that's a legit site. Granted, you might recognize ".co.uk" as being equivalent to ".com", but how many variations on that are there??

    So... 100% coverage? I don't think so. 90% or so coverage? Probably.

    Is that good enough?
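    For what it's worth, here's a rough sketch of what that kind of guessing could look like in JavaScript. Totally untested; the name linkify, the short suffix whitelist, and the image-extension test are all just placeholders, and it will still miss the oddballs unless you keep adding to the list (".co.uk" is in there, but dozens of other country suffixes aren't):

    Code:
    function linkify(text) {
        // Match http(s)://..., www...., or bare hosts ending in a whitelisted suffix.
        var urlPattern = /\b((?:https?:\/\/|www\.)[^\s<]+|[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|org|net|edu|gov|co\.uk)(?:\/[^\s<]*)?)/gi;
        return text.replace(urlPattern, function (match) {
            // Prepend a protocol when the user omitted it, so the href actually works.
            var href = /^https?:\/\//i.test(match) ? match : "http://" + match;
            // Treat common image extensions as <img>, everything else as a plain link.
            if (/\.(jpe?g|gif|png)$/i.test(match)) {
                return '<img src="' + href + '" alt="">';
            }
            return '<a href="' + href + '">' + match + '</a>';
        });
    }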



    • #3
      By the way, did you notice that this site did *NOT* recognize ANY of those examples I gave as URLs??

      Yet it would handle
      www.atrax.co.uk

      All because of the magic "www."

      Slightly lame.



      • #4
        I was thinking that if the string had a .something extension I could make it a link, test it to see if there's a page there (somehow, LOL, another problem), and if there's no page at the other end the links get edited later on by zapping the nodes. Pretty simple, no?



        • #5
          A quick google turned up this; I haven't even tested it yet.
          <img src="http://www.somehost.com/myimage.jpg" onerror="alert('Image missing')">
          Code:
          <script type="text/javascript"><!--
          function testImage(URL) {
              var tester = new Image();
              tester.onload = isGood;    // note: lowercase onload, not onLoad
              tester.onerror = isBad;    // lowercase onerror, not onError
              tester.src = URL;          // set src last, after the handlers are attached
          }

          function isGood() {
              alert('That image exists!');
          }

          function isBad() {
              alert('That image does not exist!');
          }
          //--></script>



          • #6
            Works for images, but what about (say) .php or .asp pages??

            I don't think you need to test for a valid URL most of the time. That is, if you see "http://" or "www." or "xxx.yyy.com" (or .org, .net, .edu, etc.), I think you can just assume it's okay. Only on the ones which are really suspicious would I go to the trouble of validating the URL. You could use XMLHTTP (or its equivalent) to do this, you know. Just hit the URL and see if you get back a 200 status. You don't really need/want to read the content (and if it's a ".pdf" file you wouldn't want to, for example). You just want to know that you aren't getting a 404 error or something of that ilk.
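            Something along these lines is what I mean (a sketch only; the name urlLooksAlive is made up, and from the browser this only works for same-origin URLs because of cross-domain restrictions):

            Code:
            function urlLooksAlive(url, callback) {
                var xhr = new XMLHttpRequest();
                xhr.open("HEAD", url, true);           // HEAD: we never download the body
                xhr.onreadystatechange = function () {
                    if (xhr.readyState === 4) {        // request finished
                        callback(xhr.status === 200);  // true for 200 OK, false for 404 etc.
                    }
                };
                xhr.send(null);
            }

            // Usage: only wrap the text in a link if the check passes.
            urlLooksAlive("/some/page.php", function (ok) {
                if (ok) { /* make it an <a> */ } else { /* leave it plain text */ }
            });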

            Or do the equivalent server side. Dunno what server language you prefer, but it's pretty easy to do this in ASP or JSP or ASP.NET and I assume it would be similarly easy in PHP.



            • #7
              Originally posted by Old Pedant
              Works for images, but what about (say) .php or .asp pages?? ... You could use XMLHTTP (or its equivalent) to do this, you know. Just hit the URL and see if you get back a 200 status.
              Yikes!! Good points! You're a huge help. I think I really just want to create a pron filter, and a naughty-words one too, which you already taught me last week. So, if I can check for images, I can then check the URL string against known pron sites.

              I might need help making a function to record those sites and reference them in an easy fashion .... for purposes of blocking, of course.
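              Maybe something as simple as this would do for starters (just guessing; the names blockedHosts and isBlocked and the example entries are made up):

              Code:
              var blockedHosts = ["badsite.example", "another-bad.example"];  // placeholder entries

              function isBlocked(url) {
                  // Pull out the host: strip the protocol, then cut at the first "/" or ":".
                  var host = url.replace(/^https?:\/\//i, "").split(/[\/:]/)[0].toLowerCase();
                  for (var i = 0; i < blockedHosts.length; i++) {
                      var banned = blockedHosts[i];
                      // Match the banned host itself or any subdomain of it.
                      if (host === banned || host.slice(-(banned.length + 1)) === "." + banned) {
                          return true;
                      }
                  }
                  return false;
              }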

