regexp to ignore html tags - CodingForum

  • regexp to ignore html tags

    I have a page that displays the first 150 characters from a database field and then '.....[more]'. This information from the database includes HTML tags. The trouble is that the 150-character cut-off sometimes falls mid-tag, so I am left with something like -


    <opening tag>lots of text, 150 characters to be exact<closing ta

    which then throws off the rest of my page. I thought the best way around this was to use a regexp that does 'get the first 150 characters, ignoring everything between < and >'.

    Is this the best way? If so, how do I do it?!!
    Monkey

    My head hurts!

  • #2
    Can you post an example of the file (or contents of the database field as a string) that you're trying to parse (or send it to me), and a clear explanation of what you DO want to get from it?

    I just did this a couple of times in the last month.
    Former ASP Forum Moderator - I'm back!

    If you can teach yourself how to learn, you can learn anything. ;)


    • #3
      <P>Do you know where your kids are? Do you ask them where they are going and who they are with? Do you tell them what time they should be home?</P>
      <P>Now that the dark evenings are here it is ever more important to be aware of what your kids are up to. Many older residents are intimidated by groups of kids hanging around in the daytime. This is heightened in the hours of darkness. There are many of our older residents who become virtual prisoners in their own homes in the winter, rather than come into contact with our young people.</P>


      This is an example. On my page I want a snippet of the text (the first 150 characters), then a link to read the whole article. Suppose the 150th character in this example is the 'p' from the first closing </p> tag: the page then contains the invalid HTML </p, which causes problems for the rest of the page.

      To see this in use, go to www.nhw-wilts.org.uk/local_pages/general.asp (you have to choose a location), then click on newsletter.
      How do I get a regexp to take the first 150 characters NOT INCLUDING HTML tags in the count?

      cheers
      Last edited by Boxhead; Feb 19, 2004, 04:34 AM.
      Monkey

      My head hurts!


      • #4
        I think you should not have saved html tags in the database in the first place. The db must only contain pure data. Formatting should be done in the browser. If the data only contain <p> tags, you could have saved the data with 2 linebreaks in it.
        Glenn
        vBulletin Mods That Rock!


        • #5
          Not possible. The text is taken from a rich text editor, used in a CMS to enable standard users (with no HTML knowledge) to write their own articles.
          Monkey

          My head hurts!


          • #6
            Perhaps using Server.HTMLEncode() on the 150-character snippet will be sufficient to solve the problem for the context in which you need it.

            Otherwise, you should pass the entire text through a RegEx replace to remove the HTML (with a pattern like "(<[^>]*?>)"), and THEN grab the first 150 characters.




            EDIT - for some reason, vB keeps adding a space between the ">" tag and the ")" bracket... remove it from the pattern if/when you copy it.
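
            A minimal sketch of the strip-then-truncate approach described above. It assumes the script is run under Windows Script Host (cscript), so that WScript.Echo is available; in an ASP page Response.Write would take its place. StripTags and Teaser are hypothetical helper names, and the sample text is a shortened version of the article quoted earlier in the thread.

            ```vbscript
            ' Hypothetical helper: remove anything that looks like an HTML tag.
            Function StripTags(ByVal html)
                Dim re
                Set re = New RegExp
                re.Pattern = "<[^>]*>"
                re.Global = True          ' replace every tag, not just the first one
                StripTags = re.Replace(html, "")
            End Function

            ' Hypothetical helper: plain-text teaser of at most maxLen characters.
            Function Teaser(ByVal html, ByVal maxLen)
                Dim plain
                plain = StripTags(html)
                If Len(plain) > maxLen Then
                    Teaser = Left(plain, maxLen) & ".....[more]"
                Else
                    Teaser = plain
                End If
            End Function

            ' Prints: Do you know where yo.....[more]
            WScript.Echo Teaser("<P>Do you know where your kids are?</P><P>Now that the dark evenings are here...</P>", 20)
            ```

            Because the tags are removed before the cut, the teaser can never end in the middle of a tag. The trade-off is that the snippet loses its formatting, and character entities such as &nbsp; are left untouched by this sketch.
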
            Last edited by [email protected]; Feb 19, 2004, 09:38 AM.
            Marcus Tucker / www / blog
            Web Analyst Programmer / Voted SPF "ASP Guru"


            • #7
              I have -

              fullString = (Recordset_art.Fields.Item("article").Value)
              Set objRegExpr = New regexp
              objRegExpr.Pattern = "<[^>]*?>"
              Set showString = objRegExpr.Execute(fullString)

              If I do

              showString=showString(0)
              Response.Write(showString)

              I get <p> - but if I change showString(0) to showString(1), I get an error.

              How do I get this regexp to find ALL the HTML tags, not just the first, and how do I then remove them from fullString?

              Monkey
              Monkey

              My head hurts!


              • #8
                I said a RegEx *replace*!!!
                Code:
                showString = objRegExpr.Replace(fullString, "")


                EDIT: I've also noticed that for some reason you've removed the brackets that I had put inside the pattern... put them back! And add a line to set .Global to True!
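
                The effect of .Global can be seen in a small sketch, again assuming Windows Script Host for WScript.Echo (use Response.Write in ASP). The sample string is made up for illustration.

                ```vbscript
                Dim re, sample
                sample = "<P>one</P><P>two</P>"

                Set re = New RegExp
                re.Pattern = "<[^>]*>"

                ' Global is False by default: only the first match is found or replaced.
                re.Global = False
                WScript.Echo re.Execute(sample).Count     ' 1
                WScript.Echo re.Replace(sample, "")       ' one</P><P>two</P>

                ' With Global set to True, every tag is found and replaced.
                re.Global = True
                WScript.Echo re.Execute(sample).Count     ' 4
                WScript.Echo re.Replace(sample, "")       ' onetwo
                ```

                With Global left at False the Matches collection holds a single item, which would explain why indexing it with (1) raises an error.
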
                Last edited by [email protected]; Feb 20, 2004, 09:53 AM.
                Marcus Tucker / www / blog
                Web Analyst Programmer / Voted SPF "ASP Guru"


                • #9
                  phew! these reg exp are a bit tricky

                  Got that working, but it only removes the first tag. Do I need to loop through the whole string or something?!

                  (I'm sure if I ever met you and Glenngv, you would both give me a right pasting - and I deserve it )

                  cheers!
                  Monkey

                  My head hurts!


                  • #10
                    I'm going to give you a right pasting if you keep failing to read my posts!!! Read my previous post again!!! (hint: the solution's in the EDIT bit)


                    Marcus Tucker / www / blog
                    Web Analyst Programmer / Voted SPF "ASP Guru"


                    • #11
                      WOW!!

                      Firstly, don't worry, I gave myself a good pasting for you!!

                      After further investigation I have found some stuff.

                      1. The reason the brackets were removed is that I rewrote your regexp without looking at yours (a method I use to ensure I understand what people have told me and to develop my skills - works very well), but I couldn't see what the brackets were for. It works fine without them, so why do you include them?

                      2. I have found this regexp to do the same thing

                      .Pattern = "<\S[^>]*>"
                      (I gather this is a "same s**t different day" type scenario?!)

                      3. I have some confusion about the * and the ? in your regexp. Why do you use both? (Again, I have tried it without the ? and it still works fine.)

                      Not challenging you, just getting my head around this very complex topic.

                      cheers
                      Monkey

                      My head hurts!


                      • #12
                        Originally posted by Boxhead
                        Not challenging you, just getting my head around this very complex topic.
                        I am always open to being challenged! (Fisticuffs, or pistols at dawn?!)

                        Originally posted by Boxhead
                        1. The reason the brackets were removed is that I rewrote your regexp without looking at yours (a method I use to ensure I understand what people have told me and to develop my skills - works very well), but I couldn't see what the brackets were for. It works fine without them, so why do you include them?
                        Glad to hear you don't just copy code verbatim, if only there were more like you/us... And you're quite right, the brackets aren't needed. I quickly whipped it up, tested it on my patented reg-o-matic script (!) and posted it without giving it a final once-over. I've been in Italy for the last few weeks (and still am - this is being posted from my hotel room in Venice) so I haven't been able to spend much time on the forums, and have been rushing when I have had the time. However, that's no excuse, no brackets needed!

                        Originally posted by Boxhead
                        2. I have found this regexp to do the same thing

                        .Pattern = "<\S[^>]*>"
                        It does NOT behave in the same way... think about it... I'll be cruel... try a test string of "<>test</a>"...
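
                        A quick sketch of that test, assuming Windows Script Host for WScript.Echo (use Response.Write in ASP). StripWith is a hypothetical helper name; the square brackets are only there to make an empty result visible.

                        ```vbscript
                        ' Hypothetical helper: remove every match of the given pattern from subject.
                        Function StripWith(ByVal pattern, ByVal subject)
                            Dim re
                            Set re = New RegExp
                            re.Pattern = pattern
                            re.Global = True
                            StripWith = re.Replace(subject, "")
                        End Function

                        ' "<>" and "</a>" are matched separately, so the text survives.
                        WScript.Echo "[" & StripWith("<[^>]*?>", "<>test</a>") & "]"    ' [test]

                        ' \S consumes the first ">", so the match runs on to the final ">".
                        WScript.Echo "[" & StripWith("<\S[^>]*>", "<>test</a>") & "]"   ' []
                        ```

                        In the second call the whole string is treated as one tag, so the word "test" is removed along with the markup.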

                        Originally posted by Boxhead
                        3. I have some confusion about the * and the ? in your regexp. Why do you use both? (Again, I have tried it without the ? and it still works fine.)
                        In this particular case it makes no difference to what gets matched, because [^>] can never run past the closing ">". In general, though, greedy (*) and non-greedy (*?) do fundamentally different things and may give different results in many situations, despite *appearing* to do the same thing in most cases, so pick whichever one says what you actually mean rather than treating them as interchangeable.

                        It's a question of using the best tool for the job - read up on the difference if you don't know...

                        http://regexblogs.com/dneimke/archiv...01/05/274.aspx
                        http://www.itworld.com/nl/perl/01112001/
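
                        A minimal sketch of the greedy versus non-greedy difference, assuming Windows Script Host for WScript.Echo (use Response.Write in ASP). FirstMatch is a hypothetical helper name and the sample string is made up.

                        ```vbscript
                        ' Hypothetical helper: return the text of the first match, or "" if there is none.
                        Function FirstMatch(ByVal pattern, ByVal subject)
                            Dim re, found
                            Set re = New RegExp
                            re.Pattern = pattern
                            Set found = re.Execute(subject)
                            If found.Count > 0 Then
                                FirstMatch = found(0).Value
                            Else
                                FirstMatch = ""
                            End If
                        End Function

                        Dim sample
                        sample = "<b>bold</b> text"

                        ' With ".", greedy and non-greedy give different results.
                        WScript.Echo FirstMatch("<.*>", sample)       ' <b>bold</b>
                        WScript.Echo FirstMatch("<.*?>", sample)      ' <b>

                        ' With "[^>]", the class itself stops at ">", so both give the same result.
                        WScript.Echo FirstMatch("<[^>]*>", sample)    ' <b>
                        WScript.Echo FirstMatch("<[^>]*?>", sample)   ' <b>
                        ```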

                        Last edited by [email protected]; Feb 20, 2004, 08:27 PM.
                        Marcus Tucker / www / blog
                        Web Analyst Programmer / Voted SPF "ASP Guru"


                        • #13
                          Every time I think I'm getting my head around this - it falls off!!

                          I think I understand the greedy thing and I think I understand the difference between the two scripts:

                          "<[^>]*?>"

                          this looks for <
                          Then followed by any character that isn't > ONCE
                          then the final >

                          so it will match

                          <hghg>
                          <>

                          but not

                          < or <>> (well, only the <> of the second example!)

                          The other version:

                          "<\S[^>]*>"

                          This matches < followed by 1 to any number of characters that aren't >

                          So

                          <> means it doesn't match, as it finds < but the second character IS >, so it fails to find it.


                          I'm sure there are some things wrong in here!! Am I on the right track?

                          PS it's 40 degrees here and naked models have taken over the government, but I'm sure you're still having a lovely time over in Italy!!
                          Monkey

                          My head hurts!


                          • #14
                            Originally posted by Boxhead
                            I think I understand the greedy thing and I think I understand the difference between the two scripts:

                            "<[^>]*?>"

                            this looks for <
                            Then followed by any character that isn't > ONCE
                            then the final >
                            Not quite, it matches the opening "<" then zero or more characters which are not ">", then a ">" character.

                            Originally posted by Boxhead
                            The other version:

                            "<\S[^>]*>"

                            This matches < followed by 1 to any number of characters that aren't >
                            Again, close, but no cigar. It matches the opening "<" character, then a single non whitespace character (so if there WAS a whitespace character in the test string at that position then the pattern match would fail), then zero or more characters which are not ">", then finally a ">" character.
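
                            The two descriptions above can be checked with a small sketch, assuming Windows Script Host for WScript.Echo (use Response.Write in ASP). Describe is a hypothetical helper name and the test strings are made up.

                            ```vbscript
                            ' Hypothetical helper: report whether pattern matches anywhere in subject.
                            Function Describe(ByVal pattern, ByVal subject)
                                Dim re
                                Set re = New RegExp
                                re.Pattern = pattern
                                If re.Test(subject) Then
                                    Describe = "match"
                                Else
                                    Describe = "no match"
                                End If
                            End Function

                            ' Zero characters between "<" and ">" is fine for the first pattern.
                            WScript.Echo Describe("<[^>]*?>", "<>")       ' match
                            ' \S consumes the ">", leaving nothing for the final ">".
                            WScript.Echo Describe("<\S[^>]*>", "<>")      ' no match

                            ' A space directly after "<" is accepted by [^>] but rejected by \S.
                            WScript.Echo Describe("<[^>]*?>", "< p>")     ' match
                            WScript.Echo Describe("<\S[^>]*>", "< p>")    ' no match

                            ' An ordinary tag matches both patterns.
                            WScript.Echo Describe("<\S[^>]*>", "<p>")     ' match
                            ```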

                            Originally posted by Boxhead
                            PS it's 40 degrees here and naked models have taken over the government, but I'm sure you're still having a lovely time over in Italy!!
                            If only...!!

                            Sorry I can't elaborate, I'm running late for dinner... ciao!
                            Last edited by [email protected]; Feb 23, 2004, 12:44 PM.
                            Marcus Tucker / www / blog
                            Web Analyst Programmer / Voted SPF "ASP Guru"


                            • #15
                              Wow, do they still actually say "fisticuffs" in England?
                              Former ASP Forum Moderator - I'm back!

                              If you can teach yourself how to learn, you can learn anything. ;)
